- Feature Name: Time Series Culling
- Status: in progress
- Start Date: 2016-08-29
- Authors: Matt Tracy
- RFC PR: #9343
- Cockroach Issue: #5910

# Summary

Currently, time series data recorded by CockroachDB for its own internal
metrics is retained indefinitely. High-resolution metrics data quickly loses
utility as it ages, consuming disk space and creating range-related overhead
without conferring an appropriate benefit.

The simplest way to deal with this would be to build a system that deletes
time series data older than a certain threshold; however, this RFC suggests a
mechanism for "rolling up" old time series into a lower resolution that is
still retained. This will allow us to keep some metrics information
indefinitely, which can be used for historical performance evaluation, without
needing to keep an unacceptably expensive amount of information.

Fully realizing this solution has three components:

1. A distributed "culling" algorithm that occasionally searches for
high-resolution time series data older than a certain threshold and runs a
"roll-up" process on the discovered keys.
2. A "roll-up" process that computes low-resolution time series data from the
existing data in a high-resolution time series key, deleting the high-resolution
key in the process.
3. Modifications to the query system to utilize underlying data which is stored
at multiple resolutions (it currently supports only a single resolution). This
includes the use of data at different resolutions to serve a single query.

# Motivation

In our test clusters, time series create a very large amount of data (on the
order of several gigabytes per week) which quickly loses utility as it ages.

To estimate how much data this is, we first observe the data usage of a single
time series. A single time series stores data as contiguous samples representing
ten-second intervals; all samples for a wall-clock hour are stored in a single
key. In the engine, the keys look like this:

| Key                                                       | Key Size | Value Size |
|-----------------------------------------------------------|----------|------------|
| /System/tsd/cr.store.replicas/1/10s/2016-09-26T17:00:00Z  | 30       | 5670       |
| /System/tsd/cr.store.replicas/1/10s/2016-09-26T18:00:00Z  | 30       | 5535       |
| /System/tsd/cr.store.replicas/1/10s/2016-09-26T19:00:00Z  | 30       | 5046       |

The above is the data stored for one time series over three complete hours.
Notice the variation in the size of the values; this is because samples may be
absent for some ten-second periods, due to the asynchronous nature of this
system. For our purposes, we will estimate the size of a single hour of data
for a single time series to be *5500* bytes (5.5 KB).

The total disk usage of high-resolution data on the cluster can thus be
estimated with the following formula:

`Total bytes = [bytes per time series hour] * [# of time series per node] * [# of nodes] * [# of hours]`

Data therefore accumulates over time, and as more nodes are added (or if later
versions of CockroachDB add additional time series), the rate at which new time
series data accumulates increases linearly. As of this writing, each single-node
store records **242** time series. Thus, the bytes needed per hour on a ten-node
cluster are:

`Total Bytes (hour) = 5500 * 242 * 10 = 13310000 (12.69 MiB)`

After just one week:

`Total Bytes (week) = 12.69 MiB * 168 hours = 2.08 GiB`

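For illustration, the estimate above can be expressed directly in code. This is
a minimal sketch and not part of the proposal; the constant and function names
are hypothetical:

```go
package main

import "fmt"

// Rough storage estimate for high-resolution time series data, following the
// formula above. All names and constants here are illustrative only.
const (
	bytesPerSeriesHour = 5500 // estimated value size of one series-hour slab
	seriesPerNode      = 242  // time series recorded by a single-node store
)

// estimateBytes returns the estimated total bytes of high-resolution time
// series data for a cluster of the given size over the given number of hours.
func estimateBytes(nodes, hours int) int64 {
	return int64(bytesPerSeriesHour) * int64(seriesPerNode) * int64(nodes) * int64(hours)
}

func main() {
	const mib = 1 << 20
	hour := estimateBytes(10, 1)
	week := estimateBytes(10, 24*7)
	fmt.Printf("per hour: %d bytes (%.2f MiB)\n", hour, float64(hour)/mib)
	fmt.Printf("per week: %.2f GiB\n", float64(week)/(1<<30))
}
```
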
As time passes, this data can represent a large share (or in the case of idle
clusters, the majority) of in-use data on the cluster. This data will also
continue to build indefinitely; a static CockroachDB cluster will eventually
consume all available disk space, even if no external data is written! With just
the current time series, a ten-node cluster will generate over a hundred
gigabytes of metrics data (before replication) over a single year.

The prompt culling of old data is thus a clear area of improvement for
CockroachDB. However, rather than simply deleting data older than a threshold,
this RFC proposes a solution which efficiently keeps metrics data for a longer
time span by downsampling it to a much lower resolution on disk.

To give some concrete numbers: currently, all metrics on disk are stored in a
format which is downsampled to _ten second sample periods_; this is the
"high-resolution" data. We are looking to delete this data when it is older
than a certain threshold, which will likely be set in the range of _2-4 weeks_.
We also propose that, when this data is deleted, it is first downsampled further
into _one hour sample periods_; this is the "low-resolution" data. This data
will be kept for a much longer time, likely _6-12 months_, but perhaps longer.

At the lower resolution, each datapoint represents the same data as an _entire
slab_ of high-resolution data (at the ten-second resolution, data is stored in
slabs corresponding to a wall-clock hour; each slab contains up to 360 samples).
Thus, the expected storage footprint of the low-resolution data is approximately
_180x smaller_ than that of the high-resolution data (not 360x, because each
low-resolution sample includes "min" and "max" values that are not present in
the high-resolution samples, which contain only "sum" and "count" fields).

By keeping data at the low resolution, users will still be able to inspect
cluster performance over larger time scales, without requiring the storage of
an excessive amount of metrics data.

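To make the two on-disk shapes concrete, the following sketch shows the fields
involved. The struct names and layout are illustrative only and do not
necessarily match the exact protobuf definitions used by the `ts` package:

```go
package tssketch

// A high-resolution sample covers one ten-second period within an hour-long
// slab; per the description above, it carries only a sum and a count.
type highResSample struct {
	Offset int32   // ten-second period within the slab's hour
	Sum    float64 // sum of recorded values in the period
	Count  uint32  // number of recorded values in the period
}

// A low-resolution sample summarizes an entire hour (one high-resolution slab)
// and additionally keeps min and max so extremes are not lost by downsampling.
type lowResSample struct {
	Offset int32 // hour within the low-resolution slab
	Sum    float64
	Count  uint32
	Min    float64
	Max    float64
}
```
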
# Detailed design

## Culling algorithm

The culling algorithm is responsible for identifying high-resolution time series
keys that are older than a system-set threshold. Once identified, the keys are
passed into the rollup/delete process.

There are two primary design requirements of the culling algorithm:

1. From a single node, efficiently locating time series keys which need to be
culled.
2. Across the cluster, efficiently distributing the task of culling with minimal
coordination between nodes.

#### Locating Time Series Keys

Locating time series keys to be culled is not completely trivial due to the
construction of time series keys, which is as follows:

`[ts prefix][series name][timestamp][source]`

> Example: "ts/cr.node.sql.inserts/1473739200/1" would contain time series data
> for "cr.node.sql.inserts" on September 13th 2016 between 4am-5am UTC,
> specifically for node 1.

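As a rough illustration of this layout, the sketch below builds and splits such
keys. The helper names and the string-based encoding are hypothetical; the real
keys use a binary ordered encoding rather than plain strings:

```go
package tssketch

import (
	"fmt"
	"strings"
)

const tsPrefix = "ts/" // hypothetical stand-in for the real binary prefix

// makeDataKey assembles a time series key in the order described above:
// prefix, series name, slab timestamp (start of the hour), then source.
func makeDataKey(name string, hourStart int64, source string) string {
	return fmt.Sprintf("%s%s/%d/%s", tsPrefix, name, hourStart, source)
}

// decodeDataKey splits a key back into its components. This sketch assumes the
// series name itself contains no '/' separator.
func decodeDataKey(key string) (name, hourStart, source string, err error) {
	parts := strings.Split(strings.TrimPrefix(key, tsPrefix), "/")
	if len(parts) < 3 {
		return "", "", "", fmt.Errorf("not a time series key: %q", key)
	}
	return parts[0], parts[1], parts[2], nil
}
```
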
Because of this construction, which prioritizes name over timestamp, the most
recent time series data for series "A" would sort *before* the oldest time
series data for series "B". This means that we cannot simply cull the beginning
of the time series range.

The simplest alternative would be to scan the entire time series range looking
for keys older than the threshold; however, this is considered to be a
burdensome scan due to the number of keys that are *not* culled. For a per-node
time series being recorded on a ten-node cluster with a two-week retention
period, we would expect to retain (10 x 24 x 14) = *3360* keys that should not
be culled. In a system that maintains dozens, possibly hundreds of time series,
this is a lot of data for each node to scan on a regular basis.

However, this scan can be effectively distributed across the cluster by creating
a new *replica queue* which searches for time series keys. The new queue can
quickly determine if each range contains time series keys (by inspecting
start/end keys); for ranges that do contain time series keys, specific keys
can then be inspected at the engine level. This means that key inspections do
not require network calls, and the number of keys that can be inspected at once
is limited to the size of a range.

Once the queue discovers a range that contains time series keys, the scanning
process does not need to inspect every key on the range. The algorithm is as
follows:

1. Find the first time series key in the range (scan for `[ts prefix]`).
2. Deconstruct the key to retrieve its series name.
3. Run the rollup/delete operation on all keys in the range
`[ts prefix][series name][0] - [ts prefix][series name][now - threshold]`.
4. Find the next key on the range which contains data for a different time
series by searching for the key `PrefixEnd([ts prefix][series name])`.
5. If a key was found in step 4, return to step 2 with that series name.

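A minimal sketch of this per-range loop is shown below. It is a self-contained
simulation over a sorted slice of string keys; the real implementation would use
an engine iterator over binary keys, and the helper names here are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Illustrative key layout: "ts/<series name>/<hour timestamp>/<source>".
const tsPrefix = "ts/"

// prefixEnd returns the first string that sorts after every string with the
// given prefix (the string analogue of a key's PrefixEnd).
func prefixEnd(prefix string) string {
	b := []byte(prefix)
	b[len(b)-1]++
	return string(b)
}

// cullRange visits each distinct series in the sorted key slice and reports
// the span of keys that would be rolled up and deleted.
func cullRange(keys []string, threshold int64) {
	// Step 1: find the first time series key.
	i := sort.SearchStrings(keys, tsPrefix)
	for i < len(keys) && strings.HasPrefix(keys[i], tsPrefix) {
		// Step 2: recover the series name from the key.
		name := strings.SplitN(strings.TrimPrefix(keys[i], tsPrefix), "/", 2)[0]
		seriesPrefix := tsPrefix + name + "/"

		// Step 3: roll up / delete [series/0, series/<now - threshold>).
		fmt.Printf("rollup+delete %s[0, %d)\n", seriesPrefix, threshold)

		// Steps 4-5: skip directly to the next series via PrefixEnd, avoiding
		// a scan over the retained keys of the current series.
		i = sort.SearchStrings(keys, prefixEnd(seriesPrefix))
	}
}

func main() {
	keys := []string{
		"ts/cr.node.sql.inserts/1473739200/1",
		"ts/cr.node.sql.inserts/1473742800/1",
		"ts/cr.store.replicas/1473739200/1",
	}
	cullRange(keys, 1473000000)
}
```
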
This algorithm will avoid scanning keys that do not need to be rolled up; this
is desirable, as once the culling algorithm is in place and has run once, the
majority of time series keys will *not* need to be culled.

The queue will be configured to run only on the range leader for a given range
in order to avoid duplicate work; however, this is *not* necessary for
correctness, as demonstrated in the [Rollup algorithm](#rollup-algorithm)
section below.

The queue will initially be set to process replicas at the same rate as the
replica GC queue (as of this RFC, one range per 50 milliseconds).

##### Package Dependency

There is one particular complication to this method: *Go package dependency*.
Knowledge of how to identify and cull time series keys is contained in the `ts`
package, but all logic for replica queues (and all current queues) lives in
`storage`, meaning that one of three things must happen:

+ `storage` can depend on `ts`. This seems to be trivially possible now, but may
be unintuitive to those trying to understand our code base. For reference, the
`storage` package used to depend on the `sql` package in order to record event
logs, but this eventually became an impediment to new development and had to be
modified.
+ The queue logic could be implemented in `ts`, and `storage` could implement
an interface that allows it to use the `ts` code without a dependency.
+ Parts of the `ts` package could be split off into another package that can
intuitively live below `storage`. However, this is likely to be a considerable
portion of `ts` in order to properly implement rollups.

Tentatively, we will attempt to use the first method and have `storage`
depend on `ts`; if it is indeed trivially possible, this will be the fastest
method of completing this project.

#### Culling Low-Resolution Data

Although the volume is much lower, low-resolution data will still build up
indefinitely unless it is culled. This data will also be culled by the same
algorithm outlined here; however, it will not be rolled up further, but will
simply be deleted.

## Rollup algorithm

The rollup algorithm is intended to be run on a single high-resolution key
identified by the culling algorithm. The algorithm is as follows:

1. Read the data in the key. Each key represents a "slab" of high-resolution
samples captured over a wall-clock hour (up to 360 samples per hour).
2. "Downsample" all of the data in the key into a single sample; the new sample
will have a sum, count, min, and max computed from the samples in the original
key.
3. Write the computed sample as a low-resolution data point into the time series
system; this is exactly the same process used for currently recorded time
series, except that it writes to a different key space (with a different key
prefix).
4. Delete the original high-resolution key.

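A minimal sketch of the downsampling step (step 2) follows, using the
illustrative sample shapes from the Motivation section. Because high-resolution
samples retain only a sum and a count, this sketch assumes min and max are taken
over the per-period averages; that definition is an assumption of the sketch,
not something fixed by this RFC.

```go
package main

import "fmt"

// Illustrative sample shapes; see the sketch in the Motivation section.
type highResSample struct {
	Sum   float64
	Count uint32
}

type lowResSample struct {
	Sum   float64
	Count uint32
	Min   float64
	Max   float64
}

// rollup collapses one hour-long slab of high-resolution samples into a single
// low-resolution sample. Min and Max are computed over the per-period averages
// (assumption of this sketch); samples with a zero count are not expected.
func rollup(slab []highResSample) lowResSample {
	out := lowResSample{}
	for i, s := range slab {
		out.Sum += s.Sum
		out.Count += s.Count
		avg := s.Sum / float64(s.Count)
		if i == 0 || avg < out.Min {
			out.Min = avg
		}
		if i == 0 || avg > out.Max {
			out.Max = avg
		}
	}
	return out
}

func main() {
	slab := []highResSample{{Sum: 10, Count: 2}, {Sum: 3, Count: 1}, {Sum: 40, Count: 4}}
	fmt.Printf("%+v\n", rollup(slab))
}
```
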
This algorithm is safe to use even in the case where the same key is being
culled by multiple nodes at the same time; this is because steps 3 and 4 are
currently *idempotent*. The low-resolution sample generated by each node will be
identical, and the engine-level time series merging system currently discards
duplicate samples. The deletion of the high-resolution key may cause an error on
some of the nodes, but only because the key will have already been deleted.

The end result is that the culled high-resolution key is gone, but a single
sample (representing the entire hour) has been written into a low-resolution
time series with the same name and source.

## Querying Across Culling Boundary

The final component of this work is to allow querying across the culling
boundary; that is, if an incoming time series query wants data from both sides
of the culling boundary, it will have to process data from two different
resolutions.

There are no broad design decisions to make here; this is simply a matter
of modifying low-level iterators and querying slightly different data. This
component will likely be the most complicated to actually *write*, but it should
be somewhat easier to *test* than the above algorithms, as there is already
an existing test infrastructure for time series queries.

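For illustration, one way to think about the split: everything older than the
culling boundary is served from the low-resolution key space, and the remainder
from the high-resolution key space. This is only a sketch of the idea; the names
and types (including `cullBoundary`) are hypothetical:

```go
package main

import "fmt"

// span is a half-open time interval [start, end) in nanoseconds.
type span struct{ start, end int64 }

// splitQuery partitions a requested time span at the culling boundary. Data
// before the boundary is read from the low-resolution (one-hour) keys; data at
// or after it is read from the high-resolution (ten-second) keys.
func splitQuery(q span, cullBoundary int64) (lowRes, highRes []span) {
	if q.end <= cullBoundary {
		return []span{q}, nil
	}
	if q.start >= cullBoundary {
		return nil, []span{q}
	}
	return []span{{q.start, cullBoundary}}, []span{{cullBoundary, q.end}}
}

func main() {
	low, high := splitQuery(span{100, 500}, 300)
	fmt.Println(low, high) // [{100 300}] [{300 500}]
}
```
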
## Implementation

This system can (and should) be implemented in three distinct phases:

1. The "culling" algorithm will be implemented, but will not roll up the data in
discovered keys; instead, it will simply *delete* the discovered time series by
issuing a DeleteRange command. This will provide the immediate benefit of
limiting the growth of time series data on the cluster.

2. The "rollup" algorithm will be implemented, generating low-resolution data
before deleting the high-resolution data. However, the low-resolution data will
not immediately be accessible for queries.

3. The query system will be modified to consider the low-resolution data.

# Drawbacks

+ Culling adds another periodic process that runs on each node, which can
occasionally cause unexpected issues.

+ Depending on the exact layout of time series data across ranges, it is
possible that deleting time series could result in empty ranges. Specifically,
this can occur if a range contains data only for a single time series *and* the
subsequent range also contains data for that same time series. If this is a
common occurrence, it could result in a "trail" of ranges with no data, which
might add overhead to storage algorithms that scale with the number of ranges.

# Alternatives

### Alternative Location Algorithm

As an alternative to the queue-based location algorithm, we could use a system
where each node maintains a list of time series it has written; given the name
of a series, it is easy to construct a scan range which will return all keys
that need to be culled:

`[ts prefix][series name][0] - [ts prefix][series name][(now - threshold)]`

This will return all keys in the series which are older than the threshold. Note
that this includes time series keys generated by any node, not just the current
node; this is acceptable, as the rollup algorithm can be run on any key from
any node.

This process can also be effectively distributed across nodes with the following
algorithm:

+ Each node's time series module maintains a list of time series it is
responsible for culling. This is initialized to a list of "retired" time series,
and is augmented each time the node writes a time series it has not written
before (in the currently running instance).
+ The time series module maintains a random permutation of this list; the
permutation is re-randomized each time a new time series is added. This should
stabilize very quickly, as new time series are not currently added while a node
is running.
+ Each node will periodically attempt to cull data for a single time series;
this starts with the first name in the current permutation, and proceeds through
it in a loop.

In this way, each node eventually attempts to cull all time series (guaranteeing
that each is culled), but the individual nodes proceed through the series in a
random order; this helps to distribute the work across nodes and reduces the
chance of duplicate work. The total speed of work can be tuned by adjusting the
frequency of the per-node culling process.
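
A minimal sketch of the per-node rotation described above; `cullSeries` is a
hypothetical placeholder for the scan-and-rollup work, and all names here are
illustrative:

```go
package main

import (
	"fmt"
	"math/rand"
)

// culler tracks the series a node is responsible for and walks through them in
// a randomized order, one series per tick.
type culler struct {
	perm []string // random permutation of known series names
	next int      // index of the series to cull on the next tick
}

// addSeries registers a newly written series name and re-randomizes the
// permutation, as described above.
func (c *culler) addSeries(name string) {
	c.perm = append(c.perm, name)
	rand.Shuffle(len(c.perm), func(i, j int) { c.perm[i], c.perm[j] = c.perm[j], c.perm[i] })
	c.next = 0
}

// tick culls the next series in the permutation, wrapping around in a loop.
func (c *culler) tick() {
	if len(c.perm) == 0 {
		return
	}
	cullSeries(c.perm[c.next])
	c.next = (c.next + 1) % len(c.perm)
}

// cullSeries is a placeholder for scanning and rolling up one series.
func cullSeries(name string) { fmt.Println("culling", name) }

func main() {
	c := &culler{}
	for _, s := range []string{"cr.node.sql.inserts", "cr.store.replicas"} {
		c.addSeries(s)
	}
	for i := 0; i < 3; i++ {
		c.tick()
	}
}
```
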
This alternative was rejected due to a complication that occurs when a time
series is "retired": we only know about a time series name if the currently
running process has recorded it, so if a time series is removed from the system,
its data will never be culled. Thus, we must also maintain a list of *retired*
time series names in the event that any are removed. This requires some manual
effort on the part of developers; the consequences of failing to do so are not
especially severe (a limited amount of old data will persist on the cluster),
but this is still considered inferior to the queue-based solution.

### Immediate Rollups

This was the original intention of the time series system: when a
high-resolution data sample is recorded, it is directly merged into both the
high-resolution AND the low-resolution time series. The engine-level time series
merging system would then be responsible for properly aggregating multiple
high-resolution samples into a single composite sample in the low-resolution
series.

The advantage of this method is that it does not require queries to use multiple
resolutions, and it allows for the delete-only culling process to be used.

Unfortunately, it is not currently possible due to recent changes which were
required by the replica consistency checker. The engine-level merge component no
longer aggregates samples; it decimates, retaining only the most recent sample
for each period. This was necessary to deal with the unfortunate reality of raft
command replays.

### Opportunistic Rollups

Instead of rolling up high-resolution data when it is deleted, it could be
rolled up as soon as an entire hour of high-resolution samples has been
collected in a key. That is, at 5:01 it should be appropriate to roll up the
data stored in the 4:00 key. With this alternative, cross-resolution queries can
also be avoided and the delete-only culling method can be used.

However, this introduces additional complications and drawbacks:

+ When querying at low resolution, data from the most recent hour will not be
available, even partially.
+ This requires maintaining additional metadata on the cluster about which
keys have already been rolled up.

# Unresolved questions