release-21.2: sql: use adaptive sampling rate for telemetry logging #70960

xinhaoz · 2021-09-30T21:44:40Z

Backport 1/1 commits from #70786.

/cc @cockroachdb/release

Resolves #70553

Release justification: category 4

Previously, telemetry logging used a configurable QPS threshold
and sampling rate: we would log all statements while under the
QPS threshold, and then start sampling at the given rate once the
threshold was reached. This meant that we would often see a sharp
decrease in telemetry logging once sampling began, as sampling
rates typically need to be low to accommodate a high QPS.
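
To make the drop-off concrete, here is a minimal sketch of the old behavior as described above. The function name, the threshold of 2000 QPS, and the sample rate of 0.001 are hypothetical values chosen for illustration, not the actual defaults or code from the removed implementation.

```go
package main

import "fmt"

// expectedLoggedPerSec models the old scheme as described in the text:
// below the QPS threshold every statement is logged; at or above it,
// only a sampleRate fraction of statements is logged.
// This is an illustrative model, not CockroachDB's implementation.
func expectedLoggedPerSec(qps, qpsThreshold, sampleRate float64) float64 {
	if qps < qpsThreshold {
		return qps
	}
	return qps * sampleRate
}

func main() {
	// Hypothetical settings, chosen only to show the discontinuity.
	const threshold, rate = 2000.0, 0.001
	for _, qps := range []float64{1000, 1999, 2000, 4000} {
		fmt.Printf("qps=%.0f logged/sec=%.0f\n",
			qps, expectedLoggedPerSec(qps, threshold, rate))
	}
}
```

With these assumed values, the logged volume falls from roughly 1999 events per second to about 2 as soon as the threshold is crossed.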

This commit replaces the above technique with an adaptive sampling
rate, which simply caps the frequency at which events are logged to
telemetry. Rather than relying on QPS, we track when we last logged
to the telemetry channel and decide whether to log a given event
accordingly.

Release note (sql change): The cluster settings
sql.telemetry.query_sampling.qps_threshold and
sql.telemetry.query_sampling.sample_rate have been removed.
A new setting, sql.telemetry.query_sampling.max_event_frequency,
has been introduced, with a default value of 10 events per second.

roachprod results

cluster created with:

roachprod create $CLUSTER -n 4 --gce-machine-type=n1-standard-16

results of tpcc 1000

Telemetry OFF

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151075           21.0     53.9     48.2     67.1    109.1   5905.6  delivery

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0        1509788          209.7     42.7     25.2     39.8     83.9   9663.7  newOrder

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151161           21.0      7.8      5.0      7.6     13.6   5905.6  orderStatus

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0        1511404          209.9     22.4     13.6     23.1     46.1   9663.7  payment

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151079           21.0     14.2     10.5     17.8     27.3   7247.8  stockLevel

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 7200.0s        0        3474507          482.6     31.6     19.9     46.1     75.5   9663.7
Audit check 9.2.1.7: PASS
Audit check 9.2.2.5.1: PASS
Audit check 9.2.2.5.2: PASS
Audit check 9.2.2.5.5: PASS
Audit check 9.2.2.5.6: PASS
Audit check 9.2.2.5.3: PASS
Audit check 9.2.2.5.4: PASS

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 7200.0s    12581.5  97.8%     42.7     25.2     33.6     39.8     83.9   9663.7

Telemetry ON

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         150953           21.0     61.9     54.5     71.3     96.5   5637.1  delivery

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0        1509498          209.7     36.8     26.2     39.8     60.8   6174.0  newOrder

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151110           21.0      9.2      5.5      7.9     14.2   4831.8  orderStatus

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0        1510864          209.8     34.0     14.2     23.1     35.7   8321.5  payment

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151026           21.0     17.0     11.5     18.9     29.4   5637.1  stockLevel

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 7200.0s        0        3473451          482.4     34.6     21.0     46.1     67.1   8321.5
Audit check 9.2.1.7: PASS
Audit check 9.2.2.5.1: PASS
Audit check 9.2.2.5.2: PASS
Audit check 9.2.2.5.3: PASS
Audit check 9.2.2.5.4: PASS
Audit check 9.2.2.5.5: PASS
Audit check 9.2.2.5.6: PASS

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 7200.0s    12579.1  97.8%     36.8     26.2     33.6     39.8     60.8   6174.0

blathers-crl · 2021-09-30T21:44:42Z

cockroach-teamcity · 2021-09-30T21:44:47Z

maryliag

Reviewable status: complete! 1 of 0 LGTMs obtained

xinhaoz force-pushed the backport21.2-70786 branch from 041674f to 44761f8 September 30, 2021 21:45

xinhaoz marked this pull request as ready for review September 30, 2021 21:45

maryliag approved these changes Sep 30, 2021

xinhaoz merged commit 57bb904 into cockroachdb:release-21.2 Sep 30, 2021

cockroach-teamcity mentioned this pull request Sep 30, 2021

release-21.2: sql: use adaptive sampling rate for telemetry logging cockroachdb/docs#11850

Closed

tbg mentioned this pull request Nov 22, 2021

kvserver: increase Migrate application timeout to 1 minute #72987

Merged

xinhaoz deleted the backport21.2-70786 branch January 26, 2022 01:46
