Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-21.2: sql: use adaptive sampling rate for telemetry logging #70960

Merged
merged 1 commit into from
Sep 30, 2021

Conversation

xinhaoz
Copy link
Member

@xinhaoz xinhaoz commented Sep 30, 2021

Backport 1/1 commits from #70786.

/cc @cockroachdb/release


Resolves #70553

Release justification: category 4

Previously, telemetry logging used a configurable QPS threshold
and sampling rate, for which we would log all statements if we
were under the QPS threshold, and then start sampling at the
given rate once at the threshold. Using this technique meant
that we will often see a sharp decreaes in telemetry logging
once the sampling rate increases, as sampling rates would typically
need to be at low values to accomodate a high QPS.

This commit replaces the above technique with an adaptive sampling
rate which merely logs events to telemetry at a maximum frequency.
Rather than relying on QPS, we will simply track when we have
last logged to the telemtry channel, and decide whether or not to
log a given event accordingly.

Release note (sql change): The cluster setting
sql.telemetry.query_sampling.qps_threshold, and
sql.telemetry.query_sampling.sample_rate have been removed.
A new setting, sql.telemetry.query_sampling.max_event_frequency
has been introduced, with default value of 10 events per second.


roachrod results

cluster created with:

roachprod create $CLUSTER -n 4 --gce-machine-type=n1-standard-16

results of tpcc 1000

Telemetry OFF

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151075           21.0     53.9     48.2     67.1    109.1   5905.6  delivery

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0        1509788          209.7     42.7     25.2     39.8     83.9   9663.7  newOrder

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151161           21.0      7.8      5.0      7.6     13.6   5905.6  orderStatus

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0        1511404          209.9     22.4     13.6     23.1     46.1   9663.7  payment

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151079           21.0     14.2     10.5     17.8     27.3   7247.8  stockLevel

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 7200.0s        0        3474507          482.6     31.6     19.9     46.1     75.5   9663.7
Audit check 9.2.1.7: PASS
Audit check 9.2.2.5.1: PASS
Audit check 9.2.2.5.2: PASS
Audit check 9.2.2.5.5: PASS
Audit check 9.2.2.5.6: PASS
Audit check 9.2.2.5.3: PASS
Audit check 9.2.2.5.4: PASS

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 7200.0s    12581.5  97.8%     42.7     25.2     33.6     39.8     83.9   9663.7

Telemetry ON

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         150953           21.0     61.9     54.5     71.3     96.5   5637.1  delivery

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0        1509498          209.7     36.8     26.2     39.8     60.8   6174.0  newOrder

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151110           21.0      9.2      5.5      7.9     14.2   4831.8  orderStatus

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0        1510864          209.8     34.0     14.2     23.1     35.7   8321.5  payment

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151026           21.0     17.0     11.5     18.9     29.4   5637.1  stockLevel

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 7200.0s        0        3473451          482.4     34.6     21.0     46.1     67.1   8321.5
Audit check 9.2.1.7: PASS
Audit check 9.2.2.5.1: PASS
Audit check 9.2.2.5.2: PASS
Audit check 9.2.2.5.3: PASS
Audit check 9.2.2.5.4: PASS
Audit check 9.2.2.5.5: PASS
Audit check 9.2.2.5.6: PASS

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 7200.0s    12579.1  97.8%     36.8     26.2     33.6     39.8     60.8   6174.0

@blathers-crl
Copy link

blathers-crl bot commented Sep 30, 2021

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Resolves cockroachdb#70553

Release justification: category 4

Previously, telemetry logging used a configurable QPS threshold
and sampling rate, for which we would log all statements if we
were under the QPS threshold, and then start sampling at the
given rate once at the threshold. Using this technique meant
that we will often see a sharp decreaes in telemetry logging
once the sampling rate increases, as sampling rates would typically
need to be at low values to accomodate a high QPS.

This commit replaces the above technique with an adaptive sampling
rate which merely logs events to telemetry at a maximum frequency.
Rather than relying on QPS, we will simply track when we have
last logged to the telemtry channel, and decide whether or not to
log a given event accordingly.

Release note (sql change): The cluster setting
`sql.telemetry.query_sampling.qps_threshold`, and
`sql.telemetry.query_sampling.sample_rate` have been removed.
A new setting, `sql.telemetry.query_sampling.max_event_frequency`
has been introduced, with default value of 10 events per second.
@xinhaoz xinhaoz marked this pull request as ready for review September 30, 2021 21:45
Copy link
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants