Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util/metric: HELP text too long for AWS managed prometheus #87112

Closed
antifuchs opened this issue Aug 30, 2022 · 6 comments · Fixed by #87166
Closed

util/metric: HELP text too long for AWS managed prometheus #87112

antifuchs opened this issue Aug 30, 2022 · 6 comments · Fixed by #87166
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community X-blathers-triaged blathers was able to find an owner

Comments

@antifuchs
Copy link

antifuchs commented Aug 30, 2022

Describe the problem

I'm setting up AWS managed prometheus (AMP) and ingesting metrics scraped from hosted cockroachdb, via remote_write on a local prometheus server. This works well if I turn off the remote-writing of metrics metadata. Otherwise, the prometheus server remote-writing to AMP shows this error:

ts=2022-08-30T13:15:59.179Z caller=dedupe.go:112 component=remote level=error remote_name=f52d9c url=https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-[redacted]/api/v1/remote_write msg="non-recoverable error while sending metadata" count=500 err="server returned HTTP status 400 Bad Request: metadata 'HELP' value too long: \"Latency histogram for handling a Raft ready.\\n\\nThis measures the end-to-end-latency of the Raft state advancement loop, and\\nin particular includes:\\n- snapshot application\\n- SST ingestion\\n- durably appe\" metric \"raft_process_handleready_latency\""

IOW, it's complaining about this help text: https://github.com/bdarnell/cockroach/blob/741dacf4d1dbab94bc1ac15d20181fb4f0770e54/pkg/kv/kvserver/metrics.go#L577-L595

To Reproduce

  1. Set up CockroachDB cluster in the cloud or elsewhere.
  2. Set up prometheus metrics scraping the "/_status/vars" metrics path from it.
  3. Set up remote_write into AWS managed prometheus
  4. See the error above, see that ingesting the metrics batch fails.

Expected behavior
I would expect all metadata to be possible to import.

Additional data / screenshots

Environment:

  • CockroachDB version released after Sep 14, 2021
  • Client app prometheus v2.36.2 and aws managed prometheus.

Additional context

I'm not sure the fault here really lies with CRDB (and argh, this HELP text is really well-written and useful!), but I wanted to flag this here as CRDB is the one thing with a public bug tracker that can fix it, in case others run into this issue too.

Jira issue: CRDB-19155

@antifuchs antifuchs added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 30, 2022
@blathers-crl
Copy link

blathers-crl bot commented Aug 30, 2022

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added A-kv-replication Relating to Raft, consensus, and coordination. O-community Originated from the community X-blathers-triaged blathers was able to find an owner T-kv-replication labels Aug 30, 2022
@blathers-crl
Copy link

blathers-crl bot commented Aug 30, 2022

cc @cockroachdb/replication

@antifuchs
Copy link
Author

I have a work-around for the issue, but it's a bit annoying: That is, to turn off the sending of all metric metadata in the remote_write portion, via

metadata_config:
  send: false

@erikgrinaker
Copy link
Contributor

Thanks for reporting! The given help text is 1122 bytes, so I'm guessing that their limit is 1024 bytes or so (could you confirm this?). That seems like a reasonable maximum length to use for these help texts in any case, so we'll try to pare it down and enforce a 1024-byte maximum for these texts.

@erikgrinaker erikgrinaker changed the title prometheus HELP text for some metrics is too long for AWS managed prometheus util/metric: HELP text too long for AWS managed prometheus Aug 31, 2022
@erikgrinaker
Copy link
Contributor

erikgrinaker commented Aug 31, 2022

Posted a PR in #87166, but would be great if you could confirm that 1024 bytes is actually within their limit.

@antifuchs
Copy link
Author

Thank you so much for the super quick turn-around! I found a public bug tracker for AMP and filed a request for docs / improvement there!

craig bot pushed a commit that referenced this issue Sep 19, 2022
86603: changefeedccl: redact user from confluent schema registry r=jayshrivastava a=jayshrivastava

This change redacts the confluent schema registry key
from the jobs table.

Fixes: #85902

Release justification: This change is important because
it prevents a user's secret key from being store in the DB.
The footprint of this change is small as it only touches
a specific changefeed option - confluent schema registry.

Release note (enterprise change): Previously, SHOW CHANGEFEED
JOBS would reveal confluent schema registry user information,
including a user's secret key. This change makes this info
redacted, meaning it will not be stored in CockroachDB
internal tables at all.

87166: kvserver: shorten `raft.process.handleready.latency` help text r=tbg a=erikgrinaker

We should get confirmation in #87112 that this size is below the limit before merging this.

---

AWS managed Prometheus rejects `raft.process.handleready.latency`
because the help text is too long. The text is currently 1123 bytes, so
the limit is suspected to be 1024 bytes. This patch reduces the size of
this help text to 938 bytes.

Resolves #87112.

Release justification: bug fixes and low-risk updates to new functionality

Release note (ops change): Reduced the length of the
`raft.process.handleready.latency` metric help text to avoid it being
rejected by certain Prometheus services.

87535: sql: support `#typeHints` greater than `#placeholders` for prepare stmt r=rafiss a=ZhouXing19

Previous, we only support pgwire prepare stmt with the number of typehints equal or smaller than the number of the placeholders in the query. E.g. the following usage are not supported:

```
Parse {"Name": "s2", "Query": "select $1", "ParameterOIDs":[1043, 1043, 1043]}
```
Where there is 1 placeholder in the query, but 3 type hints.

This commit is to allow mismatching #typeHints and #placeholders. The former can be larger than the latter now.

This feature is needed for [CRDB's support for Hasura GraphQL Engine](hasura/graphql-engine#8839 (comment)).

Release justification: Low risk, high benefit changes to existing functionality

Release note: For pgwire-level prepare statements, add support for the case where the number of the type hints is greater than the number of placeholders in the given query.

87870: kvnemesis: reset op.Result r=erikgrinaker a=tbg

We found out (in #87814) after a wild goose chase that op.Result was not
reset, so unless operations were cognizant of this fact and
always fully repopulated the Result field, weird failures
could result from txn retries.

Reset the field.

Release note: None


88078: ui: update filter labels r=maryliag a=maryliag

Update filter label from "App" to "Application Name" and "Username" to "User Name" on SQL Activity and Insights pages.

Fixes #87960

<img width="467" alt="Screen Shot 2022-09-16 at 6 40 51 PM" src="https://user-images.githubusercontent.com/1017486/190827281-36a9c6df-3e16-4689-bcae-6649b1c2ff86.png">


Release note (ui change): Update filter labels from "App" to "Application Name" and from "Username" to "User Name" on SQL Activity and Insights pages.

88111: sql: fix `pg_get_viewdef` for materialized views r=rafiss a=knz

Fixes #88109.

Release note (bug fix): The function `pg_catalog.pg_get_viewdef` now works properly for materialized views.

Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Jane Xing <zhouxing@uchicago.edu>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Marylia Gutierrez <marylia@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@craig craig bot closed this as completed in #87166 Sep 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community X-blathers-triaged blathers was able to find an owner
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants