rfcs: new RFC and tech note about time series for tenants#86524
rfcs: new RFC and tech note about time series for tenants#86524craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
53bb900 to
86457a3
Compare
54ecea9 to
b644702
Compare
b644702 to
a214db5
Compare
abarganier
left a comment
There was a problem hiding this comment.
Great work on the tech note and proposal
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @andreimatei, @dhartunian, and @knz)
docs/RFCS/20220821_tenant_timeseries.md line 249 at r1 (raw file):
- etc. They will also be suffixed with the SQL instance ID, for example:
Thinking about this in a DB Console user context - do you envision a charts page where we allow secondary tenant users to dimension metrics by SQL instance ID? Or instead, to "hide" the concept of multiple SQL instances via aggregations?
Code quote:
They will also be suffixed with the SQL instance ID, for example:docs/RFCS/20220821_tenant_timeseries.md line 259 at r1 (raw file):
the SQL instance ID and node ID will match exactly and there will be no change in cardinality. ## Automatic classification of metrics
Thanks for addressing this!
Code quote:
## Automatic classification of metricsdocs/RFCS/20220821_tenant_timeseries.md line 277 at r1 (raw file):
We will introduce the ID as prefix only for metrics defined in the 2nd registry (as explained below), leavint hose from the 1st registry unchanged.
nit: typo - "leaving those"
Suggestion:
leavint hosedocs/tech-notes/timeseries.md line 1 at r1 (raw file):
# The internal time series database (tsdb)
This is a wonderful piece of community service you've done writing this - thank you so much!
Code quote:
# The internal time series database (tsdb)
ajstorm
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @dhartunian, and @knz)
docs/RFCS/20220821_tenant_timeseries.md line 141 at r1 (raw file):
2. then, even if hypothetically we did start the tsdb collector, we would then see *ambiguity in all the SQL timeseries*.
Do we have a sense for how significant of a problem this will turn out to be in practice? I'd think that the number of SQL statements being run from the system/admin tenant would be a small fraction of those executed in the app tenant. With that in mind, are there specific examples you can cite where the UX is confusing?
docs/RFCS/20220821_tenant_timeseries.md line 165 at r1 (raw file):
data points. ## Objective 2: desired semantics
I found it a bit confusing that the objectives were split by the "Problems with..." section. Does it make sense to cover the "Problems with..." topic first, and then cover the two objectives back-to-back?
docs/RFCS/20220821_tenant_timeseries.md line 174 at r1 (raw file):
able to see historical metrics for the app tenant's SQL layer. - By default, users accessing the DB Console for the application
Can you explain the motivation for this one? I'd think that users access the DB Console would need some way to access the storage metrics, but that method could be to toggle to the system tenant and view them there.
docs/RFCS/20220821_tenant_timeseries.md line 176 at r1 (raw file):
- By default, users accessing the DB Console for the application tenant must be able to see the storage metrics (KV-level) too, however an option must exist to revoke this access: in CC Dedicated,
CC Dedicated and Serverless?
docs/RFCS/20220821_tenant_timeseries.md line 201 at r1 (raw file):
- we will introduce a tenant ID prefix for only those timeseries that are scoped to a tenant (i.e. KV-level timeseries will not be prefixed, but SQL-level timeseries will).
Can you explain the motivation for not adding a tenant prefix for KV-level timeseries data? Is it because it's largely unnecessary? Helps with upgrades?
docs/RFCS/20220821_tenant_timeseries.md line 343 at r1 (raw file):
- SQL metrics will be exported separately, one per tenant, with the tenant *name* (not ID) included as a metric label
Can you add a comment here as to why we're going with tenant name and not tenant ID? Is there any concern down the road that we may have to change the name and this will create an upgrade challenge? Alternatively, if we ever allow true multi-tenancy (or even one additional tenant) within a shared process, will each tenant have to be named?
docs/RFCS/20220821_tenant_timeseries.md line 459 at r1 (raw file):
matches the tenant ID prefix in the timeseries name. The capability WRITE_TENANT_TIMESERIES will not be granted for CC
Does this imply that for Serverless, there will continue to be no SQL metrics written for tenants? If so, do you see this as a temporary state? Or do you think that long-term we'd not want to generate SQL metrics for tenants?
docs/RFCS/20220821_tenant_timeseries.md line 616 at r1 (raw file):
It also introduces the option (but not the mandate) to make separate SQL pods able to record their timeseries in KV, via the new capability WRITE_TENANT_TIMESERIES.
Is this also required to make this proposal work for Serverless? If so, perhaps you want to motivate the need for this further by leaning on that additional use case.
ajwerner
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @andreimatei, @dhartunian, and @knz)
docs/RFCS/20220821_tenant_timeseries.md line 201 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Can you explain the motivation for not adding a tenant prefix for KV-level timeseries data? Is it because it's largely unnecessary? Helps with upgrades?
What would it mean to add a tenant prefix for kv-level timeseries data? There are some kv metrics would could be viewed as tenant scoped, in which case, we could publish tenant-level timeseries data with such a prefix, but the majority of KV metrics don't map nicely to tenants.
docs/RFCS/20220821_tenant_timeseries.md line 343 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Can you add a comment here as to why we're going with tenant name and not tenant ID? Is there any concern down the road that we may have to change the name and this will create an upgrade challenge? Alternatively, if we ever allow true multi-tenancy (or even one additional tenant) within a shared process, will each tenant have to be named?
We could choose to export use both name and ID. It would not affect the cardinality of metrics and it would allow a future where we export tenant metrics without a tenant_name label. Note that kvserver already exports metrics with a tenant_id label to prometheus
docs/RFCS/20220821_tenant_timeseries.md line 641 at r1 (raw file):
# Unresolved questions N/A
I think we may as well list metrics related to non-app secondary tenants as unresolved.
a214db5 to
ca62873
Compare
knz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajstorm, @andreimatei, @dhartunian, and @knz)
docs/RFCS/20220821_tenant_timeseries.md line 141 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Do we have a sense for how significant of a problem this will turn out to be in practice? I'd think that the number of SQL statements being run from the system/admin tenant would be a small fraction of those executed in the app tenant. With that in mind, are there specific examples you can cite where the UX is confusing?
If a customer comes to us and says "I didn't run this statement, and yet the counter is increasing, how can I even trust this counter?" what do you propose we tell them?
docs/RFCS/20220821_tenant_timeseries.md line 174 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Can you explain the motivation for this one? I'd think that users access the DB Console would need some way to access the storage metrics, but that method could be to toggle to the system tenant and view them there.
Good Q. After all I too was personally where you are here.
-
The proposal here stems from a strong request from the O11y team / PM to offer a DB console experience that's a "drop in" replacement for the single-tenancy experience. They think that mandating that toggle for users who are evolving from single-tenancy is adding too much friction.
-
We've been requested to offer CC Dedicated users the ability to view KV-level metrics, as some of these metrics can be used for performance optimizations. Since Dedicated users won't have access to system tenant db console, we need to achieve this in a different way.
docs/RFCS/20220821_tenant_timeseries.md line 176 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
CC Dedicated and Serverless?
Both indeed.
docs/RFCS/20220821_tenant_timeseries.md line 201 at r1 (raw file):
There are some kv metrics would could be viewed as tenant scoped
Andrew that's interesting, do you have some examples in mind?
docs/RFCS/20220821_tenant_timeseries.md line 249 at r1 (raw file):
we allow secondary tenant users to dimension metrics by SQL instance ID?
I was hoping that we would trick the DB console into considering the SQL instance ID as a node ID, without any change in the code. And then the UX would be the same with SQL instances as nodes currently.
docs/RFCS/20220821_tenant_timeseries.md line 277 at r1 (raw file):
Previously, abarganier (Alex Barganier) wrote…
nit: typo - "leaving those"
Done.
knz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajstorm, @ajwerner, @andreimatei, @dhartunian, and @knz)
docs/RFCS/20220821_tenant_timeseries.md line 459 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Does this imply that for Serverless, there will continue to be no SQL metrics written for tenants? If so, do you see this as a temporary state? Or do you think that long-term we'd not want to generate SQL metrics for tenants?
I have no opinion on this. The RFC aims to introduce the mechanisms, it's then for PM to decide how to use them.
docs/RFCS/20220821_tenant_timeseries.md line 616 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Is this also required to make this proposal work for Serverless? If so, perhaps you want to motivate the need for this further by leaning on that additional use case.
I don't think we have pressure either way, so I don't think the RFC needs to mention that here.
(A RFC's purpose is not to discuss product-level trade-offs.)
docs/RFCS/20220821_tenant_timeseries.md line 641 at r1 (raw file):
Previously, ajwerner wrote…
I think we may as well list metrics related to non-app secondary tenants as unresolved.
What do you mean?
ajstorm
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, @andreimatei, @dhartunian, and @knz)
docs/RFCS/20220821_tenant_timeseries.md line 141 at r1 (raw file):
Previously, knz (kena) wrote…
If a customer comes to us and says "I didn't run this statement, and yet the counter is increasing, how can I even trust this counter?" what do you propose we tell them?
Yeah, I could see how it could be confusing from an "app tenant" perspective once SH/Dedicated customers start internalizing the separation of roles between the system and app tenants. My question was more around the short term confusion for early adopters of the UA who haven't yet switched mindsets to an isolated world.
docs/RFCS/20220821_tenant_timeseries.md line 174 at r1 (raw file):
Previously, knz (kena) wrote…
Good Q. After all I too was personally where you are here.
The proposal here stems from a strong request from the O11y team / PM to offer a DB console experience that's a "drop in" replacement for the single-tenancy experience. They think that mandating that toggle for users who are evolving from single-tenancy is adding too much friction.
We've been requested to offer CC Dedicated users the ability to view KV-level metrics, as some of these metrics can be used for performance optimizations. Since Dedicated users won't have access to system tenant db console, we need to achieve this in a different way.
Interesting. I was unaware of (1), and didn't see it in my last review of the DB Console design for UA. Is this notion formalized somewhere? (2) is new to me too, but makes sense.
docs/RFCS/20220821_tenant_timeseries.md line 201 at r1 (raw file):
What would it mean to add a tenant prefix for kv-level timeseries data?
I was just thinking that there might be value to prefixing KV metrics with 0, to denote that they were emitted by the system tenant. Are there any benefits to being explicit here instead of implicit?
docs/RFCS/20220821_tenant_timeseries.md line 343 at r1 (raw file):
Previously, ajwerner wrote…
We could choose to export use both name and ID. It would not affect the cardinality of metrics and it would allow a future where we export tenant metrics without a tenant_name label. Note that kvserver already exports metrics with a
tenant_idlabel to prometheus
That SGTM. I do worry that requiring a name could be fragile.
docs/RFCS/20220821_tenant_timeseries.md line 459 at r1 (raw file):
Previously, knz (kena) wrote…
I have no opinion on this. The RFC aims to introduce the mechanisms, it's then for PM to decide how to use them.
That makes sense. We might want to soften the language here to say "...will not initially be granted for CC Serverless tenant, but may be granted at some point down the road".
86563: ts: fix the pretty-printing of tsd keys r=abarganier a=knz Found while working on #86524. Release justification: bug fix Release note (bug fix): When printing keys and range start/end boundaries for time series, the displayed structure of keys was incorrect. This is now fixed. 86904: sql: allow mismatch type numbers in `PREPARE` statement r=rafiss a=ZhouXing19 Previously, we only allow having the same number of parameters and placeholders in a `PREPARE` statement. This is not compatible with Postgres14's behavior. This commit is to loosen the restriction and enable this compatibility. We now take `max(#placeholders, #parameters)` as the true length of parameters of the prepare statement. For each parameter, we first look at the type deduced from the query stmt. If we can't deduce it, we take the type hint for this param. I.e. we now allow queries such as ``` PREPARE args_test_many(int, int) as select $1 // 2 parameters, but only 1 placeholder in the query. PREPARE args_test_few(int) as select $1, $2::int // 1 parameter, but 2 placeholders in the query. ``` fixes #86375 Release justification: Low risk, high benefit changes to existing functionality Release note: allow mismatch type numbers in `PREPARE` statement 87105: roachtest: skip flaky acceptance/version-upgrade r=tbg a=adityamaru Skipping the flaky roachtest while we stabilize it. Informs: #87104 Release note: None Release justification: testing only change 87117: bazci: fix output path computation r=rail a=rickystewart These updates were happening in-place so `bazci` was constructing big, silly paths like `backupccl_test/shard_6_of_16/shard_7_of_16/shard_13_of_16/...` We just need to copy the variable here. Release justification: Non-production code changes Release note: None Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: Jane Xing <zhouxing@uchicago.edu> Co-authored-by: adityamaru <adityamaru@gmail.com> Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
dhartunian
left a comment
There was a problem hiding this comment.
Added some high level comments. I'm in disagreement about whether we should do this.
I understand the motivation to:
- provide a path independent of obs service
- cover the scenario where sql runs in a separate process in dedicated
but I'm not convinced that those are strong enough requirements given our limited time available.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajstorm, @ajwerner, @andreimatei, and @knz)
docs/RFCS/20220821_tenant_timeseries.md line 141 at r1 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Yeah, I could see how it could be confusing from an "app tenant" perspective once SH/Dedicated customers start internalizing the separation of roles between the system and app tenants. My question was more around the short term confusion for early adopters of the UA who haven't yet switched mindsets to an isolated world.
We currently have this behavior from a workload vs system perspective and I am not aware that this is a source of confusion for customers.
docs/RFCS/20220821_tenant_timeseries.md line 174 at r1 (raw file):
The proposal here stems from a strong request from the O11y team / PM to offer a DB console experience that's a "drop in" replacement for the single-tenancy experience. They think that mandating that toggle for users who are evolving from single-tenancy is adding too much friction.
I can understand this desire to have the metrics there by default but I assumed this would be done through access to all the global metrics, instead of by making separate tenant metrics for SQL stuff.
We've been requested to offer CC Dedicated users the ability to view KV-level metrics, as some of these metrics can be used for performance optimizations. Since Dedicated users won't have access to system tenant db console, we need to achieve this in a different way.
This can be accomplished by allowing tsdb access for the app tenant.
docs/RFCS/20220821_tenant_timeseries.md line 19 at r2 (raw file):
(Tenant timeseries storage would not be enabled in CC Serverless, but it would be in Self-hosted and CC Dedicated for the time being.)
I am not on the same page with you about this project.
My thinking was that multi-tenant tsdb storage was not something we should do since it complicates an aging timeseries subsystem that we want to reduce reliance on instead of continue extending.
My expectation:
TSDB should continue to work as-is and tenants should gain access to query it through the capabilities subsystem.
Dedicated clusters with multiple tenants do not get per-tenant metrics in tsdb.
docs/RFCS/20220821_tenant_timeseries.md line 163 at r2 (raw file):
This means that when DB Console is opened for the application tenant, it cannot even call the `/ts/db` operation to retrieve data points.
I assumed this would be achieved via tenant capabilities to query the "global" tsdb.
docs/RFCS/20220821_tenant_timeseries.md line 625 at r2 (raw file):
The main alternative is “do nothing”, but this will result in a UX regression when a CC Dedicated/SH user migrates to multi-tenancy.
The option of keeping a global tsdb and letting tenants access the whole thing should be considered here.
ajwerner
left a comment
There was a problem hiding this comment.
Sorry for not publishing these comments when I typed them over a month ago.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajstorm, @andreimatei, and @knz)
docs/RFCS/20220821_tenant_timeseries.md line 201 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
There are some kv metrics would could be viewed as tenant scoped
Andrew that's interesting, do you have some examples in mind?
The kvserver exports per-tenant metrics to prometheus for storage here and rate-limiting here. Those are the only per-tenant metrics we track.
docs/RFCS/20220821_tenant_timeseries.md line 641 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
What do you mean?
Above you state that the RFC doesn't take a position on whether secondary tenant sql pods should write metrics to the internal timeseries database. If they were to, we'd need to figure out the cost-control and fairness aspects of that RPC. It feels like you could add something down here about how secondary tenants in serverless and their interactions with the timeseries database remains unresolved.
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
a920516 to
21941a1
Compare
|
@aadityasondhi as we discussed, I just added a commit into the branch to simplify/reduce the implementation sections to not be too detailed. Here are the next steps:
When you have the new PR number, feel free to add a link to it here, so we can merge this PR (with the old RFC). |
|
Note to self (or maybe for you Aaditya) - we need to file a followup issue to change the code to:
|
|
Here's the new issue about the tsdb capability: #102414 |
|
@aadityasondhi yes, these changes work. Can you squash + rebase on latest master? That will unlock CI. |
3bd611f to
9cb05b2
Compare
Release justification: docs only change Release note: None
9cb05b2 to
ffb0b51
Compare
|
Rebased, squashed, and green CI |
|
LGTM. @aadityasondhi since you have reviewed it, can you approve the PR? We need the approval to merge. |
|
bors r=aadityasondhi |
|
Build succeeded: |
|
bors r=aadityasondhi |
|
Already running a review |
RFC text: https://github.com/knz/cockroach/blob/20220821-tenant-timeseries/docs/RFCS/20220821_tenant_timeseries.md
Tech note: https://github.com/knz/cockroach/blob/20220821-tenant-timeseries/docs/tech-notes/timeseries.md
Release justification: docs only change
Epic: CRDB-18797