Skip to content

rfcs: new RFC and tech note about time series for tenants#86524

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
knz:20220821-tenant-timeseries
Apr 28, 2023
Merged

rfcs: new RFC and tech note about time series for tenants#86524
craig[bot] merged 1 commit intocockroachdb:masterfrom
knz:20220821-tenant-timeseries

Conversation

@knz
Copy link
Copy Markdown
Contributor

@knz knz commented Aug 21, 2022

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@knz knz force-pushed the 20220821-tenant-timeseries branch 4 times, most recently from 53bb900 to 86457a3 Compare August 21, 2022 16:25
@knz knz force-pushed the 20220821-tenant-timeseries branch 2 times, most recently from 54ecea9 to b644702 Compare August 21, 2022 16:32
@knz knz marked this pull request as ready for review August 21, 2022 16:32
@knz knz requested review from a team as code owners August 21, 2022 16:32
@knz knz force-pushed the 20220821-tenant-timeseries branch from b644702 to a214db5 Compare August 21, 2022 16:47
Copy link
Copy Markdown
Contributor

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work on the tech note and proposal

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @andreimatei, @dhartunian, and @knz)


docs/RFCS/20220821_tenant_timeseries.md line 249 at r1 (raw file):

  - etc.

  They will also be suffixed with the SQL instance ID, for example:

Thinking about this in a DB Console user context - do you envision a charts page where we allow secondary tenant users to dimension metrics by SQL instance ID? Or instead, to "hide" the concept of multiple SQL instances via aggregations?

Code quote:

 They will also be suffixed with the SQL instance ID, for example:

docs/RFCS/20220821_tenant_timeseries.md line 259 at r1 (raw file):

the SQL instance ID and node ID will match exactly and there will be no change in cardinality.

## Automatic classification of metrics

Thanks for addressing this!

Code quote:

## Automatic classification of metrics

docs/RFCS/20220821_tenant_timeseries.md line 277 at r1 (raw file):

We will introduce the ID as prefix only for metrics defined in the 2nd
registry (as explained below), leavint hose from the 1st registry unchanged.

nit: typo - "leaving those"

Suggestion:

leavint hose

docs/tech-notes/timeseries.md line 1 at r1 (raw file):

# The internal time series database (tsdb)

This is a wonderful piece of community service you've done writing this - thank you so much!

Code quote:

# The internal time series database (tsdb)

Copy link
Copy Markdown
Collaborator

@ajstorm ajstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @dhartunian, and @knz)


docs/RFCS/20220821_tenant_timeseries.md line 141 at r1 (raw file):

2. then, even if hypothetically we did start the tsdb collector,
   we would then see *ambiguity in all the SQL timeseries*.

Do we have a sense for how significant of a problem this will turn out to be in practice? I'd think that the number of SQL statements being run from the system/admin tenant would be a small fraction of those executed in the app tenant. With that in mind, are there specific examples you can cite where the UX is confusing?


docs/RFCS/20220821_tenant_timeseries.md line 165 at r1 (raw file):

   data points.

## Objective 2: desired semantics

I found it a bit confusing that the objectives were split by the "Problems with..." section. Does it make sense to cover the "Problems with..." topic first, and then cover the two objectives back-to-back?


docs/RFCS/20220821_tenant_timeseries.md line 174 at r1 (raw file):

  able to see historical metrics for the app tenant's SQL layer.

- By default, users accessing the DB Console for the application

Can you explain the motivation for this one? I'd think that users access the DB Console would need some way to access the storage metrics, but that method could be to toggle to the system tenant and view them there.


docs/RFCS/20220821_tenant_timeseries.md line 176 at r1 (raw file):

- By default, users accessing the DB Console for the application
  tenant must be able to see the storage metrics (KV-level) too,
  however an option must exist to revoke this access: in CC Dedicated,

CC Dedicated and Serverless?


docs/RFCS/20220821_tenant_timeseries.md line 201 at r1 (raw file):

- we will introduce a tenant ID prefix for only those timeseries that
  are scoped to a tenant (i.e. KV-level timeseries will not be
  prefixed, but SQL-level timeseries will).

Can you explain the motivation for not adding a tenant prefix for KV-level timeseries data? Is it because it's largely unnecessary? Helps with upgrades?


docs/RFCS/20220821_tenant_timeseries.md line 343 at r1 (raw file):

- SQL metrics will be exported separately, one per tenant, with the
  tenant *name* (not ID) included as a metric label

Can you add a comment here as to why we're going with tenant name and not tenant ID? Is there any concern down the road that we may have to change the name and this will create an upgrade challenge? Alternatively, if we ever allow true multi-tenancy (or even one additional tenant) within a shared process, will each tenant have to be named?


docs/RFCS/20220821_tenant_timeseries.md line 459 at r1 (raw file):

  matches the tenant ID prefix in the timeseries name.

The capability WRITE_TENANT_TIMESERIES will not be granted for CC

Does this imply that for Serverless, there will continue to be no SQL metrics written for tenants? If so, do you see this as a temporary state? Or do you think that long-term we'd not want to generate SQL metrics for tenants?


docs/RFCS/20220821_tenant_timeseries.md line 616 at r1 (raw file):

It also introduces the option (but not the mandate) to make separate
SQL pods able to record their timeseries in KV, via the new capability
WRITE_TENANT_TIMESERIES.

Is this also required to make this proposal work for Serverless? If so, perhaps you want to motivate the need for this further by leaning on that additional use case.

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @andreimatei, @dhartunian, and @knz)


docs/RFCS/20220821_tenant_timeseries.md line 201 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Can you explain the motivation for not adding a tenant prefix for KV-level timeseries data? Is it because it's largely unnecessary? Helps with upgrades?

What would it mean to add a tenant prefix for kv-level timeseries data? There are some kv metrics would could be viewed as tenant scoped, in which case, we could publish tenant-level timeseries data with such a prefix, but the majority of KV metrics don't map nicely to tenants.


docs/RFCS/20220821_tenant_timeseries.md line 343 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Can you add a comment here as to why we're going with tenant name and not tenant ID? Is there any concern down the road that we may have to change the name and this will create an upgrade challenge? Alternatively, if we ever allow true multi-tenancy (or even one additional tenant) within a shared process, will each tenant have to be named?

We could choose to export use both name and ID. It would not affect the cardinality of metrics and it would allow a future where we export tenant metrics without a tenant_name label. Note that kvserver already exports metrics with a tenant_id label to prometheus


docs/RFCS/20220821_tenant_timeseries.md line 641 at r1 (raw file):

# Unresolved questions

N/A

I think we may as well list metrics related to non-app secondary tenants as unresolved.

@knz knz force-pushed the 20220821-tenant-timeseries branch from a214db5 to ca62873 Compare August 23, 2022 09:43
Copy link
Copy Markdown
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajstorm, @andreimatei, @dhartunian, and @knz)


docs/RFCS/20220821_tenant_timeseries.md line 141 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Do we have a sense for how significant of a problem this will turn out to be in practice? I'd think that the number of SQL statements being run from the system/admin tenant would be a small fraction of those executed in the app tenant. With that in mind, are there specific examples you can cite where the UX is confusing?

If a customer comes to us and says "I didn't run this statement, and yet the counter is increasing, how can I even trust this counter?" what do you propose we tell them?


docs/RFCS/20220821_tenant_timeseries.md line 174 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Can you explain the motivation for this one? I'd think that users access the DB Console would need some way to access the storage metrics, but that method could be to toggle to the system tenant and view them there.

Good Q. After all I too was personally where you are here.

  1. The proposal here stems from a strong request from the O11y team / PM to offer a DB console experience that's a "drop in" replacement for the single-tenancy experience. They think that mandating that toggle for users who are evolving from single-tenancy is adding too much friction.

  2. We've been requested to offer CC Dedicated users the ability to view KV-level metrics, as some of these metrics can be used for performance optimizations. Since Dedicated users won't have access to system tenant db console, we need to achieve this in a different way.


docs/RFCS/20220821_tenant_timeseries.md line 176 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

CC Dedicated and Serverless?

Both indeed.


docs/RFCS/20220821_tenant_timeseries.md line 201 at r1 (raw file):

There are some kv metrics would could be viewed as tenant scoped

Andrew that's interesting, do you have some examples in mind?


docs/RFCS/20220821_tenant_timeseries.md line 249 at r1 (raw file):

we allow secondary tenant users to dimension metrics by SQL instance ID?

I was hoping that we would trick the DB console into considering the SQL instance ID as a node ID, without any change in the code. And then the UX would be the same with SQL instances as nodes currently.


docs/RFCS/20220821_tenant_timeseries.md line 277 at r1 (raw file):

Previously, abarganier (Alex Barganier) wrote…

nit: typo - "leaving those"

Done.

Copy link
Copy Markdown
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajstorm, @ajwerner, @andreimatei, @dhartunian, and @knz)


docs/RFCS/20220821_tenant_timeseries.md line 459 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Does this imply that for Serverless, there will continue to be no SQL metrics written for tenants? If so, do you see this as a temporary state? Or do you think that long-term we'd not want to generate SQL metrics for tenants?

I have no opinion on this. The RFC aims to introduce the mechanisms, it's then for PM to decide how to use them.


docs/RFCS/20220821_tenant_timeseries.md line 616 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Is this also required to make this proposal work for Serverless? If so, perhaps you want to motivate the need for this further by leaning on that additional use case.

I don't think we have pressure either way, so I don't think the RFC needs to mention that here.

(A RFC's purpose is not to discuss product-level trade-offs.)


docs/RFCS/20220821_tenant_timeseries.md line 641 at r1 (raw file):

Previously, ajwerner wrote…

I think we may as well list metrics related to non-app secondary tenants as unresolved.

What do you mean?

@ajstorm ajstorm requested review from abarganier and ajwerner August 26, 2022 15:35
Copy link
Copy Markdown
Collaborator

@ajstorm ajstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajwerner, @andreimatei, @dhartunian, and @knz)


docs/RFCS/20220821_tenant_timeseries.md line 141 at r1 (raw file):

Previously, knz (kena) wrote…

If a customer comes to us and says "I didn't run this statement, and yet the counter is increasing, how can I even trust this counter?" what do you propose we tell them?

Yeah, I could see how it could be confusing from an "app tenant" perspective once SH/Dedicated customers start internalizing the separation of roles between the system and app tenants. My question was more around the short term confusion for early adopters of the UA who haven't yet switched mindsets to an isolated world.


docs/RFCS/20220821_tenant_timeseries.md line 174 at r1 (raw file):

Previously, knz (kena) wrote…

Good Q. After all I too was personally where you are here.

  1. The proposal here stems from a strong request from the O11y team / PM to offer a DB console experience that's a "drop in" replacement for the single-tenancy experience. They think that mandating that toggle for users who are evolving from single-tenancy is adding too much friction.

  2. We've been requested to offer CC Dedicated users the ability to view KV-level metrics, as some of these metrics can be used for performance optimizations. Since Dedicated users won't have access to system tenant db console, we need to achieve this in a different way.

Interesting. I was unaware of (1), and didn't see it in my last review of the DB Console design for UA. Is this notion formalized somewhere? (2) is new to me too, but makes sense.


docs/RFCS/20220821_tenant_timeseries.md line 201 at r1 (raw file):

What would it mean to add a tenant prefix for kv-level timeseries data?
I was just thinking that there might be value to prefixing KV metrics with 0, to denote that they were emitted by the system tenant. Are there any benefits to being explicit here instead of implicit?


docs/RFCS/20220821_tenant_timeseries.md line 343 at r1 (raw file):

Previously, ajwerner wrote…

We could choose to export use both name and ID. It would not affect the cardinality of metrics and it would allow a future where we export tenant metrics without a tenant_name label. Note that kvserver already exports metrics with a tenant_id label to prometheus

That SGTM. I do worry that requiring a name could be fragile.


docs/RFCS/20220821_tenant_timeseries.md line 459 at r1 (raw file):

Previously, knz (kena) wrote…

I have no opinion on this. The RFC aims to introduce the mechanisms, it's then for PM to decide how to use them.

That makes sense. We might want to soften the language here to say "...will not initially be granted for CC Serverless tenant, but may be granted at some point down the road".

craig Bot pushed a commit that referenced this pull request Aug 30, 2022
86563: ts: fix the pretty-printing of tsd keys r=abarganier a=knz

Found while working on #86524.

Release justification: bug fix

Release note (bug fix): When printing keys and range start/end
boundaries for time series, the displayed structure of keys
was incorrect. This is now fixed.

86904: sql: allow mismatch type numbers in `PREPARE` statement r=rafiss a=ZhouXing19

Previously, we only allow having the same number of parameters and placeholders
in a `PREPARE` statement. This is not compatible with Postgres14's behavior.

This commit is to loosen the restriction and enable this compatibility.
We now take `max(#placeholders, #parameters)` as the true length
 of parameters of the prepare statement. For each parameter, we first
look at the type deduced from the query stmt. If we can't deduce it, 
we take the type hint for this param.

I.e. we now allow queries such as 

```
PREPARE args_test_many(int, int) as select $1
// 2 parameters, but only 1 placeholder in the query.

PREPARE args_test_few(int) as select $1, $2::int
// 1 parameter, but 2 placeholders in the query.
```

fixes #86375

Release justification: Low risk, high benefit changes to existing functionality
Release note: allow mismatch type numbers in `PREPARE` statement

87105: roachtest: skip flaky acceptance/version-upgrade r=tbg a=adityamaru

Skipping the flaky roachtest while we stabilize it.

Informs: #87104

Release note: None

Release justification: testing only change

87117: bazci: fix output path computation r=rail a=rickystewart

These updates were happening in-place so `bazci` was constructing big,
silly paths like `backupccl_test/shard_6_of_16/shard_7_of_16/shard_13_of_16/...`
We just need to copy the variable here.

Release justification: Non-production code changes
Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Jane Xing <zhouxing@uchicago.edu>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Copy link
Copy Markdown
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some high level comments. I'm in disagreement about whether we should do this.

I understand the motivation to:

  1. provide a path independent of obs service
  2. cover the scenario where sql runs in a separate process in dedicated
    but I'm not convinced that those are strong enough requirements given our limited time available.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajstorm, @ajwerner, @andreimatei, and @knz)


docs/RFCS/20220821_tenant_timeseries.md line 141 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Yeah, I could see how it could be confusing from an "app tenant" perspective once SH/Dedicated customers start internalizing the separation of roles between the system and app tenants. My question was more around the short term confusion for early adopters of the UA who haven't yet switched mindsets to an isolated world.

We currently have this behavior from a workload vs system perspective and I am not aware that this is a source of confusion for customers.


docs/RFCS/20220821_tenant_timeseries.md line 174 at r1 (raw file):

The proposal here stems from a strong request from the O11y team / PM to offer a DB console experience that's a "drop in" replacement for the single-tenancy experience. They think that mandating that toggle for users who are evolving from single-tenancy is adding too much friction.

I can understand this desire to have the metrics there by default but I assumed this would be done through access to all the global metrics, instead of by making separate tenant metrics for SQL stuff.

We've been requested to offer CC Dedicated users the ability to view KV-level metrics, as some of these metrics can be used for performance optimizations. Since Dedicated users won't have access to system tenant db console, we need to achieve this in a different way.

This can be accomplished by allowing tsdb access for the app tenant.


docs/RFCS/20220821_tenant_timeseries.md line 19 at r2 (raw file):

  (Tenant timeseries storage would not be enabled
  in CC Serverless, but it would be in Self-hosted
  and CC Dedicated for the time being.)

I am not on the same page with you about this project.
My thinking was that multi-tenant tsdb storage was not something we should do since it complicates an aging timeseries subsystem that we want to reduce reliance on instead of continue extending.

My expectation:
TSDB should continue to work as-is and tenants should gain access to query it through the capabilities subsystem.
Dedicated clusters with multiple tenants do not get per-tenant metrics in tsdb.


docs/RFCS/20220821_tenant_timeseries.md line 163 at r2 (raw file):

   This means that when DB Console is opened for the application
   tenant, it cannot even call the `/ts/db` operation to retrieve
   data points.

I assumed this would be achieved via tenant capabilities to query the "global" tsdb.


docs/RFCS/20220821_tenant_timeseries.md line 625 at r2 (raw file):

The main alternative is “do nothing”, but this will result in a UX
regression when a CC Dedicated/SH user migrates to multi-tenancy.

The option of keeping a global tsdb and letting tenants access the whole thing should be considered here.

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not publishing these comments when I typed them over a month ago.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @ajstorm, @andreimatei, and @knz)


docs/RFCS/20220821_tenant_timeseries.md line 201 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

There are some kv metrics would could be viewed as tenant scoped

Andrew that's interesting, do you have some examples in mind?

The kvserver exports per-tenant metrics to prometheus for storage here and rate-limiting here. Those are the only per-tenant metrics we track.


docs/RFCS/20220821_tenant_timeseries.md line 641 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

What do you mean?

Above you state that the RFC doesn't take a position on whether secondary tenant sql pods should write metrics to the internal timeseries database. If they were to, we'd need to figure out the cost-control and fairness aspects of that RPC. It feels like you could add something down here about how secondary tenants in serverless and their interactions with the timeseries database remains unresolved.

craig Bot pushed a commit that referenced this pull request Apr 14, 2023
101522: docs: new tech note on the timeseries database r=abarganier a=knz

Extracted from #86524 until it merges.

Release note: None
Epic: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@blathers-crl
Copy link
Copy Markdown

blathers-crl Bot commented Apr 26, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@knz knz force-pushed the 20220821-tenant-timeseries branch from a920516 to 21941a1 Compare April 26, 2023 14:10
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 26, 2023

@aadityasondhi as we discussed, I just added a commit into the branch to simplify/reduce the implementation sections to not be too detailed. Here are the next steps:

  1. copy the file into a new PR, same title, new date.
  2. keep the same motivations/objectives.
  3. update the implementation section to sketch how you did the work.
  4. add some notes in the impl section to refer to what has not been done yet (foremost, the ability for secondary tenants to query the system's metrics but not that of other tenants)
  5. update the summary at the top.

When you have the new PR number, feel free to add a link to it here, so we can merge this PR (with the old RFC).

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 26, 2023

Note to self (or maybe for you Aaditya) - we need to file a followup issue to change the code to:

  • make tenants able to observe their own timeseries without a capability
  • make the new capability control access to system-wide metrics by a tenant (but not that of other tenants)

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 27, 2023

Here's the new issue about the tsdb capability: #102414

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 27, 2023

@aadityasondhi yes, these changes work. Can you squash + rebase on latest master? That will unlock CI.

@aadityasondhi aadityasondhi force-pushed the 20220821-tenant-timeseries branch from 3bd611f to 9cb05b2 Compare April 27, 2023 18:14
@aadityasondhi aadityasondhi requested review from a team April 27, 2023 18:14
@aadityasondhi aadityasondhi requested a review from a team as a code owner April 27, 2023 18:14
@aadityasondhi aadityasondhi requested review from rytaft and removed request for andreimatei April 27, 2023 18:14
Release justification: docs only change

Release note: None
@aadityasondhi aadityasondhi force-pushed the 20220821-tenant-timeseries branch from 9cb05b2 to ffb0b51 Compare April 27, 2023 18:14
@aadityasondhi
Copy link
Copy Markdown
Contributor

Rebased, squashed, and green CI

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 28, 2023

LGTM.

@aadityasondhi since you have reviewed it, can you approve the PR? We need the approval to merge.

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 28, 2023

bors r=aadityasondhi

@craig
Copy link
Copy Markdown
Contributor

craig Bot commented Apr 28, 2023

Build succeeded:

@craig craig Bot merged commit 59dc183 into cockroachdb:master Apr 28, 2023
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Apr 30, 2023

bors r=aadityasondhi

@craig
Copy link
Copy Markdown
Contributor

craig Bot commented Apr 30, 2023

Already running a review

@knz knz deleted the 20220821-tenant-timeseries branch April 30, 2023 12:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants