
base,kvserver,server: configuration of provisioned bandwidth for a store #86063

Merged (1 commit, Aug 25, 2022)

Conversation

@sumeerbhola (Collaborator) commented on Aug 12, 2022:

A previous PR #85722 added support for disk bandwidth as a bottleneck
resource in the admission control package. To utilize this, admission
control needs to be provided the provisioned bandwidth and the observed
read and write bytes. This PR adds configuration support for this via
the StoreSpec (which is set via the --store flag). The StoreSpec now has
an optional ProvisionedRateSpec that contains the name of the disk
corresponding to the store and an optional provisioned bandwidth,
specified as
provisioned-rate=name=<disk-name>[:bandwidth=<bandwidth-bytes>].

The disk-name is used to map the DiskStats, retrieved via the existing
code in status.GetDiskCounters, to the correct Store. These DiskStats
contain the read and write bytes. The optional bandwidth is used to
override the provisioned bandwidth set via the new cluster setting
kv.store.admission.provisioned_bandwidth.
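
For illustration, a store spec using this syntax might look like the
following (the path, disk name, and bandwidth value are hypothetical):

--store=path=/mnt/data1,provisioned-rate=name=nvme1n1:bandwidth=125000000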

Fixes #82898

Release note (ops change): A disk bandwidth constraint can now be
used to control admission of elastic writes. This requires configuration
for each store, via the --store flag, which now accepts an optional
provisioned-rate field. The provisioned-rate field, if specified,
needs to provide a disk-name for the store and optionally a disk
bandwidth. If the disk bandwidth is not provided, the cluster setting
kv.store.admission.provisioned_bandwidth will be used. The cluster
setting defaults to 0 (which means that the disk bandwidth constraint
is disabled). If the effective disk bandwidth (i.e., the cluster
setting value, possibly overridden by the per-store bandwidth) is 0,
there is no disk bandwidth constraint. Additionally, the admission
control cluster setting
admission.disk_bandwidth_tokens.elastic.enabled (defaults to true)
can be used to turn off enforcement even when all the other
configuration has been set up. Turning off enforcement will still
output all the relevant information about disk bandwidth usage, so
it can be used to observe part of the mechanism in action.
To summarize: to enable this for a cluster with homogeneous disks,
provide a disk-name in the provisioned-rate field in the store spec,
and set the kv.store.admission.provisioned_bandwidth cluster setting
to the bandwidth limit. To only get information about disk bandwidth
usage by elastic traffic (currently via logs, not metrics), do the
above but also set admission.disk_bandwidth_tokens.elastic.enabled
to false.
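
As a concrete illustration (the bandwidth value is hypothetical, and
the humanized byte string assumes the byte-size form of the setting):

SET CLUSTER SETTING kv.store.admission.provisioned_bandwidth = '125MiB';
-- To only observe disk bandwidth usage without enforcement, additionally:
SET CLUSTER SETTING admission.disk_bandwidth_tokens.elastic.enabled = false;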

Release justification: Low risk, high benefit change that allows
an operator to enable new functionality (disabled by default).

@sumeerbhola requested reviews from tbg, irfansharif, and several teams (as code owners) on August 12, 2022.

@sumeerbhola (Collaborator, Author) left a comment:

Lacks any tests, since I want to get an opinion first on the approach.


@irfansharif (Contributor) left a comment:

Didn't look over the code too closely (will do once it's ready with tests), but at a high level this LGTM. We should assume that most deployments are single store and use homogeneous setups, so gear the release notes/help text to emphasize the use of the cluster setting and only use the per-store flag if there's non-homogeneity. Users with multi-store setups but homogeneity in provisioned bandwidth can still get away with using just the cluster setting. Another thing I'd spell out in the help text is what happens if this limit is too high or too low (you've already mentioned that this is optional, but it's worth spelling out the benefits of opting in). Since heterogeneity in provisioned bandwidth is rare (there are likely other problems in such deployments, and I don't recall any incidents where I've observed such heterogeneity), and this is all optional, I'd also be completely ok ripping out the per-store flags and just having the cluster setting. It's simpler and less error-prone. We can see how useful the per-store flag is over the next release once this is out in the wild.

type ProvisionedRateSpec struct {
	// DiskName is the name of the disk observed by the code in disk_counters.go
	// when retrieving stats for this store.
	DiskName string
irfansharif (Contributor) commented:

We don't need this DiskName field, I think. Looking at the top-level StoreSpec, we already have a Path to uniquely identify the disk device we're using. Let's continue using that; it's a required field unless using type=mem, which this doesn't apply to anyway. I'm looking at this test for ex:

{"/mnt/hda1", "", StoreSpec{Path: "/mnt/hda1"}},

This also makes the flag syntax easier (we can drop the name argument).

@@ -273,6 +348,10 @@ var fractionRegex = regexp.MustCompile(`^([-]?([0-9]+\.[0-9]*|[0-9]*\.[0-9]+|[0-
// - 20% -> 20% of the available space
// - 0.2 -> 20% of the available space
// - attrs=xxx:yyy:zzz A colon separated list of optional attributes.
// - provisioned-rate=name=<disk-name>[:bandwidth=<bandwidth-bytes>] The
irfansharif (Contributor) commented:

Ignoring the name thing as discussed above, the syntax this gives us is: cockroach start ... --store=<path>,provisioned-rate=:bandwidth=12500000. Could this instead be cockroach start ... --store=<path>,bandwidth=119MB/s? Similar to the allowable formats for size, but with the (optional) /s suffix. If we later need to learn about provisioned IOPS, we can introduce an iops=42/s thing.


// ProvisionedBandwidthForAdmissionControl set a value of the provisioned
// bandwidth for each store in the cluster.
var ProvisionedBandwidthForAdmissionControl = settings.RegisterIntSetting(
irfansharif (Contributor) commented:

Use RegisterByteSizeSetting instead? Or better yet, add a RegisterByteRateSetting (shouldn't be difficult) to get to SET CLUSTER SETTING kv.store.admission.provisioned_bandwidth = '125 MiB/s'. There are other cluster settings that should've had this but default to the ByteSize variant (server.consistency_check.max_rate for ex.), so I'll defer to you if you want to improve the status quo.

@sumeerbhola (Collaborator, Author) left a comment:


pkg/base/store_spec.go line 168 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

We don't need this DiskName field, I think. Looking at the top-level StoreSpec, we already have a Path to uniquely identify the disk device we're using. Let's continue using that; it's a required field unless using type=mem, which this doesn't apply to anyway. I'm looking at this test for ex:

{"/mnt/hda1", "", StoreSpec{Path: "/mnt/hda1"}},

This also makes the flag syntax easier (we can drop the name argument).

I don't see a way to avoid the name mapping configuration. The stats that we get from disk_counters.go have names like nvme1n1, nvme2n1 (on EBS -- I have not looked at what we see when using PD), so there needs to be a way to map that name to a store. There can also be disks that have nothing to do with any of the stores, and those stats need to be ignored.
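
For illustration, the kind of name-based filtering described here might look like the following sketch (the helper and the name field are hypothetical, not the actual code in this PR):

// Hypothetical sketch: pick out the DiskStats entry whose device name
// matches the disk-name configured for a store.
func statsForDisk(all []diskStats, diskName string) (diskStats, bool) {
	for _, s := range all {
		if s.name == diskName { // the name field is assumed for illustration
			return s, true
		}
	}
	// Devices unrelated to any store (e.g. the boot disk) simply never match.
	return diskStats{}, false
}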

@irfansharif (Contributor) left a comment:


pkg/base/store_spec.go line 168 at r1 (raw file):

Previously, sumeerbhola wrote…

I don't see a way to avoid the name mapping configuration. The stats that we get from disk_counters.go have names like nvme1n1, nvme2n1 (on EBS -- I have not looked at what we see when using PD), so there needs to be a way to map that name to a store. There can also be disks that have nothing to do with any of the stores, and those stats need to be ignored.

Oh, I missed that. Sorry. For a given file path, you can retrieve the corresponding device name (the same as seen by disk_counters.go) with df -P <path>. I don't know offhand what system calls that makes underneath, or whether there's an importable Go library to retrieve just that, but it should exist, or we can write such a library.

Sources: https://stackoverflow.com/questions/13403866/how-to-get-a-device-partition-name-of-a-file. That refers to st_dev, which I think corresponds to https://go.dev/src/syscall/ztypes_linux_amd64.go#L102.
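
As a rough illustration of that idea, a Linux-only sketch (an assumption-laden illustration, not code from this PR) could stat the store path and resolve st_dev to a kernel device name via /sys:

package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// deviceNameForPath resolves the block device backing path, similar in
// spirit to what `df -P <path>` reports (Linux only).
func deviceNameForPath(path string) (string, error) {
	var st unix.Stat_t
	if err := unix.Stat(path, &st); err != nil {
		return "", err
	}
	// /sys/dev/block/<major>:<minor> is a symlink whose target's basename
	// is the kernel device name (e.g. nvme1n1 or sdb1).
	link := fmt.Sprintf("/sys/dev/block/%d:%d",
		unix.Major(uint64(st.Dev)), unix.Minor(uint64(st.Dev)))
	target, err := os.Readlink(link)
	if err != nil {
		return "", err
	}
	return filepath.Base(target), nil
}

func main() {
	name, err := deviceNameForPath(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(name)
}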

@irfansharif (Contributor) left a comment:


pkg/base/store_spec.go line 168 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Oh, I missed that. Sorry. For a given file path, you can retrieve the corresponding device name (the same as seen by disk_counters.go) with df -P <path>. I don't know offhand what system calls that makes underneath, or whether there's an importable Go library to retrieve just that, but it should exist, or we can write such a library.

Sources: https://stackoverflow.com/questions/13403866/how-to-get-a-device-partition-name-of-a-file. That refers to st_dev, which I think corresponds to https://go.dev/src/syscall/ztypes_linux_amd64.go#L102.

I'm also ok just not doing this work in favor of the cluster setting. Asking users to specify the right name is a bit more work on their end, and more steps to get wrong, if it's something we can retrieve programmatically given we have a file path to consult.

@sumeerbhola (Collaborator, Author) left a comment:


pkg/base/store_spec.go line 168 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I'm also ok just not doing this work in favor of the cluster setting. Asking users to specify the right name is a bit more work on their end, and more steps to get wrong, if it's something we can retrieve programmatically given we have a file path to consult.

I don't see a good platform-independent library for this. There is some discussion on https://groups.google.com/g/golang-nuts/c/mu8XMmRXMOk, which is Linux-specific. There is also https://github.com/gyuho/linux-inspect, also Linux-specific (and I don't know whether we want such a dependency anyway).
The name in df does match what disk_counters.go is producing, though the latter contains many entries not in the former, e.g. on EBS I see nvme0n1 and nvme0n1p1, and on PD I see sda, sda1, and sda14, which don't appear in df (sdb is the device for /mnt/data1 on PD).

I can take a dependency on that library and make the name optional. Thoughts?

@irfansharif (Contributor) left a comment:


pkg/base/store_spec.go line 168 at r1 (raw file):

Previously, sumeerbhola wrote…

I don't see a good platform-independent library for this. There is some discussion on https://groups.google.com/g/golang-nuts/c/mu8XMmRXMOk, which is Linux-specific. There is also https://github.com/gyuho/linux-inspect, also Linux-specific (and I don't know whether we want such a dependency anyway).
The name in df does match what disk_counters.go is producing, though the latter contains many entries not in the former, e.g. on EBS I see nvme0n1 and nvme0n1p1, and on PD I see sda, sda1, and sda14, which don't appear in df (sdb is the device for /mnt/data1 on PD).

I can take a dependency on that library and make the name optional. Thoughts?

I would prefer just copying some code over instead of taking on a dependency (that library hasn't seen commits in five years, so we might as well understand what we're copying over). Realistically we'd be maintaining this dependency, so might as well copy it into the tree. We'd also have to make this dependency Linux-specific and provide fallback behaviour for unsupported platforms. Do you want to defer this work for now by only using the cluster setting? (I ask because I don't feel this is adding much given the headache.)

@sumeerbhola (Collaborator, Author) commented:

Previously, irfansharif wrote…

We'd also have to make this dependency Linux-specific and provide fallback behaviour for unsupported platforms. Do you want to defer this work for now by only using the cluster setting? (I ask because I don't feel this is adding much given the headache.)

Yes, I would prefer not taking this on now.
But I can't use only the cluster setting, since we need a name mapping. Are you saying specify the device name also in a cluster setting, since it is likely the same across all nodes, and it would work for single store nodes?

@irfansharif (Contributor) commented:

I'm saying something simpler: instead of introducing the diskname->storeID mapping at all, simply use the summed disk counters:

func getSummedDiskCounters(ctx context.Context) (diskStats, error) {
	diskCounters, err := getDiskCounters(ctx)
	if err != nil {
		return diskStats{}, err
	}
	return sumDiskCounters(diskCounters), nil
}

And perhaps disable this functionality for multi-store nodes until we do the legwork for diskname -> store ID accounting. Does that sound reasonable? This is all entirely opinion on my part.

@sumeerbhola (Collaborator, Author) commented:

Previously, irfansharif wrote…

disable this functionality for multi-store nodes until we do the legwork for diskname -> store ID accounting.

Multi-store nodes were one of the important production cases where disk bandwidth was scarce, so I definitely do not want to do that. How about we stick with this current configuration schema, and add a TODO that we want to eventually eliminate/reduce the need to provide the name. All the stuff in StoreSpec added here is optional, so this should not introduce configuration churn.

@irfansharif (Contributor) commented:

SGTM.

@sumeerbhola (Collaborator, Author) left a comment:

TFTR!
Added tests



pkg/base/store_spec.go line 168 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I would prefer just copying some code over instead of taking on a dependency (that library hasn't seen commits in five years, so we might as well understand what we're copying over). Realistically we'd be maintaining this dependency, so might as well copy it into the tree. We'd also have to make this dependency Linux-specific and provide fallback behaviour for unsupported platforms. Do you want to defer this work for now by only using the cluster setting? (I ask because I don't feel this is adding much given the headache.)

Based on our discussion, I have added a long TODO comment here about eliminating the DiskName field and making ProvisionedRateSpec even more optional.


pkg/base/store_spec.go line 351 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Ignoring the name thing as discussed above, the syntax this gives us is: cockroach start ... --store=<path>,provisioned-rate=:bandwidth=12500000. Could this instead be cockroach start ... --store=<path>,bandwidth=119MB/s? Similar to the allowable formats for size, but with the (optional) /s suffix. If we later need to learn about provisioned IOPS, we can introduce an iops=42/s thing.

I've kept this nested inside provisioned-rate for now, since we do need the disk name and this is all for purposes of rate control by the system. IOPS fits in that naming scheme, which is why it isn't called provisioned-bandwidth.


pkg/kv/kvserver/store.go line 3964 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Use RegisterByteSizeSetting instead? Or better yet, add a RegisterByteRateSetting (shouldn't be difficult) to get to SET CLUSTER SETTING kv.store.admission.provisioned_bandwidth = '125 MiB/s'. There are other cluster settings that should've had this but default to the ByteSize variant (server.consistency_check.max_rate for ex.), so I'll defer to you if you want to improve the status quo.

Switched to RegisterByteSizeSetting
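
For reference, a byte-size registration of this shape might look roughly like the following (a sketch; the settings class, wording, and validation are assumptions, not the exact code merged here):

// Rough sketch of a byte-size cluster setting registration; the class and
// description are illustrative assumptions.
var ProvisionedBandwidthForAdmissionControl = settings.RegisterByteSizeSetting(
	settings.SystemOnly,
	"kv.store.admission.provisioned_bandwidth",
	"provisioned bandwidth for each store, in bytes/s; 0 disables the disk "+
		"bandwidth constraint",
	0, // default: disk bandwidth constraint disabled
)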

@sumeerbhola force-pushed the disk_bandwidth3 branch 2 times, most recently from 904e80d to c895c48 on August 22, 2022 02:33.
@irfansharif (Contributor) left a comment:

Reviewed 2 of 11 files at r2, all commit messages.

@sumeerbhola (Collaborator, Author) commented:

TFTR!

@sumeerbhola (Collaborator, Author) commented:

bors r=irfansharif

@craig (craig bot) commented on Aug 24, 2022:

This PR was included in a batch that was canceled; it will be automatically retried.

@craig (craig bot) commented on Aug 24, 2022:

Build failed (retrying...):

@craig (craig bot) commented on Aug 25, 2022:

Build succeeded:

Development

Successfully merging this pull request may close these issues.

admission: disk bandwidth as a bottleneck resource
3 participants