roachtest: harmonize GCE and AWS machine types #111140
4e356ad to e478b80
e478b80 to d315ea8
CI smoke test in progress: SELECT_PROBABILITY=0.4 |
Will the global Ice Lake bump affect roachperf benchmarks? If so, should we hold this until after the 23.2 branch cut and performance regression work, to give us a stable baseline for comparisons?
pkg/cmd/roachtest/cluster_test.go
Outdated
@@ -168,6 +168,10 @@ func TestClusterMachineType(t *testing.T) {
{"n2-standard-32", 32},
{"n2-standard-64", 64},
{"n2-standard-96", 96},
// GCE machine types
Stray comment? There is already a comment above for "GCE machine types".
Thanks, removing.
It might have a small bump on the cpu-bound workloads using smaller vCPU density (those are most likely |
If the effect would be small, then probably not. Otherwise, we'll have to directly compare the release roachperf graphs with each other to detect regressions; we can't simply look at the single graph for master. If we're OK with the extra manual work, then that's fine too.
Reviewed 13 of 13 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @renatolabs, @smg260, and @srosenberg)
-- commits
line 4 at r1:
silly nit: "some"
-- commits
line 26 at r1:
silly nit: "intended"
Right. My guess is that the effect is small, possibly negligible. Otherwise, I'll revert the change and postpone it until after branch cut.
Thanks for taking the time to make sure GCE and AWS are consistent!
}
if shouldSupportLocalSSD {
family = family + "d"
Can we simplify this function by only having one check at the end for if shouldSupportLocalSSD { family += "d" }?
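The suggestion above can be sketched as follows. This is a minimal illustration, not the PR's actual function: `familyFor` and its `highMem` parameter are hypothetical names, and only the `m6i`/`r6i` families mentioned in the PR description are used. The point is that the branches choose a base family and the local-SSD suffix is appended in exactly one place.

```go
package main

import "fmt"

// familyFor is a hypothetical helper (not the PR's real function). It picks a
// base AWS instance family first and appends the local-SSD suffix "d" once,
// at the end, instead of repeating the check in every branch.
func familyFor(highMem bool, shouldSupportLocalSSD bool) string {
	family := "m6i" // standard memory per CPU
	if highMem {
		family = "r6i" // high memory per CPU
	}
	// Single check for local SSD support, as suggested in the review.
	if shouldSupportLocalSSD {
		family += "d"
	}
	return family
}

func main() {
	fmt.Println(familyFor(false, false))
	fmt.Println(familyFor(false, true))
	fmt.Println(familyFor(true, true))
}
```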
pkg/roachprod/vm/gce/gcloud.go
Outdated
@@ -126,6 +127,8 @@ type jsonVM struct {
ProvisioningModel string
}
MachineType string
// CPU platform corresponding to machine type; see https://cloud.google.com/compute/docs/cpu-platforms
CpuPlatform string
Nit: probably more idiomatic to call this CPUPlatform: https://github.com/golang/go/wiki/CodeReviewComments#initialisms (and also more consistent with CPUArch and CPUFamily).
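As a small illustration of why the rename is safe for JSON decoding: Go's `encoding/json` matches object keys to struct field names case-insensitively when no exact match exists, so a field named `CPUPlatform` still receives a `cpuPlatform` key without a struct tag. The sketch below uses a hypothetical `vmInfo` struct standing in for `jsonVM`, and the JSON shape is an assumption about the gcloud output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// vmInfo is a trimmed-down, hypothetical stand-in for the jsonVM struct.
type vmInfo struct {
	MachineType string
	// CPUPlatform follows the Go initialism convention (CPU, not Cpu).
	CPUPlatform string
}

// parseVM decodes one VM description. Key matching is case-insensitive,
// so "cpuPlatform" fills CPUPlatform even without a `json:"..."` tag.
func parseVM(data []byte) (vmInfo, error) {
	var vm vmInfo
	err := json.Unmarshal(data, &vm)
	return vm, err
}

func main() {
	// Assumed input shape; only the key spelling matters for this example.
	raw := []byte(`{"machineType": "n2-standard-4", "cpuPlatform": "Intel Ice Lake"}`)
	vm, err := parseVM(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(vm.MachineType, vm.CPUPlatform)
}
```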
pkg/roachprod/vm/gce/gcloud.go
Outdated
if rand.Float64() < 0.75 {
zones = []string{defaultZones[0]}
} else {
zones = []string{defaultZones[1]}
This won't be great if we get unlucky and run a test that takes large backups and stores them in our backup testing buckets, which are only in us-east1 (not multiregion).
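To make the concern concrete, here is a rough sketch of the 75/25 split with the random draw passed in, so the behavior is deterministic under test. The order of `defaultZones` and the `pinPrimary` opt-out are assumptions added for illustration; neither is confirmed by the diff, and the PR ultimately dropped the random split altogether.

```go
package main

import (
	"fmt"
	"math/rand"
)

// defaultZones is assumed to list us-east1-b first; the real slice may differ.
var defaultZones = []string{"us-east1-b", "us-central1-b"}

// pickZone returns the first zone for draws below 0.75 and the second
// otherwise. pinPrimary is a hypothetical escape hatch for tests that depend
// on single-region resources, such as backup buckets located only in us-east1.
func pickZone(draw float64, pinPrimary bool) string {
	if pinPrimary || draw < 0.75 {
		return defaultZones[0]
	}
	return defaultZones[1]
}

func main() {
	// Roughly 25% of unpinned calls land in the second zone.
	fmt.Println(pickZone(rand.Float64(), false))
	// A pinned caller always gets the first zone.
	fmt.Println(pickZone(rand.Float64(), true))
}
```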
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @herkolategan, @renatolabs, and @smg260)
Previously, herkolategan (Herko Lategan) wrote…
silly nit: "some"
It's actually inteded (sic) to be same (sic), but I'll rephrase :) Basically, one roachtest executed in both clouds would have different multipliers.
Previously, herkolategan (Herko Lategan) wrote…
silly nit: "intended"
Fixing, thanks.
pkg/cmd/roachtest/spec/machine_type.go
line 115 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Can we simplify this function by only having one check at the end for if shouldSupportLocalSSD { family += "d" }?
Yep, good catch!
pkg/roachprod/vm/vm.go
line 52 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Would be nice to include some examples of what kinds of inputs this is known to work well: e.g., GCE tools, binary detection tool, etc.
Yep, good idea!
pkg/roachprod/vm/gce/gcloud.go
line 131 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: probably more idiomatic to call this CPUPlatform: https://github.com/golang/go/wiki/CodeReviewComments#initialisms (and also more consistent with CPUArch and CPUFamily).
I struggled with this one; CPUPlatform just hurts my eyes :) I'll comply with the more idiomatic convention.
pkg/roachprod/vm/gce/gcloud.go
line 973 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
This won't be great if we get unlucky and run a test that takes large backups and stores them in our backup testing buckets, which are only in us-east1 (not multiregion).
Yep, I am taking this out; to be dealt with in #111371
74bab9a to f89da86
if family == "c7g" && size == "24xlarge" {
family = "c6id"
// There is no m7gd.24xlarge, fall back to (c|m|r)6id.24xlarge.
if family == "m7gd" && size == "24xlarge" {
This should be m7g and m6i below.
Good catch, I shouldn't have hurried!
f89da86 to e506755
…l revert This is a backport of the "merged" diff of the following PRs: roachtest: harmonize GCE and AWS machine types cockroachdb#111140 roachtest: revert harmonize GCE and AWS machine types cockroachdb#111633 Release justification: test-only code, keeping roachtest in sync. Epic: none Release note: None
We should revisit this. It still causes the occasional test failure (e.g., #113279).
Yep, I'll prepare a new PR. We can wait until |
TBD [1] cockroachdb#111140 [2] cockroachdb#111633 Epic: none Fixes: cockroachdb#106570 Release note: None
Previously, the same (performance) roachtest executed in GCE and AWS may have used a different memory (per CPU) multiplier and/or cpu family, e.g., cascade lake vs ice lake. In the best case, this resulted in different performance baselines on an otherwise equivalent machine type. In the worst case, this resulted in OOMs due to VMs in AWS having 2x less memory per CPU.
This change harmonizes GCE and AWS machine types by making them as isomorphic as possible, wrt memory, cpu family and price. The following heuristics are used depending on specified MemPerCPU: Standard yields 4GB/cpu, High yields 8GB/cpu, Auto yields 4GB/cpu up to and including 16 vCPUs, then 2GB/cpu. Low is supported only in GCE. Consequently, n2-standard maps to m6i, n2-highmem maps to r6i, n2-custom maps to c6i, modulo local SSDs in which case m6id is used, etc. Note, we also force --gce-min-cpu-platform to Ice Lake; isomorphic AWS machine types are exclusively on Ice Lake.
Roachprod is extended to show cpu family and architecture on List. Cost estimation now correctly deals with custom machine types.
Note, this PR essentially resurrects [1], after it was reverted in [2]. Since [1], `SelectAzureMachineType` has been added. MemPerCPU is preserved across all three cloud providers. However, when mem is Auto (default) and cpus > 80, we switch to AMD Milan, both in GCE and AWS, but not Azure. (The latter doesn't support 2GB per AMD CPU.) For complete lists of machine types see `ExampleXXXMachineType`.
[1] cockroachdb#111140 [2] cockroachdb#111633
Epic: none
Fixes: cockroachdb#106570
Release note: None
117852: roachtest: harmonize GCE, AWS, Azure machine types r=renatolabs a=srosenberg
Previously, the same (performance) roachtest executed in GCE and AWS may have used a different memory (per CPU) multiplier and/or cpu family, e.g., cascade lake vs ice lake. In the best case, this resulted in different performance baselines on an otherwise equivalent machine type. In the worst case, this resulted in OOMs due to VMs in AWS having 2x less memory per CPU.
This change harmonizes GCE and AWS machine types by making them as isomorphic as possible, wrt memory, cpu family and price. The following heuristics are used depending on specified MemPerCPU: Standard yields 4GB/cpu, High yields 8GB/cpu, Auto yields 4GB/cpu up to and including 16 vCPUs, then 2GB/cpu. Low is supported only in GCE. Consequently, n2-standard maps to m6i, n2-highmem maps to r6i, n2-custom maps to c6i, modulo local SSDs in which case m6id is used, etc. Note, we also force --gce-min-cpu-platform to Ice Lake; isomorphic AWS machine types are exclusively on Ice Lake.
Roachprod is extended to show cpu family and architecture on List. Cost estimation now correctly deals with custom machine types.
Note, this PR essentially resurrects [1], after it was reverted in [2]. Since [1], `SelectAzureMachineType` has been added. MemPerCPU is preserved across all three cloud providers. However, when mem is Auto (default) and cpus > 80, we switch to AMD Milan, both in GCE and AWS, but not Azure. (The latter doesn't support 2GB per AMD CPU.) For complete lists of machine types see `ExampleXXXMachineType`.
[1] #111140 [2] #111633
Epic: none
Fixes: #106570
Release note: None
Co-authored-by: Stan Rosenberg <stan.rosenberg@gmail.com>
Previously, the same (performance) roachtest executed in GCE and AWS
may have used a different memory (per CPU) multiplier and/or
cpu family, e.g., cascade lake vs ice lake. In the best case,
this resulted in different performance baselines on an otherwise
equivalent machine type. In the worst case, this resulted in OOMs
due to VMs in AWS having 2x less memory per CPU.
This change harmonizes GCE and AWS machine types by making them
as isomorphic as possible, wrt memory, cpu family and price.
The following heuristics are used depending on specified `MemPerCPU`:
`Standard` yields 4GB/cpu, `High` yields 8GB/cpu, `Auto` yields 4GB/cpu
up to and including 16 vCPUs, then 2GB/cpu. `Low` is supported only in GCE.
Consequently, `n2-standard` maps to `m6i`, `n2-highmem` maps to `r6i`,
`n2-custom` maps to `c6i`, modulo local SSDs in which case `m6id` is
used, etc. Note, we also force `--gce-min-cpu-platform` to `Ice Lake`;
isomorphic AWS machine types are exclusively on `Ice Lake`.
Roachprod is extended to show cpu family and architecture on `List`.
Cost estimation now correctly deals with custom machine types.
Finally, we change the default zone allocation in GCE from exclusively
`us-east1-b` to ~25% `us-central1-b` and ~75% `us-east1-b`. This is
intended to balance the quotas for local SSDs until we eventually
switch to PD-SSDs.
Epic: none
Fixes: #106570
Release note: None
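The memory heuristics in the description can be sketched as a small function. This is an illustration under stated assumptions rather than the PR's code: the `MemPerCPU` constants and the `memGBPerCPU` helper are hypothetical names, and `Low` is omitted because the description only says it is supported in GCE without giving its ratio.

```go
package main

import "fmt"

// MemPerCPU mirrors the setting named in the description; these constants are
// illustrative and not necessarily the real definitions.
type MemPerCPU int

const (
	Auto MemPerCPU = iota
	Standard
	High
)

// memGBPerCPU returns the GB of memory per vCPU implied by the heuristics:
// Standard yields 4, High yields 8, and Auto yields 4 up to and including
// 16 vCPUs, then 2.
func memGBPerCPU(mem MemPerCPU, cpus int) int {
	switch mem {
	case Standard:
		return 4
	case High:
		return 8
	default: // Auto
		if cpus <= 16 {
			return 4
		}
		return 2
	}
}

func main() {
	for _, cpus := range []int{4, 16, 32} {
		perCPU := memGBPerCPU(Auto, cpus)
		fmt.Printf("Auto, %d vCPUs: %d GB/cpu, %d GB total\n", cpus, perCPU, perCPU*cpus)
	}
}
```

Under these rules an `Auto` 16-vCPU node gets 64GB in either cloud, while an `Auto` 32-vCPU node also gets 64GB because the ratio drops to 2GB/cpu.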