Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: harmonize GCE and AWS machine types #111140

Merged

Conversation

srosenberg
Copy link
Member

@srosenberg srosenberg commented Sep 22, 2023

Previously, same (performance) roachtest executed in GCE and AWS
may have used a different memory (per CPU) multiplier and/or
cpu family, e.g., cascade lake vs ice lake. In the best case,
this resulted in different performance baselines on an otherwise
equivalent machine type. In the worst case, this resulted in OOMs
due to VMs in AWS having 2x less memory per CPU.

This change harmozines GCE and AWS machine types by making them
as isomorphic as possible, wrt memory, cpu family and price.
The following heuristics are used depending on specified MemPerCPU:
Standard yields 4GB/cpu, High yields 8GB/cpu,
Auto yields 4GB/cpu up to and including 16 vCPUs, then 2GB/cpu.
Low is supported only in GCE.
Consequently, n2-standard maps to m6i, n2-highmem maps to r6i,
n2-custom maps to c6i, modulo local SSDs in which case m6id is
used, etc. Note, we also force --gce-min-cpu-platform to Ice Lake;
isomorphic AWS machine types are exclusively on Ice Lake.

Roachprod is extended to show cpu family and architecture on List.
Cost estimation now correctly deals with custom machine types.
Finally, we change the default zone allocation in GCE from exclusively
us-east1-b to ~25% us-central1-b and ~75% us-east1-b. This is
inteded to balance the quotas for local SSDs until we eventually
switch to PD-SSDs.

Epic: none
Fixes: #106570

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@srosenberg srosenberg force-pushed the sr/roachtest_harmonize_machine_types branch 2 times, most recently from 4e356ad to e478b80 Compare September 24, 2023 00:38
@srosenberg srosenberg requested review from renatolabs, herkolategan, smg260 and a team and removed request for a team September 24, 2023 01:03
@srosenberg srosenberg force-pushed the sr/roachtest_harmonize_machine_types branch from e478b80 to d315ea8 Compare September 24, 2023 02:24
@srosenberg srosenberg marked this pull request as ready for review September 24, 2023 02:24
@srosenberg srosenberg requested a review from a team as a code owner September 24, 2023 02:24
@srosenberg srosenberg added backport-22.2.x backport-23.1.x Flags PRs that need to be backported to 23.1 labels Sep 24, 2023
@srosenberg srosenberg removed the request for review from a team September 24, 2023 02:27
@srosenberg srosenberg changed the title roachtest: harmonize machine types roachtest: harmonize GCE and AWS machine types Sep 24, 2023
@srosenberg
Copy link
Member Author

CI smoke test in progress: SELECT_PROBABILITY=0.4

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIll the global Ice Lake bump affect roachperf benchmarks? If so, should we hold this until after the 23.2 branch cut and performance regression work, to give us a stable baseline for comparisons?

@@ -168,6 +168,10 @@ func TestClusterMachineType(t *testing.T) {
{"n2-standard-32", 32},
{"n2-standard-64", 64},
{"n2-standard-96", 96},
// GCE machine types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stray comment? There is already a comment above for "GCE machine types".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, removing.

@srosenberg
Copy link
Member Author

WIll the global Ice Lake bump affect roachperf benchmarks? If so, should we hold this until after the 23.2 branch cut and performance regression work, to give us a stable baseline for comparisons?

It might have a small bump on the cpu-bound workloads using smaller vCPU density (those are most likely cascade lake, previous generation.) Do you really think it would be that disruptive to hold until after 23.2? I was just planning to add an annotation, since the other improvements seem to outweigh the small (cpu) gain.

@erikgrinaker
Copy link
Contributor

If the effect would be small, them probably not. Otherwise, we'll have to directly compare the release roachperf graphs with each other to detect regressions, we can't simply look at the single graph for master -- but if we're ok with the extra manual work then that's fine too.

Copy link
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 13 of 13 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @renatolabs, @smg260, and @srosenberg)


-- commits line 4 at r1:
silly nit: "some"


-- commits line 26 at r1:
silly nit: "intended"

@srosenberg
Copy link
Member Author

If the effect would be small, them probably not. Otherwise, we'll have to directly compare the release roachperf graphs with each other to detect regressions, we can't simply look at the single graph for master -- but if we're ok with the extra manual work then that's fine too.

Right. My guess the effect is small, possibly negligible. Otherwise, I'll revert the change and postpone it until after branch cut.

Copy link
Collaborator

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to make sure GCE and AWS are consistent!


}
if shouldSupportLocalSSD {
family = family + "d"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simplify this function by only having one check at the end for if shouldSupportLocalSSH { family += "d" }?

@@ -126,6 +127,8 @@ type jsonVM struct {
ProvisioningModel string
}
MachineType string
// CPU platform corresponding to machine type; see https://cloud.google.com/compute/docs/cpu-platforms
CpuPlatform string
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: probably more idiomatic to call this CPUPlaform: https://github.com/golang/go/wiki/CodeReviewComments#initialisms (and also more consistent with CPUArch and CPUFamily).

pkg/roachprod/vm/vm.go Show resolved Hide resolved
if rand.Float64() < 0.75 {
zones = []string{defaultZones[0]}
} else {
zones = []string{defaultZones[1]}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't be great if we get unlucky and run a test that takes large backups and stores them in our backup testing buckets, which are only in us-east1 (not multiregion).

Copy link
Member Author

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @herkolategan, @renatolabs, and @smg260)


-- commits line 4 at r1:

Previously, herkolategan (Herko Lategan) wrote…

silly nit: "some"

It's actually inteded (sic) to be same (sic), but I'll rephrase :) Basically, one roachtest executed in both clouds would have different multipliers.


-- commits line 26 at r1:

Previously, herkolategan (Herko Lategan) wrote…

silly nit: "intended"

Fixing, thanks.


pkg/cmd/roachtest/spec/machine_type.go line 115 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Can we simplify this function by only having one check at the end for if shouldSupportLocalSSH { family += "d" }?

Yep, good catch!


pkg/roachprod/vm/vm.go line 52 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Would be nice to include some examples of what kinds of inputs this is known to work well: e.g., GCE tools, binary detection tool, etc.

Yep, good idea!


pkg/roachprod/vm/gce/gcloud.go line 131 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nit: probably more idiomatic to call this CPUPlaform: https://github.com/golang/go/wiki/CodeReviewComments#initialisms (and also more consistent with CPUArch and CPUFamily).

I struggled with this one; CPUPlatform just hurts my eyes :) I'll comply with the more idiomatic convention.


pkg/roachprod/vm/gce/gcloud.go line 973 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

This won't be great if we get unlucky and run a test that takes large backups and stores them in our backup testing buckets, which are only in us-east1 (not multiregion).

Yep, I am taking this out; to be dealt with in #111371

@srosenberg srosenberg force-pushed the sr/roachtest_harmonize_machine_types branch 2 times, most recently from 74bab9a to f89da86 Compare September 27, 2023 20:09
if family == "c7g" && size == "24xlarge" {
family = "c6id"
// There is no m7gd.24xlarge, fall back to (c|m|r)6id.24xlarge.
if family == "m7gd" && size == "24xlarge" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be m7g and m6i below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I shouldn't have hurried!

@srosenberg srosenberg force-pushed the sr/roachtest_harmonize_machine_types branch from f89da86 to e506755 Compare September 27, 2023 23:06
RaduBerinde pushed a commit to RaduBerinde/cockroach that referenced this pull request Nov 2, 2023
…l revert

This is a backport of the "merged" diff of the following PRs:
  roachtest: harmonize GCE and AWS machine types cockroachdb#111140
  roachtest: revert harmonize GCE and AWS machine types cockroachdb#111633

Release justification: test-only code, keeping roachtest in sync.
Epic: none
Release note: None
@renatolabs
Copy link
Collaborator

circle back after the 23.2 branch is cut?

We should revisit this. It still causes the occasional test failure (e.g., #113279).

@srosenberg
Copy link
Member Author

circle back after the 23.2 branch is cut?

We should revisit this. It still causes the occasional test failure (e.g., #113279).

Yep, I'll prepare a new PR. We can wait until 23.2.1 since it appears fairly imminent.

srosenberg added a commit to srosenberg/cockroach that referenced this pull request Jan 18, 2024
srosenberg added a commit to srosenberg/cockroach that referenced this pull request Jan 27, 2024
Previously, same (performance) roachtest executed in GCE and AWS
may have used a different memory (per CPU) multiplier and/or
cpu family, e.g., cascade lake vs ice lake. In the best case,
this resulted in different performance baselines on an otherwise
equivalent machine type. In the worst case, this resulted in OOMs
due to VMs in AWS having 2x less memory per CPU.

This change harmozines GCE and AWS machine types by making them
as isomorphic as possible, wrt memory, cpu family and price.
The following heuristics are used depending on specified MemPerCPU:
Standard yields 4GB/cpu, High yields 8GB/cpu,
Auto yields 4GB/cpu up to and including 16 vCPUs, then 2GB/cpu.
Low is supported only in GCE.
Consequently, n2-standard maps to m6i, n2-highmem maps to r6i,
n2-custom maps to c6i, modulo local SSDs in which case m6id is
used, etc. Note, we also force --gce-min-cpu-platform to Ice Lake;
isomorphic AWS machine types are exclusively on Ice Lake.

Roachprod is extended to show cpu family and architecture on List.
Cost estimation now correctly deals with custom machine types.

Note, this PR essentially resurrects [1], after it was reverted
in [2]. Since [1], `SelectAzureMachineType` has been added.
MemPerCPU is preserved across all three cloud providers.
However, when mem is Auto (default) and cpus > 80, we switch
to AMD Milan, both in GCE and AWS, but not Azure. (The latter
doesn't support 2GB per AMD CPU.)

For complete lists of machine types see `ExampleXXXMachineType`.

[1] cockroachdb#111140
[2] cockroachdb#111633

Epic: none
Fixes: cockroachdb#106570

Release note: None
srosenberg added a commit to srosenberg/cockroach that referenced this pull request Feb 8, 2024
Previously, same (performance) roachtest executed in GCE and AWS
may have used a different memory (per CPU) multiplier and/or
cpu family, e.g., cascade lake vs ice lake. In the best case,
this resulted in different performance baselines on an otherwise
equivalent machine type. In the worst case, this resulted in OOMs
due to VMs in AWS having 2x less memory per CPU.

This change harmozines GCE and AWS machine types by making them
as isomorphic as possible, wrt memory, cpu family and price.
The following heuristics are used depending on specified MemPerCPU:
Standard yields 4GB/cpu, High yields 8GB/cpu,
Auto yields 4GB/cpu up to and including 16 vCPUs, then 2GB/cpu.
Low is supported only in GCE.
Consequently, n2-standard maps to m6i, n2-highmem maps to r6i,
n2-custom maps to c6i, modulo local SSDs in which case m6id is
used, etc. Note, we also force --gce-min-cpu-platform to Ice Lake;
isomorphic AWS machine types are exclusively on Ice Lake.

Roachprod is extended to show cpu family and architecture on List.
Cost estimation now correctly deals with custom machine types.

Note, this PR essentially resurrects [1], after it was reverted
in [2]. Since [1], `SelectAzureMachineType` has been added.
MemPerCPU is preserved across all three cloud providers.
However, when mem is Auto (default) and cpus > 80, we switch
to AMD Milan, both in GCE and AWS, but not Azure. (The latter
doesn't support 2GB per AMD CPU.)

For complete lists of machine types see `ExampleXXXMachineType`.

[1] cockroachdb#111140
[2] cockroachdb#111633

Epic: none
Fixes: cockroachdb#106570

Release note: None
craig bot pushed a commit that referenced this pull request Feb 9, 2024
117852: roachtest: harmonize GCE, AWS, Azure machine types r=renatolabs a=srosenberg

Previously, same (performance) roachtest executed in GCE and AWS
may have used a different memory (per CPU) multiplier and/or
cpu family, e.g., cascade lake vs ice lake. In the best case,
this resulted in different performance baselines on an otherwise
equivalent machine type. In the worst case, this resulted in OOMs
due to VMs in AWS having 2x less memory per CPU.

This change harmozines GCE and AWS machine types by making them
as isomorphic as possible, wrt memory, cpu family and price.
The following heuristics are used depending on specified MemPerCPU:
Standard yields 4GB/cpu, High yields 8GB/cpu,
Auto yields 4GB/cpu up to and including 16 vCPUs, then 2GB/cpu.
Low is supported only in GCE.
Consequently, n2-standard maps to m6i, n2-highmem maps to r6i,
n2-custom maps to c6i, modulo local SSDs in which case m6id is
used, etc. Note, we also force --gce-min-cpu-platform to Ice Lake;
isomorphic AWS machine types are exclusively on Ice Lake.

Roachprod is extended to show cpu family and architecture on List.
Cost estimation now correctly deals with custom machine types.

Note, this PR essentially resurrects [1], after it was reverted
in [2]. Since [1], `SelectAzureMachineType` has been added.
MemPerCPU is preserved across all three cloud providers.
However, when mem is Auto (default) and cpus > 80, we switch
to AMD Milan, both in GCE and AWS, but not Azure. (The latter
doesn't support 2GB per AMD CPU.)

For complete lists of machine types see `ExampleXXXMachineType`.

[1] #111140
[2] #111633

Epic: none
Fixes: #106570

Release note: None

Co-authored-by: Stan Rosenberg <stan.rosenberg@gmail.com>
srosenberg added a commit to srosenberg/cockroach that referenced this pull request Feb 13, 2024
Previously, same (performance) roachtest executed in GCE and AWS
may have used a different memory (per CPU) multiplier and/or
cpu family, e.g., cascade lake vs ice lake. In the best case,
this resulted in different performance baselines on an otherwise
equivalent machine type. In the worst case, this resulted in OOMs
due to VMs in AWS having 2x less memory per CPU.

This change harmozines GCE and AWS machine types by making them
as isomorphic as possible, wrt memory, cpu family and price.
The following heuristics are used depending on specified MemPerCPU:
Standard yields 4GB/cpu, High yields 8GB/cpu,
Auto yields 4GB/cpu up to and including 16 vCPUs, then 2GB/cpu.
Low is supported only in GCE.
Consequently, n2-standard maps to m6i, n2-highmem maps to r6i,
n2-custom maps to c6i, modulo local SSDs in which case m6id is
used, etc. Note, we also force --gce-min-cpu-platform to Ice Lake;
isomorphic AWS machine types are exclusively on Ice Lake.

Roachprod is extended to show cpu family and architecture on List.
Cost estimation now correctly deals with custom machine types.

Note, this PR essentially resurrects [1], after it was reverted
in [2]. Since [1], `SelectAzureMachineType` has been added.
MemPerCPU is preserved across all three cloud providers.
However, when mem is Auto (default) and cpus > 80, we switch
to AMD Milan, both in GCE and AWS, but not Azure. (The latter
doesn't support 2GB per AMD CPU.)

For complete lists of machine types see `ExampleXXXMachineType`.

[1] cockroachdb#111140
[2] cockroachdb#111633

Epic: none
Fixes: cockroachdb#106570

Release note: None
cockroach-dev-inf pushed a commit that referenced this pull request Feb 14, 2024
Previously, same (performance) roachtest executed in GCE and AWS
may have used a different memory (per CPU) multiplier and/or
cpu family, e.g., cascade lake vs ice lake. In the best case,
this resulted in different performance baselines on an otherwise
equivalent machine type. In the worst case, this resulted in OOMs
due to VMs in AWS having 2x less memory per CPU.

This change harmozines GCE and AWS machine types by making them
as isomorphic as possible, wrt memory, cpu family and price.
The following heuristics are used depending on specified MemPerCPU:
Standard yields 4GB/cpu, High yields 8GB/cpu,
Auto yields 4GB/cpu up to and including 16 vCPUs, then 2GB/cpu.
Low is supported only in GCE.
Consequently, n2-standard maps to m6i, n2-highmem maps to r6i,
n2-custom maps to c6i, modulo local SSDs in which case m6id is
used, etc. Note, we also force --gce-min-cpu-platform to Ice Lake;
isomorphic AWS machine types are exclusively on Ice Lake.

Roachprod is extended to show cpu family and architecture on List.
Cost estimation now correctly deals with custom machine types.

Note, this PR essentially resurrects [1], after it was reverted
in [2]. Since [1], `SelectAzureMachineType` has been added.
MemPerCPU is preserved across all three cloud providers.
However, when mem is Auto (default) and cpus > 80, we switch
to AMD Milan, both in GCE and AWS, but not Azure. (The latter
doesn't support 2GB per AMD CPU.)

For complete lists of machine types see `ExampleXXXMachineType`.

[1] #111140
[2] #111633

Epic: none
Fixes: #106570

Release note: None
srosenberg added a commit to srosenberg/cockroach that referenced this pull request Feb 16, 2024
Previously, same (performance) roachtest executed in GCE and AWS
may have used a different memory (per CPU) multiplier and/or
cpu family, e.g., cascade lake vs ice lake. In the best case,
this resulted in different performance baselines on an otherwise
equivalent machine type. In the worst case, this resulted in OOMs
due to VMs in AWS having 2x less memory per CPU.

This change harmozines GCE and AWS machine types by making them
as isomorphic as possible, wrt memory, cpu family and price.
The following heuristics are used depending on specified MemPerCPU:
Standard yields 4GB/cpu, High yields 8GB/cpu,
Auto yields 4GB/cpu up to and including 16 vCPUs, then 2GB/cpu.
Low is supported only in GCE.
Consequently, n2-standard maps to m6i, n2-highmem maps to r6i,
n2-custom maps to c6i, modulo local SSDs in which case m6id is
used, etc. Note, we also force --gce-min-cpu-platform to Ice Lake;
isomorphic AWS machine types are exclusively on Ice Lake.

Roachprod is extended to show cpu family and architecture on List.
Cost estimation now correctly deals with custom machine types.

Note, this PR essentially resurrects [1], after it was reverted
in [2]. Since [1], `SelectAzureMachineType` has been added.
MemPerCPU is preserved across all three cloud providers.
However, when mem is Auto (default) and cpus > 80, we switch
to AMD Milan, both in GCE and AWS, but not Azure. (The latter
doesn't support 2GB per AMD CPU.)

For complete lists of machine types see `ExampleXXXMachineType`.

[1] cockroachdb#111140
[2] cockroachdb#111633

Epic: none
Fixes: cockroachdb#106570

Release note: None
wenyihu6 pushed a commit to wenyihu6/cockroach that referenced this pull request Feb 21, 2024
Previously, same (performance) roachtest executed in GCE and AWS
may have used a different memory (per CPU) multiplier and/or
cpu family, e.g., cascade lake vs ice lake. In the best case,
this resulted in different performance baselines on an otherwise
equivalent machine type. In the worst case, this resulted in OOMs
due to VMs in AWS having 2x less memory per CPU.

This change harmozines GCE and AWS machine types by making them
as isomorphic as possible, wrt memory, cpu family and price.
The following heuristics are used depending on specified MemPerCPU:
Standard yields 4GB/cpu, High yields 8GB/cpu,
Auto yields 4GB/cpu up to and including 16 vCPUs, then 2GB/cpu.
Low is supported only in GCE.
Consequently, n2-standard maps to m6i, n2-highmem maps to r6i,
n2-custom maps to c6i, modulo local SSDs in which case m6id is
used, etc. Note, we also force --gce-min-cpu-platform to Ice Lake;
isomorphic AWS machine types are exclusively on Ice Lake.

Roachprod is extended to show cpu family and architecture on List.
Cost estimation now correctly deals with custom machine types.

Note, this PR essentially resurrects [1], after it was reverted
in [2]. Since [1], `SelectAzureMachineType` has been added.
MemPerCPU is preserved across all three cloud providers.
However, when mem is Auto (default) and cpus > 80, we switch
to AMD Milan, both in GCE and AWS, but not Azure. (The latter
doesn't support 2GB per AMD CPU.)

For complete lists of machine types see `ExampleXXXMachineType`.

[1] cockroachdb#111140
[2] cockroachdb#111633

Epic: none
Fixes: cockroachdb#106570

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: kv0/enc=false/nodes=1/size=64kb/conc=4096 failed
6 participants