Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new default values to the Guardian flags network-pool and max-containers #6031

Merged
merged 4 commits into from Sep 21, 2020

Conversation

muntac
Copy link
Contributor

@muntac muntac commented Sep 1, 2020

What does this PR accomplish?

Feature

Part of #5985

Changes proposed by this PR:

If --network-pool is unset Guardian defaults to 10.80.0.0/22 which allows 1024 addresses, implicitly limiting a
Guardian worker to 250 containers (4 addresses reserved per container). This is true even if max-containers is set to
a higher value.

We set network-pool to 10.80.0.0/16 to increase the allowable addresses significantly ensuring the container
count is only limited by the explicit max-containers flag. If a user passes a custom value we use that instead.

Update

  • Exposing both flags as options in the worker command now.
  • Defaulting the maximum container limit to 250 to avoid stampeding herd problem.
  • Defaulting containerd max-container value to do the same.

More details on reasoning in an issue comment.

Notes to reviewer:

Inspecting processes running on a Guardian worker without having set CONCOURSE_GARDEN_NETWORK_POOL shows the following.
Screen Shot 2020-09-01 at 5 01 18 PM

When passing a custom value of 10.80.0.0/20:
Screen Shot 2020-09-01 at 5 05 17 PM

Related PRs
concourse/concourse-chart#150
concourse/concourse-bosh-release#121

Release Note

Contributor Checklist

Reviewer Checklist

  • Code reviewed
  • Tests reviewed
  • Documentation reviewed
  • Release notes reviewed
  • PR acceptance performed
  • New config flags added? Ensure that they are added to the
    BOSH and
    Helm packaging; otherwise, ignored for
    the integration
    tests

    (for example, if they are Garden configs that are not displayed in the
    --help text).

If network-pool is unset Guardian defaults to 10.80.0.0/22 which allows 1024 addresses implicitly limiting a
Guardian worker to 250 containers (4 addresses per container). This is true even if max-containers is set to
a higher value.
We set network-pool to 10.80.0.0/16 to increase the allowable addresses significantly ensuring the container
count is only limited by the explicit max-containers flag.

Signed-off-by: Muntasir Chowdhury <mchowdhury@pivotal.io>
muntac added a commit to concourse/concourse-bosh-release that referenced this pull request Sep 1, 2020
CONCOURSE_GARDEN_ALLOW_HOST_ACCESS will have a value of "false"
even without a default value being provided as it is represented
as a bool in the Guardian codebase.

The recommendation to have CONCOURSE_GARDEN_MAX_CONTAINERS at 250
was based off CloudFoundry's Diego which also uses Guardian.
Concourse deployments have been operated at a much higher limit
and do not need to be held to this limit.

The binary will set CONCOURSE_GARDEN_NETWORK_POOL to a default value of
10.80.0.0/16 as of concourse/concourse#6031.

concourse/concourse#5985

Signed-off-by: Muntasir Chowdhury <mchowdhury@pivotal.io>
@xtreme-sameer-vohra
Copy link
Contributor

Would it make sense to pull this in into GuardianRuntime so its more explicit ?

@chenbh
Copy link
Contributor

chenbh commented Sep 3, 2020

@xtreme-sameer-vohra I would say we should only move it there if we also expose max-containers for Guardian as well.

An alternative would be for us to document a list of commonly used Guardian flags like max-containers and network-pool somewhere (either in code, CLI flags, or docs site)

@muntac
Copy link
Contributor Author

muntac commented Sep 8, 2020

@xtreme-sameer-vohra @chenbh I believe Guardian specific flags were removed from the CLI listing i.e. Guardian Runtime for easier reasoning about the runtimes. The flags present under GuardianRuntime currently are Concourse specific, and all Guardian specific ones have to be looked up in the Guardian documentation. So I would prefer to note the default values in the docs site. Let me know your thoughts.

I posted a question in the BOSH release PR about determining the default value of max-containers that we would like to put here.

@xtreme-sameer-vohra I was unsure of what we should consider an average case upper limit with workload and machine variability taken into consideration.

I had thought we could have it unset initially and wait for feedback from users to see around what limits they start seeing problems. I haven't been able to find any past issues describing 250 being selected as the limit due to user feedback about performance. Mostly see stuff around the insufficient subnet problem, or hitting the max container limit.

Have you seen users run into performance problems where the issue was solved by limiting the number of containers?

@muntac
Copy link
Contributor Author

muntac commented Sep 8, 2020

I put in 250 for now, and noted in #6043 to test for new limits when we're doing drills for containerd vs guardian.

@xtreme-sameer-vohra
Copy link
Contributor

@xtreme-sameer-vohra @chenbh I believe Guardian specific flags were removed from the CLI listing i.e. Guardian Runtime for easier reasoning about the runtimes. The flags present under GuardianRuntime currently are Concourse specific, and all Guardian specific ones have to be looked up in the Guardian documentation. So I would prefer to note the default values in the docs site. Let me know your thoughts.

I posted a question in the BOSH release PR about determining the default value of max-containers that we would like to put here.

@xtreme-sameer-vohra I was unsure of what we should consider an average case upper limit with workload and machine variability taken into consideration.
I had thought we could have it unset initially and wait for feedback from users to see around what limits they start seeing problems. I haven't been able to find any past issues describing 250 being selected as the limit due to user feedback about performance. Mostly see stuff around the insufficient subnet problem, or hitting the max container limit.
Have you seen users run into performance problems where the issue was solved by limiting the number of containers?

It would be useful to document the default values in the binary concourse worker -h and the docs site, just so a user isn't surprised. I can see it becomes simpler to reason about Concourse specific flags (in Garden Runtime) and generic Garden flags(in Garden docs) with your distinction :)

Setting default limit to 250 as this is the only default value for Guardian that has
been widely tested against Concourse. This can be increased in the future with more
thorough testing with production workloads. #5985

Signed-off-by: Muntasir Chowdhury <mchowdhury@pivotal.io>
muntac added a commit to concourse/concourse-bosh-release that referenced this pull request Sep 14, 2020
CONCOURSE_GARDEN_ALLOW_HOST_ACCESS will have a value of "false"
even without a default value being provided as it is represented
as a bool in the Guardian codebase.

The recommendation to have CONCOURSE_GARDEN_MAX_CONTAINERS at 250
was based off CloudFoundry's Diego which also uses Guardian.
Concourse deployments have been operated at a much higher limit
and do not need to be held to this limit.

The binary will set CONCOURSE_GARDEN_NETWORK_POOL to a default value of
10.80.0.0/16 as of concourse/concourse#6031.

concourse/concourse#5985

Signed-off-by: Muntasir Chowdhury <mchowdhury@pivotal.io>
muntac added a commit to concourse/concourse-chart that referenced this pull request Sep 14, 2020
concourse/concourse#6031 exposes Guardian's max-containers and network-pool
as part of the worker command's options. This change keeps the chart and
binary in sync.

concourse/concourse#5985

Signed-off-by: Muntasir Chowdhury <mchowdhury@pivotal.io>
@muntac muntac changed the title Change default value for Guardian network-pool Add new default values Guardian network-pool and max-containers Sep 14, 2020
Limiting this to 250 similar to Guardian to avoid
stampeding herd problem. With stress testing in the future
this can be updated to a more accurate value.

issue#5985

Signed-off-by: Muntasir Chowdhury <mchowdhury@pivotal.io>
Copy link
Contributor

@xtreme-sameer-vohra xtreme-sameer-vohra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey,
Looks like the wiring logic got dropped in the refactor. The two values aren't passed to gdn at the moment. They're just exposed in Concourse.

@muntac
Copy link
Contributor Author

muntac commented Sep 17, 2020

Hey,
Looks like the wiring logic got dropped in the refactor. The two values aren't passed to gdn at the moment. They're just exposed in Concourse.

Seeing the same :( Will let you know when I have a fix.

These flags were in the GuardianRuntime struct but they weren't being
detected by the detectGuardianFlags() function. Our theory is that the
go-flags package is unsetting these environment variables when parsing
them. Therefore when os.Environ() is called by detectGuardianFlags() the
GuardianRuntime struct flags are not found.

#5985

Signed-off-by: Muntasir Chowdhury <mchowdhury@pivotal.io>
Co-authored-by: Taylor Silva <tsilva@pivotal.io>
@muntac
Copy link
Contributor Author

muntac commented Sep 17, 2020

@xtreme-sameer-vohra fyi @taylorsilva and I pushed a fix but all the tests are breaking because of an elm package whose signature has changed at source.

@taylorsilva
Copy link
Member

The elm stuff fixed itself. I'm rerunning all test suites.

Copy link
Contributor

@xtreme-sameer-vohra xtreme-sameer-vohra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm !!

@muntac muntac merged commit 64a12a7 into master Sep 21, 2020
@muntac muntac deleted the default-pool branch September 21, 2020 13:46
@clarafu clarafu added the release/undocumented This didn't warrant being documented or put in release notes. label Sep 22, 2020
@clarafu clarafu changed the title Add new default values Guardian network-pool and max-containers Add new default values to the Guardian flags network-pool and max-containers Oct 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement release/undocumented This didn't warrant being documented or put in release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants