split log-cache from doppler, use syslog ingress #949

Merged
ctlong merged 7 commits into cloudfoundry:develop from separate-log-cache on Feb 17, 2022

Conversation

@mkocher (Member) commented Feb 1, 2022

Making this change for a few reasons:

  • The scaling needs of dopplers and log-cache are often different, so
    grouping them together can be problematic. Dopplers are limited to ~40
    instances and some high traffic foundations need larger log-cache
    instance groups.
  • Syslog ingress removes the work dopplers and traffic controllers do to
    get envelopes to log-cache. It slightly increases the load on diego
    cells and eliminates significant load on dopplers/traffic controllers.

After deploying this change, it's recommended to evaluate the memory
allocated to doppler nodes, switch them to compute-heavy instances, and
deploy log-cache to high-memory instances.
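
As an illustration only (not part of this PR), that instance-type change could be expressed as an operator-maintained ops-file along these lines; the vm_type names are assumptions and must match entries in your cloud config:

```
# Hypothetical operator ops-file: move doppler to a compute-heavy vm_type and
# the new log-cache instance group to a high-memory one. The vm_type names are
# placeholders that must exist in your cloud config.
- type: replace
  path: /instance_groups/name=doppler/vm_type
  value: compute-heavy
- type: replace
  path: /instance_groups/name=log-cache/vm_type
  value: high-memory
```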

Please take a moment to review the questions before submitting the PR

Has a cf-deployment including this change passed cf-acceptance-tests?

  • YES
  • NO

Does this PR introduce a breaking change? Please take a moment to read through the examples before answering the question.

  • YES - please choose the category from below. Feel free to provide additional details.
  • NO

Types of breaking changes:
2. increases VM footprint of cf-deployment - e.g. new jobs, new add ons, increases # of instances etc.
3. modifies, deletes or moves the name of a job or instance group in the main manifest
4. modifies the name or deletes a property of a job or instance group in the main manifest

How should this change be described in cf-deployment release notes?

Something brief that conveys the change and is written with the persona (Alana, Cody...) in mind. See previous release notes for examples.

Log Cache is now deployed separately from Doppler on its own instance group. Operators should consider scaling down the memory on Doppler nodes and using high-memory Log Cache nodes. Operators should notice reduced CPU usage on Doppler & Traffic Controller and a slight increase in CPU usage by the Syslog Forwarder.

Does this PR introduce a new BOSH release into the base cf-deployment.yml manifest or any ops-files?

  • YES - please specify
  • NO

Does this PR make a change to an experimental or GA'd feature/component?

  • experimental feature/component
  • GA'd feature/component

Please provide Acceptance Criteria for this change?

bosh vms should show a log-cache instance group.
bosh ssh log-cache should reach a VM running log-cache and its associated processes.
bosh ssh doppler followed by sudo monit summary should show a VM that is no longer running log-cache.

What is the level of urgency for publishing this change?

  • Urgent - unblocks current or future work
  • Slightly Less than Urgent

Tag your pair, your PM, and/or team!

@rroberts2222

@cf-rel-int-status-bot

Hello friend, it looks like your pull request has failed one or more of our checks. Please take a look! 👀

@cf-rel-int-status-bot

Hello friend, it looks like your pull request has failed one or more of our checks. Please take a look! 👀

@mkocher marked this pull request as ready for review February 2, 2022 17:52
They didn't seem to be used and would need to be updated to work with
the separate log cache instance group.
@davewalter (Member) left a comment

Thanks @mkocher & @rroberts2222. This looks good overall. I just had one change to request regarding the deprecated experimental ops-files.

We had made these ops-files no-ops in an earlier commit; here we are removing them.
@davewalter (Member) left a comment

LGTM

@ctlong added this to To do in CF-Deployment Feb 3, 2022
@mkocher marked this pull request as draft February 3, 2022 20:01
@mkocher (Member, Author) commented Feb 3, 2022

Noticed that cc defaults to using doppler as the stats server, which is breaking stats. We'll push a fix.

@ctlong (Member) left a comment

Some other operators may not want to move to the syslog ingress model, and to accommodate them it is probably a good idea to include some ops files that would restore the previous log-cache-nozzle / RLP ingress.

CF-Deployment automation moved this from Review in Progress to Approved Feb 10, 2022
@cf-rel-int-status-bot

Hello friend, it looks like your pull request has failed one or more of our checks. Please take a look! 👀

CF-Deployment automation moved this from Approved to Review in Progress Feb 10, 2022
@ctlong (Member) left a comment

Need to update the ops-file tests.

CF-Deployment automation moved this from Review in Progress to Approved Feb 10, 2022
@ctlong merged commit 10395d9 into cloudfoundry:develop Feb 17, 2022
CF-Deployment automation moved this from Approved to Done Feb 17, 2022
sweinstein22 added a commit to cloudfoundry/capi-ci that referenced this pull request Feb 24, 2022
…pler

As part of the latest cf-deployment, [log-cache is no longer nested under doppler](cloudfoundry/cf-deployment#949)

Pipeline failure caused by the change:
```
operation [0] in ops-files/log-cache-reduce-memory.yml failed': Expected to find exactly one matching array item for path '/instance_groups/name=doppler/jobs/name=log-cache' but found 0
```
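
The capi-ci fix itself isn't shown here, but after this PR an ops-file like that has to target the new log-cache instance group instead of doppler. A minimal sketch of the corrected path follows; the memory_limit_percent property and its value are assumptions used purely for illustration:

```
# Hypothetical sketch: the log-cache job now lives in its own instance group,
# so the old path under doppler no longer matches anything. Property name and
# value are illustrative placeholders.
- type: replace
  path: /instance_groups/name=log-cache/jobs/name=log-cache/properties/memory_limit_percent?
  value: 50
```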
@acrmp mentioned this pull request Mar 5, 2022
ctlong pushed a commit that referenced this pull request Mar 5, 2022
- In #949 Log Cache was split out from the doppler instance group to its
  own log-cache instance group
- Log Cache was also configured to use syslog ingress by default, rather
  than the previous behaviour which was to use the Reverse Log Proxy
- Operators who had previously used the experimental ops-file to opt into
  syslog ingress (operations/experimental/use-logcache-syslog-ingress.yml)
  would already have had the `log_cache_syslog_tls` credential in their
  CredHub
- When these operators attempted to upgrade to v18.0.0 the certificate
  was not re-generated by default, leading to a mismatch between the new
  service name and the existing certificate
- Specify `update_mode: converge` so that the certificate is re-generated
  and the syslog agent will be able to send logs to the log cache syslog
  server

Fixes:

```
failed to write to log-cache.service.cf.internal:6067, retrying in 8.192s, err: x509: certificate is valid for q-s3.doppler.default.cf.bosh, doppler.service.cf.internal, not log-cache.service.cf.internal
```
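
For illustration, a minimal sketch of a variable definition with update_mode: converge follows. The variable name, the update_mode setting, and the log-cache.service.cf.internal name come from the commit message and error above; the CA name and the remaining options are assumptions, so check your actual deployment manifest:

```
# Hedged sketch: force the log_cache_syslog_tls certificate to be regenerated
# so it is valid for the new log-cache service name.
variables:
- name: log_cache_syslog_tls
  type: certificate
  update_mode: converge        # regenerate on the next deploy
  options:
    ca: loggregator_ca         # assumed CA name
    common_name: log-cache.service.cf.internal
    alternative_names:
    - log-cache.service.cf.internal
```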
@ctlong (Member) commented Mar 30, 2022

This PR can cause significant log downtime if operators are not prepared for it when they first upgrade to a release it's shipped in (v18.0.0+).

There has been a lot of conversation in Slack about this, so I wanted to repost some of it here for posterity. To minimize log downtime, operators should deploy twice: once to make the new log cache service alias available to every VM, and again to actually deploy the change. The deployment order should be something along the lines of:

  1. New Log Cache VMs
  2. Scheduler VM: lets syslog agents in Diego Cells know to send logs/metrics to the new Log Cache (logs/metrics will still be sent to old log cache as long as syslog ingress to log cache was not already enabled)
  3. API VMs: tells CAPI to use the new Log Cache rather than the old one
  4. Doppler VMs: gets rid of the old log cache

To be safe, it's recommended to initially scale your new Log Cache up to the same instance count as your previous Doppler VMs, and then to look at metrics after the deploy in order to scale down both your Log Cache and your Doppler footprint.
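
As a rough sketch of that interim scaling step (not taken from the PR), an operator-side ops-file could pin the new instance group's count; the value below is a placeholder for whatever your previous Doppler count was:

```
# Hypothetical operator ops-file: start the new log-cache instance group at the
# same instance count previously used for doppler, then scale down after
# reviewing metrics. The count is a placeholder.
- type: replace
  path: /instance_groups/name=log-cache/instances
  value: 4
```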

@acrmp (Member) commented Apr 26, 2022

Note that with this change the syslog agents running on the diego cells and other VMs need to be able to talk to the Log Cache Syslog Server on port 6067. Operators that are running diego cells within isolation segments may have to adjust their firewall rules.

@acrmp deleted the separate-log-cache branch April 26, 2022 19:22
winkingturtle-vmw added a commit to cloudfoundry/diego-release that referenced this pull request Feb 23, 2023
to SendSpikeMetrics with EmitTimer instead of EmitGauge

This is to mitigate an issue that started to happen when we started using
syslog-ingress

Context:
[cloudfoundry/cf-deployment#949](cloudfoundry/cf-deployment#949)
winkingturtle-vmw added a commit to cloudfoundry/diego-release that referenced this pull request Mar 1, 2023
to SendSpikeMetrics with EmitTimer instead of EmitGauge

This is to mitigate an issue that started to happen when we started using
syslog-ingress

Context:
[cloudfoundry/cf-deployment#949](cloudfoundry/cf-deployment#949)
mariash pushed a commit to cloudfoundry/diego-release that referenced this pull request Mar 6, 2023
to SendSpikeMetrics with EmitTimer instead of EmitGauge

This is to mitigate an issue that started to happen when we started using
syslog-ingress

Context:
[cloudfoundry/cf-deployment#949](cloudfoundry/cf-deployment#949)