split log-cache from doppler, use syslog ingress #949

Merged
ctlong merged 7 commits into cloudfoundry:develop from separate-log-cache on Feb 17, 2022

Conversation

@mkocher (Member) commented Feb 1, 2022

Making this change for a few reasons:

  • The scaling needs of dopplers and log-cache are often different, so
    grouping them together can be problematic. Dopplers are limited to ~40
    instances and some high traffic foundations need larger log-cache
    instance groups.
  • Syslog ingress removes the work dopplers and traffic controllers do to
    get envelopes to log-cache. It slightly increases the load on diego
    cells and eliminates significant load on dopplers/traffic controllers.

After deploying this change, it's recommended to evaluate the memory
allocated to doppler nodes, switch them to compute-heavy instances, and
deploy log-cache to high-memory instances.
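
As an illustration only (not part of this PR), that instance-type change could be expressed as an operator-maintained ops-file along these lines; the vm_type names are assumptions and must match entries in your cloud config:

```
# Hypothetical operator ops-file: move doppler to a compute-heavy vm_type and
# the new log-cache instance group to a high-memory one. The vm_type names are
# placeholders that must exist in your cloud config.
- type: replace
  path: /instance_groups/name=doppler/vm_type
  value: compute-heavy
- type: replace
  path: /instance_groups/name=log-cache/vm_type
  value: high-memory
```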

Please take a moment to review the questions before submitting the PR

Has a cf-deployment including this change passed cf-acceptance-tests?

  • YES
  • NO

Does this PR introduce a breaking change? Please take a moment to read through the examples before answering the question.

  • YES - please choose the category from below. Feel free to provide additional details.
  • NO

Types of breaking changes:
2. increases VM footprint of cf-deployment - e.g. new jobs, new add ons, increases # of instances etc.
3. modifies, deletes or moves the name of a job or instance group in the main manifest
4. modifies the name or deletes a property of a job or instance group in the main manifest

How should this change be described in cf-deployment release notes?

Something brief that conveys the change and is written with the persona (Alana, Cody...) in mind. See previous release notes for examples.

Log Cache is now deployed separately from Doppler on its own instance group. Operators should consider scaling down the memory on Doppler nodes and using high-memory Log Cache nodes. Operators should notice reduced CPU usage on Doppler & Traffic Controller and a slight increase in CPU usage by the Syslog Forwarder.

Does this PR introduce a new BOSH release into the base cf-deployment.yml manifest or any ops-files?

  • YES - please specify
  • NO

Does this PR make a change to an experimental or GA'd feature/component?

  • experimental feature/component
  • GA'd feature/component

Please provide Acceptance Criteria for this change?

bosh vms should show a log-cache instance group.
bosh ssh log-cache should reach a VM running log-cache and its associated processes.
bosh ssh doppler followed by sudo monit summary should show a VM that is no longer running log-cache.

What is the level of urgency for publishing this change?

  • Urgent - unblocks current or future work
  • Slightly Less than Urgent

Tag your pair, your PM, and/or team!

@rroberts2222

@cf-rel-int-status-bot

Hello friend, it looks like your pull request has failed one or more of our checks. Please take a look! 👀

@cf-rel-int-status-bot

Hello friend, it looks like your pull request has failed one or more of our checks. Please take a look! 👀

@mkocher marked this pull request as ready for review February 2, 2022 17:52
They didn't seem to be used and would need to be updated to work with
the separate log cache instance group.
@davewalter (Member) left a comment

Thanks @mkocher & @rroberts2222. This looks good overall. I just had one change to request regarding the deprecated experimental ops-files.

We had made these ops-files no-ops in an earlier commit; here we are removing them.
@davewalter (Member) left a comment

LGTM

@ctlong added this to To do in CF-Deployment Feb 3, 2022
@mkocher marked this pull request as draft February 3, 2022 20:01
@mkocher (Member, Author) commented Feb 3, 2022

Noticed that cc defaults to using doppler as the stats server, which is breaking stats. We'll push a fix.

@ctlong (Member) left a comment

Some other operators may not want to move to the syslog ingress model, and to accommodate them it is probably a good idea to include some ops files that would restore the previous log-cache-nozzle / RLP ingress.

CF-Deployment automation moved this from Review in Progress to Approved Feb 10, 2022
@cf-rel-int-status-bot

Hello friend, it looks like your pull request has failed one or more of our checks. Please take a look! 👀

CF-Deployment automation moved this from Approved to Review in Progress Feb 10, 2022
@ctlong (Member) left a comment

Need to update the ops-file tests.

CF-Deployment automation moved this from Review in Progress to Approved Feb 10, 2022
@ctlong merged commit 10395d9 into cloudfoundry:develop Feb 17, 2022
CF-Deployment automation moved this from Approved to Done Feb 17, 2022
sweinstein22 added a commit to cloudfoundry/capi-ci that referenced this pull request Feb 24, 2022
…pler

As part of the latest cf-deployment, [log-cache is no longer nested under doppler](cloudfoundry/cf-deployment#949)

Pipeline failure caused by the change:
```
operation [0] in ops-files/log-cache-reduce-memory.yml failed': Expected to find exactly one matching array item for path '/instance_groups/name=doppler/jobs/name=log-cache' but found 0
```
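
The capi-ci fix itself isn't shown here, but after this PR an ops-file like that has to target the new log-cache instance group instead of doppler. A minimal sketch of the corrected path follows; the memory_limit_percent property and its value are assumptions used purely for illustration:

```
# Hypothetical sketch: the log-cache job now lives in its own instance group,
# so the old path under doppler no longer matches anything. Property name and
# value are illustrative placeholders.
- type: replace
  path: /instance_groups/name=log-cache/jobs/name=log-cache/properties/memory_limit_percent?
  value: 50
```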
@acrmp mentioned this pull request Mar 5, 2022
ctlong pushed a commit that referenced this pull request Mar 5, 2022
- In #949 Log Cache was split out from the doppler instance group to its
  own log-cache instance group
- Log Cache was also configured to use syslog ingress by default, rather
  than the previous behaviour which was to use the Reverse Log Proxy
- Operators who had previously used the experimental ops-file to opt into
  syslog ingress (operations/experimental/use-logcache-syslog-ingress.yml)
  would already have had the `log_cache_syslog_tls` credential in their
  CredHub
- When these operators attempted to upgrade to v18.0.0 the certificate
  was not re-generated by default, leading to a mismatch between the new
  service name and the existing certificate
- Specify `update_mode: converge` so that the certificate is re-generated
  and the syslog agent will be able to send logs to the log cache syslog
  server

Fixes:

```
failed to write to log-cache.service.cf.internal:6067, retrying in 8.192s, err: x509: certificate is valid for q-s3.doppler.default.cf.bosh, doppler.service.cf.internal, not log-cache.service.cf.internal
```
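
For illustration, a minimal sketch of a variable definition with update_mode: converge follows. The variable name, the update_mode setting, and the log-cache.service.cf.internal name come from the commit message and error above; the CA name and the remaining options are assumptions, so check your actual deployment manifest:

```
# Hedged sketch: force the log_cache_syslog_tls certificate to be regenerated
# so it is valid for the new log-cache service name.
variables:
- name: log_cache_syslog_tls
  type: certificate
  update_mode: converge        # regenerate on the next deploy
  options:
    ca: loggregator_ca         # assumed CA name
    common_name: log-cache.service.cf.internal
    alternative_names:
    - log-cache.service.cf.internal
```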
@ctlong (Member) commented Mar 30, 2022

This PR can cause significant log downtime if operators are not prepared for it when they first upgrade to a release it's shipped in (v18.0.0+).

There has been a lot of conversation in Slack about this, so I wanted to repost some of it here for posterity. To minimize log downtime, operators should deploy twice: once to make the new log cache service alias available to every VM, and again to actually deploy the change. The deployment order should be something along the lines of:

  1. New Log Cache VMs
  2. Scheduler VM: lets syslog agents in Diego Cells know to send logs/metrics to the new Log Cache (logs/metrics will still be sent to old log cache as long as syslog ingress to log cache was not already enabled)
  3. API VMs: tells CAPI to use the new Log Cache rather than the old one
  4. Doppler VMs: gets rid of the old log cache

To be safe, it's recommended to initially scale your new Log Cache up to the same instance count as your previous Doppler VMs, and then to look at metrics after the deploy in order to scale down both your Log Cache and your Doppler footprint.
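
As a rough sketch of that interim scaling step (not taken from the PR), an operator-side ops-file could pin the new instance group's count; the value below is a placeholder for whatever your previous Doppler count was:

```
# Hypothetical operator ops-file: start the new log-cache instance group at the
# same instance count previously used for doppler, then scale down after
# reviewing metrics. The count is a placeholder.
- type: replace
  path: /instance_groups/name=log-cache/instances
  value: 4
```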

@acrmp (Member) commented Apr 26, 2022

Note that with this change the syslog agents running on the diego cells and other VMs need to be able to talk to the Log Cache Syslog Server on port 6067. Operators that are running diego cells within isolation segments may have to adjust their firewall rules.

@acrmp deleted the separate-log-cache branch April 26, 2022 19:22
winkingturtle-vmw added a commit to cloudfoundry/diego-release that referenced this pull request Feb 23, 2023
to SendSpikeMetrics with EmitTimer instead of EmitGauge

This is to mitigate an issue that started to happen when we started using
syslog-ingress

Context:
[cloudfoundry/cf-deployment#949](cloudfoundry/cf-deployment#949)
winkingturtle-vmw added a commit to cloudfoundry/diego-release that referenced this pull request Mar 1, 2023
to SendSpikeMetrics with EmitTimer instead of EmitGauge

This is to mitigate an issue that started to happen when we started using
syslog-ingress

Context:
[cloudfoundry/cf-deployment#949](cloudfoundry/cf-deployment#949)
mariash pushed a commit to cloudfoundry/diego-release that referenced this pull request Mar 6, 2023
to SendSpikeMetrics with EmitTimer instead of EmitGauge

This is to mitigate an issue that started to happen when we started using
syslog-ingress

Context:
[cloudfoundry/cf-deployment#949](cloudfoundry/cf-deployment#949)