
[FEATURE REQUEST]: As a CF Operator, I expect that when I observe the "AppInstanceExceededLogRateLimitCount" metric, I can see the app instance details that caused this value to be incremented so I can take action with the app owner #457

Closed
sunjayBhatia opened this issue Mar 2, 2020 · 21 comments


@sunjayBhatia
Contributor

sunjayBhatia commented Mar 2, 2020

Is your feature request related to a problem?

Right now, when AppInstanceExceededLogRateLimitCount is emitted (here), it is emitted as a per-cell tagged counter metric. The loggregator agent adds just the cell-specific tags to the metric, nothing more, resulting in metrics that look like this:

deployment:"cf" job:"diego-cell" index:"0e98fd00-47b2-4589-94f0-385f78b3a04d" ip:"10.0.1.12" tags:<key:"instance_id" value:"0e98fd00-47b2-4589-94f0-385f78b3a04d" > tags:<key:"source_id" value:"rep" > counterEvent:<name:"AppInstanceExceededLogRateLimitCount" delta:1 total:206 >

As a result, operators must do additional work to filter the app instances on the cell this metric comes from in order to identify the actual culprit of the chatty logging. This is not straightforward, and we can do better for operators by making it much easier to identify problematic apps (and even individual app instances).

Describe the solution you'd like

In the log rate limit reporter, we have access to the metric tags set on the app's desired LRP and use them to push our own log line into the app log stream. We should be able to do the same thing with our AppInstanceExceededLogRateLimitCount metric in order to tag it accordingly and ensure the value emitted is a per-app-instance metric rather than a per-cell metric.

We can potentially just make a new version of our IncrementCounter method to add tags to the envelope we want to send using this option.
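The tagging idea could look roughly like the following minimal Go sketch. Note that `metronClient`, `envelope`, and `IncrementCounterWithTags` are hypothetical stand-ins for illustration, not the real diego-logging-client or go-loggregator API:

```go
package main

import "fmt"

// envelope is a simplified stand-in for a loggregator v2 envelope;
// the real client builds these internally.
type envelope struct {
	name  string
	delta uint64
	tags  map[string]string
}

// metronClient is a minimal, hypothetical stand-in for the metron
// client; it records envelopes instead of sending them.
type metronClient struct {
	sent []envelope
}

// IncrementCounterWithTags emits a counter envelope carrying
// per-instance tags (e.g. the source_id/instance_id from the LRP's
// metric tags), so the metric is attributable to an app instance
// rather than to the cell.
func (c *metronClient) IncrementCounterWithTags(name string, tags map[string]string) {
	copied := make(map[string]string, len(tags))
	for k, v := range tags {
		copied[k] = v
	}
	c.sent = append(c.sent, envelope{name: name, delta: 1, tags: copied})
}

func main() {
	c := &metronClient{}
	appTags := map[string]string{
		"source_id":   "app-guid-1234", // hypothetical app GUID
		"instance_id": "0",
	}
	c.IncrementCounterWithTags("AppInstanceExceededLogRateLimitCount", appTags)
	fmt.Println(c.sent[0].name, c.sent[0].tags["source_id"])
}
```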

Diego repo

executor

Describe alternatives you've considered

  • An alternative is to do the emission logic here instead, keeping it in the log streamer package. This package already has reporters that periodically emit per-container (app instance) metrics.
    • maybe implement another metric reporter that is available to the log streamer
    • IMO, moving the actual call to the loggregator/metron client out to a goroutine separate from the main functionality is "safer": the way this code is currently written, if the call to loggregator blocks, the work we care about is blocked. I believe we use the parallel-goroutine pattern for metric emission a lot for exactly this purpose, and to keep metric emission logic/periodicity in one place so we don't have to re-implement periodic metrics repeatedly
  • If the "counter" type metric does not work, we can try to use a gauge style metric instead
@mdelillo
Contributor

mdelillo commented Mar 4, 2020

In lieu of emitting app instance details in the log message, we have come up with some cf commands that will show you that data using existing plugins.

First, install the firehose and log-cache plugins:

cf install-plugin 'Firehose Plugin'
cf install-plugin 'log-cache'

This command will show when any app has exceeded the log rate: cf tail -f rep | grep AppInstanceExceededLogRateLimitCount

And this command will show the specific app which has exceeded the rate: cf nozzle -f LogMessage | grep 'app instance exceeded log rate limit'

@mdelillo mdelillo closed this as completed Mar 4, 2020
@sunjayBhatia
Contributor Author

sunjayBhatia commented Mar 4, 2020

In lieu of emitting app instance details in the log message

I am not suggesting emitting app instance details in the log message, but instead tagging the counter metric (here) we emit with the same tags as the log message we put in the app log stream, so someone aggregating metrics can drill down into the metric values for individual apps. This would change the metric's source/instance id to come from the app instance instead of the cell.

I'm also curious how this cf nozzle approach will work on a larger foundation, inspecting each log line from every app to find the chatty ones seems expensive

@sunjayBhatia
Contributor Author

reopening so we can continue to chat about this

@sunjayBhatia sunjayBhatia reopened this Mar 4, 2020
@robertjsullivan
Contributor

If we add tags for application_instance_id and application_guid, the AppInstanceExceededLogRateLimitCount metric will have an enormous cardinality. This will create a large storage burden on downstream monitoring systems. Given that we expect operators to rectify the bad actors quickly, we would be storing a lot of metrics that are never used.

I think that we should have the count be by diego cell only. Logging the offending app names to Diego's logs on a set cadence seems like a reasonable solution for identifying the bad actors.

@jvshahid
Contributor

jvshahid commented Mar 6, 2020

As a CF Operator, I expect that when I observe the "AppInstanceExceededLogRateLimitCount" metric,

@sunjayBhatia I don't understand why the operator is the persona of interest in this case. I thought the whole purpose of this story was to avoid having applications negatively affect their neighbors. I think the app developer will be more interested in knowing that their application logs are getting dropped.

If we add a tag that adds application_instance_id and application_guid, the AppInstanceExceededLogRateLimitCount metric will have an enormous cardinality.

@robertjsullivan How is this different from the CPU and Memory metrics that we are currently emitting?

@aminjam
Contributor

aminjam commented Mar 6, 2020

@robertjsullivan How is this different from the CPU and Memory metrics that we are currently emitting?

For CPU and memory, we end up emitting the application_id as the source_id (https://github.com/cloudfoundry/diego-logging-client/blob/88015a7/client.go#L217), as opposed to keeping the source_id as "rep" and emitting application-level information in tags under that metric.

@sunjayBhatia
Contributor Author

I would think we would also want to change the source id to be the application id to make the metric useful to app developers if we were going to tag it as suggested

@jvshahid yeah, that is a good point, this could be reframed as an app developer facing feature (so they can keep track of their log usage, sort of another resource usage metric to key off of), or maybe we can conclude that app developers have enough info for now with the log we push into the log stream telling them their app has been rate limited

@robertjsullivan
Contributor

The cardinality isn't different from the CPU and memory metrics, but I think those are more valuable, as that data is constantly relevant. In this case, it's a counter that we'd be emitting for every instance but that is only relevant to a small few.

If we wanted to emit a logging metric for applications, I think we should emit the total count of log lines from a container. An operator could use the rate() function to determine whether it's violating the set threshold. App devs would find that useful as well.
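The rate()-style threshold check suggested above can be illustrated with a small Go sketch of what a monitoring system computes from a cumulative counter. The `sample` and `logRate` names and the 100 lines/sec limit are hypothetical, for illustration only:

```go
package main

import "fmt"

// sample is one scrape of a cumulative log-line counter for a container.
type sample struct {
	lines   float64 // cumulative count of emitted log lines
	seconds float64 // scrape timestamp, in seconds
}

// logRate mirrors what a monitoring system's rate() function derives
// from a cumulative counter: lines per second between two scrapes.
func logRate(prev, cur sample) float64 {
	if cur.seconds <= prev.seconds {
		return 0
	}
	return (cur.lines - prev.lines) / (cur.seconds - prev.seconds)
}

func main() {
	prev := sample{lines: 1000, seconds: 0}
	cur := sample{lines: 7000, seconds: 30}
	rate := logRate(prev, cur)
	const limit = 100.0 // hypothetical lines/sec threshold
	// prints: 200 lines/sec, over limit: true
	fmt.Printf("%.0f lines/sec, over limit: %v\n", rate, rate > limit)
}
```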

@aminjam
Contributor

aminjam commented Mar 11, 2020

I'm also curious how this cf nozzle approach will work on a larger foundation, inspecting each log line from every app to find the chatty ones seems expensive

@sunjayBhatia We tested this on a large foundation and were able to detect the offending apps while logged in as an admin.

Closing this issue since we have enough information for app developers to know their app is being rate limited.

Please feel free to re-open this if you'd like to see a different outcome.

@aminjam aminjam closed this as completed Mar 11, 2020
sunjayBhatia added a commit that referenced this issue Mar 23, 2020
@jkbschmid

Hi everyone, I’d like to start the discussion again, since we are having problems with using the Firehose Plugin.

  • The Firehose plugin reads all log messages for every single application instance. On our large foundations, where metron agents can send a total of 100k msg/s this doesn’t work properly. Instead, we saw that Dopplers drop lots of messages.
  • In the case of Dopplers being overloaded, the warning message that the log limit has been exceeded is also dropped, so it would be great to have a second channel for informing users.

Logging the offending app names to Diego's logs on a set cadence seems like a reasonable solution for identifying the bad actors.

For our purpose, this would be an ideal solution. The code change would hopefully be very small, since a log message is already emitted, just into the application logs rather than the executor log. Even if the same cadence (1s) were used, the additional log load should be manageable, even for large foundations.
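To make the proposal concrete, here is a minimal Go sketch of what a cadenced executor-log report could look like. All type and function names here are hypothetical illustrations, not Diego's actual rep internals:

```go
package main

import "fmt"

// exceededReporter is a hypothetical sketch of reporting rate-limited app
// instances on a cadence; names are illustrative, not Diego's real types.
type exceededReporter struct {
	// app-instance guid -> number of dropped log lines in the current interval
	exceeded map[string]int
}

func newExceededReporter() *exceededReporter {
	return &exceededReporter{exceeded: make(map[string]int)}
}

// recordDrop would be called each time the limiter drops a line for an instance.
func (r *exceededReporter) recordDrop(instanceGUID string) {
	r.exceeded[instanceGUID]++
}

// flush emits one log line per offending instance and resets the interval
// state; a real implementation would run this from a time.Ticker (e.g. every
// second) and write to the rep's logger instead of returning strings.
func (r *exceededReporter) flush() []string {
	lines := make([]string, 0, len(r.exceeded))
	for guid, n := range r.exceeded {
		lines = append(lines, fmt.Sprintf(
			"app instance exceeded log rate limit: instance-guid=%s dropped=%d", guid, n))
	}
	r.exceeded = make(map[string]int)
	return lines
}

func main() {
	r := newExceededReporter()
	r.recordDrop("instance-a")
	r.recordDrop("instance-a")
	for _, line := range r.flush() {
		fmt.Println(line)
	}
}
```

The report is per interval, so the executor log stays bounded: one line per offending instance per tick, no matter how fast that instance logs.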

What do you think?

@aminjam
Contributor

aminjam commented Sep 22, 2020

@jkbschmid Thanks for restarting the conversation. I had a few questions:

On our large foundations, where metron agents can send a total of 100k msg/s this doesn’t work properly. Instead, we saw that Dopplers drop lots of messages.

Can you please clarify:

  1. At the time of seeing this issue, was this feature turned on or off?
  2. If turned on, what was the value?
  3. Are there screenshots of KPIs for Dopplers?
  4. Have you tried using the noisy neighbor plugin for detection of the apps?

I remember seeing another issue in this area that might be related to what you are experiencing.

@aminjam aminjam reopened this Sep 23, 2020
@cf-gitbot

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

@jkbschmid

jkbschmid commented Sep 23, 2020

Hi @aminjam,
thanks for the quick response!
1/2. The log rate limit was enabled with a (relatively high) limit of 5,000 logs/s and we could already see a positive effect (load on the Dopplers decreased). We intentionally set it this high in order to not impact well-behaving applications. We’d like to set a lower limit, but that’s exactly why we need the visibility.

  3. With 35 Dopplers in our foundation we are already at the scaling limit. The two spikes in "Doppler dropped messages" at 13:52 and 13:53 came from starting the nozzle twice. Please let me know if I forgot an important metric:
    [screenshot: Doppler dropped-messages and KPI dashboards]

  4. Yes, we are also making use of the noisy neighbor nozzle. We have two problems, however. The NNN only reports based on logs per minute. If apps only have short “log spikes”, their logs will be dropped, but they won’t appear in the NNN ranking (we observed such behaviour). We also observed that the NNN’s reports differed by a large amount (several 100k logs/min) from the actual throughput at the RLP.

I also had a look at the issue you linked but, since we are already at Doppler's scalability limit, this is unfortunately not an option for us.

@aminjam
Contributor

aminjam commented Sep 23, 2020

This is quite interesting @jkbschmid! I am very interested in figuring out the spike for Doppler Dropped Per Second.

We intentionally set it this high in order to not impact well-behaving applications.

In our experience, the default of 100 log-lines/second seemed to allow well-behaved applications to operate normally. We were under the impression that if an app emits 5000 log-lines/second, it fits in the noisy-neighbor category. We'd like to learn more about the rates selected by users; can you share some insight into why you selected such a high rate?

With 35 Dopplers in our foundation we are already at the scaling limit.

It looks like Dopplers are receiving around 150K messages/second. I have heard of users being able to manage up to 15-20 million/minute (250K/second) without an issue. I wonder if there is some tuning we could do to figure out what's causing the drops.

Looking at the ingress rate, I don't see a very big jump during the same timeframe. A few follow-up questions:

The NNN only reports based on logs per minute. If apps only have short “log spikes”, then their logs will be dropped, but they won’t appear in the NNN ranking

This is an interesting finding. I don't have a lot of background for the plugin, but I am wondering if we can fix that to report the noisy-neighbor properly.

@pianohacker I am curious to get your thoughts on the loggregator side of things.

@pianohacker

Wanted to add a few useful bits of info:

  1. You can likely scale to 40 Doppler VMs, and try vertically scaling those as well. 40 is about where we've seen horizontal scaling stop producing better results.

  2. There's a confounding factor that may be present in your dashboards; Doppler (and many other Loggregator components) have two different metrics with the name dropped, distinguished by the direction tag:

    doppler{direction="ingress"} - Increments when the component is overloaded, and can't read incoming messages fast enough. Usually points to a noisy emitter somewhere upstream, or a VM with underscaled CPU.
    doppler{direction="egress"} - Increments when the component can't get information out the door fast enough. Usually points to something downstream being unable to keep up, or network/load balancer congestion.

    Your dashboard is likely mixing both metrics together, making it difficult to tell what kind of dropping Doppler is having to do. Based on the fact that your system is seeing fairly low dropped messages most of the time, I'm betting that Doppler is dropping messages on egress because your noisy neighbor nozzle is underscaled and can't keep up.

  3. It's possible to scale the noisy neighbor nozzle such that it can keep up with application traffic, but there's an easier way. We just released version 3.1.0 of the Log Cache CLI, which has a new and improved version of the cf log-meta --noise command. Log Cache is already running on your foundation at the necessary scale to support log access for the CF CLI, so it's a fairly trustworthy source of information about app log throughput.

Caveat: All of the above happen downstream from where Diego rate-limits logs, so they can't tell you anything in that circumstance.
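The ingress/egress distinction above can be sketched in a few lines of Go. The `droppedEvent` type here is an illustrative stand-in for the real loggregator counter envelope, not its actual API:

```go
package main

import "fmt"

// droppedEvent is an illustrative stand-in for a firehose CounterEvent named
// "dropped", carrying its tags (the real envelope type lives in loggregator).
type droppedEvent struct {
	delta uint64
	tags  map[string]string
}

// sumByDirection splits "dropped" totals by the direction tag, so ingress
// drops (overloaded component) are not conflated with egress drops (slow
// downstream consumer such as an underscaled nozzle).
func sumByDirection(events []droppedEvent) map[string]uint64 {
	totals := make(map[string]uint64)
	for _, e := range events {
		totals[e.tags["direction"]] += e.delta
	}
	return totals
}

func main() {
	events := []droppedEvent{
		{delta: 10, tags: map[string]string{"direction": "ingress"}},
		{delta: 500, tags: map[string]string{"direction": "egress"}},
		{delta: 5, tags: map[string]string{"direction": "ingress"}},
	}
	// A dashboard summing both directions would show 515 and hide the split.
	fmt.Println(sumByDirection(events))
}
```

Splitting a dashboard panel this way makes it immediately obvious whether the drops point upstream (ingress) or downstream (egress).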

Let me know if that helps!

@jkbschmid

Thanks a lot for your detailed answers, @aminjam and @pianohacker, this was very helpful!

First of all, sorry for the misunderstanding:
There are basically two large foundations that I am working on, one with 35 Dopplers, the other with 40, with Dopplers ingressing around 250k msgs/s (per foundation).

I think the foundations you mentioned with regards to messages/second and Doppler scalability are exactly these two (most likely you have already had some discussions about them with Plamen Doychev or Jochen Ehret on Slack or via mail).


ingress rate for only diego-cells

I can’t provide the ingress rate filtered for diego-cells only, but we should see any spikes in the Dashboard for “Metron Agents Overall Sent Messages” under "Doppler overall received" as well, right?

Firehose performance scaling KPIs

Log Transport Loss Rate = loggregator.doppler.dropped{direction=ingress} / X -> 0
Doppler Message Rate Capacity = ~150k msgs/s / 35 Dopplers = ~4,300 msgs/(s*Doppler)
Reverse Log Proxy Loss Rate = dropped / X -> 0
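As a cross-check, the KPI arithmetic above can be sketched in Go. The constants are the figures quoted in this thread, not measurements taken by this code:

```go
package main

import "fmt"

// lossRate computes a loss-rate KPI as dropped messages over total messages.
func lossRate(dropped, total float64) float64 {
	if total == 0 {
		return 0
	}
	return dropped / total
}

// perDopplerRate computes the message rate handled per Doppler instance.
func perDopplerRate(msgsPerSec float64, dopplers int) float64 {
	return msgsPerSec / float64(dopplers)
}

func main() {
	// Figures quoted above: ~150k msgs/s across 35 Dopplers, zero ingress drops.
	fmt.Printf("log transport loss rate: %.2f\n", lossRate(0, 150000))
	fmt.Printf("per-Doppler rate: %.0f msgs/s\n", perDopplerRate(150000, 35))
}
```

150,000 msgs/s over 35 Dopplers works out to roughly 4,286 msgs/s per Doppler, matching the ~4,300 figure above.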

can you share some insights about high rate selection?


We chose 5000 lines/s as the initial value in order to limit as few applications as possible. Since there are 40k app instances running on our foundation, the usage and log patterns are quite diverse. The sheer number of apps makes it hard for us to check individually whether a certain log usage pattern is actually required by the stakeholder or whether it should be limited. We’d like to identify and continuously monitor the apps whose logs are dropped to

  • get insights whether they require this amount of logs or whether this is unintentional (this is where our requirements go beyond NNN and Log Cache CLI, since we’d like to monitor log spikes as well)
  • explicitly notify stakeholders if they produce large log loads

@jkbschmid

jkbschmid commented Sep 28, 2020

...continued, in response to @pianohacker:

  1. As mentioned above, we are at 40 Dopplers on the other landscape.

  2. Thanks for the clarification, that was super helpful! This is indeed a confounding factor: our dashboard displays both directions in the same panel. As you correctly predicted, the spike was for doppler{direction="egress"}. It seems the Firehose Nozzle is underscaled and cannot keep up with the amount of logs.

  3. Thanks for the pointer to the Log Cache CLI! Great to have a solution which doesn’t require additional deployments. We already tried it out and it looks like a great way of getting more insight into the logging system without any additional computational requirements.

With the plugins that have been mentioned (and maybe some scaling/fixes), it seems possible to identify apps being hit by the rate limit.
There are three advantages I see in enriching the infrastructure log over using plugins:

  • Apps with irregular log spikes are registered (and I guess it is natural for an app to have peak loads and therefore peak log loads)
  • The information does not pass through the loggregator itself (where it may be dropped if the loggregator is overloaded), but is routed via rsyslog
  • If there are external logging systems that collect platform logs, they will also persist this piece of information, so it can be analysed at a later point in time

Thanks again for your help so far!

@sunjayBhatia
Contributor Author

@jkbschmid sounds like you would like to see the rep logs print out some details when an app exceeds the log rate limit. I personally think that is reasonable, and I agree with the advantages you mentioned above.

@aminjam
Contributor

aminjam commented Oct 6, 2020

I am glad that the Log Cache CLI was able to solve the problem after scaling the nozzle properly. If the rep were to log, on a set cadence, each container-guid that reached the log rate limit, that would tell us that this instance of the app reached the limit sometime during the interval.

Apps with irregular log spikes are registered (and I guess it is natural for an app to have peak loads and therefore peak log loads)

We wouldn't be able to tell the exact time it reached the limit and would still have to look at the Firehose Nozzle for that information.

The information does not pass the loggregator itself (where it may be dropped if the loggregator is overloaded), but is routed via rsyslog

This is true, and it would ensure the logs are delivered via rsyslog.

If a container-guid is logged (a single instance of an app), would it be useful and easy to use that information to map it back to the app?

@jkbschmid If you still feel that using this extra piece of information in the log would help with diagnosing the issue faster, I think it would make sense to log the offending containers on a set cadence. Would you be open to a PR with the above change?

Given that you are running a large foundation, I think a blog post would also be extremely valuable when it comes to using this feature with the Log Cache CLI and these potential improvements. Would you be open to creating a post describing the problems you've had and how you were able to figure out and mitigate the issue in your foundation?

@jkbschmid

Hi @aminjam,
I've had a more detailed look at the code and, if my understanding is correct, an app GUID is even available in the tags (see here). The problem for us as operators is that we’d need to enrich this information at least with organization info (which is not available) in order to make it concretely usable.

The main problems we need to solve are:

  1. Understand which apps are limited by the log rate limit
  2. Identify stakeholders that cause the majority of the load on the logging system
  3. Identify and alert stakeholders of apps that are rate limited

For 2) we would need at least the organizations; for 3), also the org or space users of the identified apps. So even if we had a log with an app GUID, we wouldn’t be able to diagnose the issue much faster than with the Log Cache CLI plugin. I could still provide a PR, though, if you deem this feature important. Otherwise I’d be okay with closing this issue and using the plugin.

Of course, the blog post idea sounds great! Not sure if this is still blog post material without any change in Diego, but if you’re still interested you can just shoot me a message (via mail or slack) with some details:)

Thanks again for your help!

@aminjam
Contributor

aminjam commented Oct 19, 2020

@jkbschmid Adding organization and space names to the rep logs is not a pattern we follow in other places or components. Even if we passed in a logger object, that type of information can become outdated when a user runs cf rename for an org or space. Since we can identify the app using the Log Cache CLI, I am leaning toward leaving the capabilities as is.

@aminjam aminjam closed this as completed Oct 20, 2020