New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE REQUEST]: As a CF Operator, I expect that when I observe the "AppInstanceExceededLogRateLimitCount" metric, I can see the app instance details that caused this value to be incremented so I can take action with the app owner #457
Comments
In lieu of emitting app instance details in the log message, we have come up with some First, install the firehose and log-cache plugins:
This command will show when any app has exceeded the log rate: And this command will show the specific app which has exceeded the rate: |
I am not suggesting emitting app instance details in the log message, but instead tagging the counter metric (here) we emit with the same tags as the log message we put in the app log stream so someone aggregating metrics can drill down into the metric values for individual apps. This would change the source/instance id of the metric to be from the app instance instead of the cell I'm also curious how this |
reopening so we can continue to chat about this |
If we add a tag that adds application_instance_id and application_guid, the AppInstanceExceededLogRateLimitCount metric will have an enormous cardinality. This will create a large storage burden on downstream monitoring systems. Given that we expect operators will want to rectify the bad actors quickly, we would be storing a lot of metrics that are not being used. I think that we should have the count be by diego cell only. Logging the offending app names to Diego's logs on a set cadence seems like a reasonable solution for identifying the bad actors. |
@sunjayBhatia I don't understand why the operator is the persona of interest in this case. I thought the whole purpose of this story was to avoid having applications negatively affect their neighbors. I think the app developer will be more interested in knowing that their application logs are getting dropped.
@robertjsullivan How is this different from the CPU and Memory metrics that we are currently emitting? |
for CPU and memory, we end up emitting the application_id as the source_id (https://github.com/cloudfoundry/diego-logging-client/blob/88015a7/client.go#L217), as opposed to keeping the source_id as "rep", but emitting application level information under that metric |
I would think we would also want to change the source id to be the application id to make the metric useful to app developers if we were going to tag it as suggested @jvshahid yeah, that is a good point, this could be reframed as an app developer facing feature (so they can keep track of their log usage, sort of another resource usage metric to key off of), or maybe we can conclude that app developers have enough info for now with the log we push into the log stream telling them their app has been rate limited |
The cardinality isn't different than the cpu metric and memory metrics but I think those are more valuable as that data is constantly relevant. In this case, it's a counter that we'd be emitting for every instance, but only relevant to a small few. If we wanted to emit a logging metric for applications, I think we should emit the total count of log lines from a container. An operator could use the rate() function to determine if it's violating the set threshold. App Devs would also find that useful as well. |
@sunjayBhatia We tested this in a large foundation and were able to detect the apps logging in as an admin. Closing this issue since we have enough information for app developers to know their app is being rate limited. Please feel free to re-open this if you'd like to see a different outcome. |
…tion compress pgzip image-spec xz Submodule src/code.cloudfoundry.org/dockerapplifecycle 4d909ee2..ed3c37b3: > Updates for compilation with latest versions of dependencies Submodule src/github.com/BurntSushi/toml 000000000...3012a1dbe (new submodule) Submodule src/github.com/containers/image 495da41bd..4276a4ff6: > v5.3.0 > Merge pull request #854 from mtrmac/arch-variant > Merge pull request #820 from nalind/manifest-list-retry-using-conversions > Merge pull request #853 from mtrmac/docker-tarfile-crash > Update containers/storage to v1.16.5 > oci manifest: do not require whitelist of layer types > Deprecate SupportsEncryption call in manifest interface > Merge pull request #829 from yihuaf/dev/yihuaf/authfile > Merge pull request #849 from containers/dependabot/go_modules/github.com/klauspost/pgzip-1.2.2 > Merge pull request #848 from rhatdan/readme > Merge pull request #836 from lumjjb/change_updated_image > containers-registries.conf.d: refine semantics > Merge pull request #845 from containers/dependabot/go_modules/github.com/containers/storage-1.16.3 > Merge pull request #842 from knqyf263/add_registry_token > Merge pull request #846 from containers/dependabot/go_modules/github.com/klauspost/compress-1.10.3 > Merge pull request #847 from containers/dependabot/go_modules/github.com/opencontainers/selinux-1.4.0 > Merge pull request #831 from kandaaaaa/master > Bump github.com/containers/storage from 1.16.1 to 1.16.2 > Bump github.com/vbauerster/mpb/v4 from 4.12.1 to 4.12.2 > Merge pull request #793 from vrothberg/rewrite-message > Merge pull request #821 from TomSweeneyRedHat/dev/tsweeney/authfile > Merge pull request #837 from containers/dependabot/go_modules/github.com/containers/storage-1.16.1 > Add $HOME/.config/containers/certs.d to perHostCertDirPath > Add $HOME/.config/containers/registries.conf to config path > Merge pull request #832 from containers/dependabot/go_modules/github.com/klauspost/compress-1.10.2 > Bump github.com/opencontainers/selinux from 1.3.2 to 1.3.3 > Merge pull request #834 from vrothberg/registries.conf.d > Merge pull request #830 from containers/dependabot/go_modules/github.com/ulikunitz/xz-0.5.7 > Merge pull request #827 from containers/dependabot/go_modules/github.com/stretchr/testify-1.5.1 > Merge pull request #828 from containers/dependabot/go_modules/github.com/opencontainers/selinux-1.3.2 > Merge pull request #822 from containers/dependabot/go_modules/github.com/containers/storage-1.16.0 > Merge pull request #823 from containers/dependabot/go_modules/github.com/vbauerster/mpb/v4-4.12.1 > Merge pull request #824 from containers/dependabot/go_modules/github.com/klauspost/compress-1.10.1 > Merge pull request #825 from containers/dependabot/go_modules/github.com/stretchr/testify-1.5.0 > Merge pull request #818 from containers/dependabot/go_modules/github.com/klauspost/compress-1.10.0 > Merge pull request #816 from ricardomaraschini/bubble-up-error > Add storage.Transport.NewStoreReference > Add validateImageID > Simplify ID value check > Add storage.Transport.GetStoreIfSet > Add tarball.NewReference > Remove tarballReference.transport > Add docker/archive.NewReference > Re-order the members of archiveReference to a more logical order > Merge pull request #815 from TomSweeneyRedHat/coc > Fixed registries.conf > Merge pull request #812 from vrothberg/release > Bump github.com/mtrmac/gpgme from 0.1.1 to 0.1.2 > Merge pull request #809 from nalind/list-conversion > bump to v5.2.1-dev > v5.2.0 > Merge pull request #806 from containers/dependabot/go_modules/github.com/containers/storage-1.15.8 > Add tests for internal/iolimits > Also use ReadAtMost in docker/tarfile > Simplify error checking in iolimits.ReadAtMost > restrict sizes of inmem processed blobs > copy: relax platform checks > Merge pull request #777 from SUSE/registry-mirror-docker-docs > Merge pull request #802 from bkuschel/master > Merge pull request #800 from containers/dependabot/go_modules/github.com/opencontainers/selinux-1.3.1 > Merge pull request #798 from containers/dependabot/go_modules/github.com/klauspost/compress-1.9.8 > Merge pull request #796 from mtrmac/by-pointer > Merge pull request #795 from containers/dependabot/go_modules/github.com/containers/storage-1.15.7 > Merge pull request #794 from mtrmac/update-gpgme > Merge pull request #791 from containers/dependabot/go_modules/github.com/pkg/errors-0.9.1 > Bump github.com/pkg/errors from 0.8.1 to 0.9.0 > Merge pull request #788 from mtrmac/mirror-errors > Merge pull request #771 from sseago/docker-image-dest-schema1-config > Merge pull request #774 from mtrmac/retry-cleanups > Merge pull request #787 from containers/dependabot/go_modules/github.com/klauspost/compress-1.9.7 > Merge pull request #785 from containers/dependabot/go_modules/github.com/klauspost/compress-1.9.5 > Merge pull request #782 from containers/dependabot/go_modules/github.com/containers/storage-1.15.5 > Bump github.com/containers/storage from 1.15.3 to 1.15.4 > Bump github.com/vbauerster/mpb/v4 from 4.11.1 to 4.11.2 > Merge pull request #769 from TomSweeneyRedHat/dev/tsweeney/registries.confup > Merge pull request #779 from rhatdan/version > Merge pull request #773 from containers/dependabot/go_modules/github.com/containers/storage-1.15.3 > Merge pull request #768 from containers/dependabot/go_modules/github.com/containers/storage-1.15.2 > Merge pull request #767 from containers/dependabot/go_modules/github.com/klauspost/compress-1.9.4 > Enable Cirrus validation tests > make test-skopeo: only run unit tests > Makefile: split tools installation > internal/pkg/keyctl: drop deprecation warning > keyctl: tests: check errors > update github.com/vbauerster/mpb to v4 > Merge pull request #763 from containers/dependabot/go_modules/github.com/containers/storage-1.15.1 > storage_test.go: check storage interfaces > Merge pull request #761 from containers/dependabot/go_modules/github.com/klauspost/compress-1.9.3 > Merge pull request #759 from containers/dependabot/go_modules/github.com/containers/storage-1.15.0 > Merge pull request #731 from bege13mot/golangci-lint > Merge pull request #673 from lumjjb/encryption-policy > Merge pull request #741 from containers/dependabot/go_modules/github.com/imdario/mergo-0.3.8 > Merge pull request #740 from containers/dependabot/go_modules/github.com/docker/docker-credential-helpers-0.6.3 > Merge pull request #754 from containers/dependabot/go_modules/github.com/containers/storage-1.14.0 > Merge pull request #738 from SUSE/interface-fix > Merge pull request #752 from mtrmac/enforce-architecture > Merge pull request #744 from mtrmac/test-failures > Merge pull request #747 from Zyqsempai/495-allow-to-overwrite-default-tmp-dir > Merge pull request #732 from SUSE/reader-adaptions > Merge pull request #735 from mtrmac/release-5 > Merge pull request #719 from QiWang19/resperrpr > Merge pull request #734 from nalind/choose-source > Merge pull request #400 from nalind/manifest-lists > Merge pull request #703 from vrothberg/status-checks > Merge pull request #722 from gabemontero/add-openshift-dockercfg-secret-mount-non-home > Merge pull request #727 from mtrmac/test-skopeo-v4 > Merge pull request #726 from mtrmac/release-4.0.1 > Merge pull request #721 from mtrmac/go-mod-major > Merge pull request #725 from nalind/guess-with-no-layers > Merge pull request #723 from vrothberg/conversion > Merge pull request #718 from rhatdan/master > Merge pull request #716 from mtrmac/compression-types > Merge pull request #715 from mtrmac/containers-libtrust > Merge pull request #717 from vrothberg/update-storage > Merge pull request #714 from mtrmac/update-mergo > Merge pull request #709 from QiWang19/resplog > Merge pull request #701 from lsm5/master-openssl > Merge pull request #711 from nalind/update-accepted-certs > Merge pull request #708 from vrothberg/replace > Merge pull request #705 from vrothberg/fix-annotations > Merge pull request #683 from QiWang19/rm_all_keyring > Merge pull request #563 from vrothberg/copy-update-mime-types > Merge pull request #693 from vrothberg/post-mod > Merge pull request #682 from praseodym/go-mod > Merge pull request #686 from nalind/keyctl-tests > Merge pull request #690 from vrothberg/fix-nil-check > Merge pull request #685 from nalind/tmpdir > Merge pull request #691 from vrothberg/fix-ci > Merge pull request #688 from vrothberg/add-nil-check > Merge pull request #684 from vrothberg/copy-default-to-gzip > Merge pull request #639 from giuseppe/zstd > Merge pull request #681 from mtrmac/3.0.2 > Merge pull request #680 from QiWang19/disable_keyring > Merge pull request #678 from mtrmac/ostree-didnt-work-and-nobody-noticed > Merge pull request #677 from vrothberg/bump-version > Merge pull request #676 from mtrmac/keyctl-followup > Merge pull request #675 from nalind/keyctl-arches > Merge pull request #672 from vrothberg/bump-version > Merge pull request #630 from SUSE/registry-cache > Merge pull request #670 from SUSE/gocache-fix > Merge pull request #669 from mtrmac/bearer-token-tls > Merge pull request #667 from TomSweeneyRedHat/dev/tsweeney/touchupstore > Merge pull request #668 from QiWang19/keyring_authfile > Merge pull request #663 from mtrmac/drop-sysregistries-v1 > Merge pull request #658 from vrothberg/enforce-blocking > Merge pull request #661 from rhatdan/install > Merge pull request #660 from rhatdan/vendor > Merge pull request #619 from QiWang19/key > Merge pull request #659 from vrothberg/release > Merge pull request #657 from vrothberg/use-spinners > Merge pull request #587 from Silvanoc/default-ostree-stub > Merge pull request #652 from jvanz/248 > Merge pull request #651 from jvanz/issue510 > Merge pull request #650 from mtrmac/docker-delete-comment > Merge pull request #648 from SUSE/typo-fix > Merge pull request #647 from vrothberg/delete-oci > Merge pull request #583 from mtrmac/storage-TryReusingBlob-harder > Merge pull request #637 from mtrmac/beautify-registries-toml > Merge pull request #576 from mtrmac/tarfile-cleanup > Merge pull request #646 from vrothberg/delete-oci > Merge pull request #644 from giuseppe/fix-selinux-ostree-err > Merge pull request #643 from mtrmac/2.0 > Merge pull request #617 from mtrmac/mirror-cleanups > Merge pull request #635 from mtrmac/registries-v2-explicit-search > Merge pull request #629 from vrothberg/mirror-digest-only > Merge pull request #636 from mtrmac/sysregistriesv2-test-fixtures > Merge pull request #631 from TristanCacqueray/master > Merge pull request #626 from wking/normalize-registry-test-flake > Merge pull request #628 from mtrmac/mirror-cleanups-code > Merge pull request #625 from mtrmac/mirror-no-credentials > Merge pull request #627 from vrothberg/update-ubuntu > Merge pull request #623 from QiWang19/authComment > Merge pull request #622 from rhatdan/atomic > Merge pull request #621 from Silvanoc/silvano/go-binaries-not-in-path > Merge pull request #593 from wking/docker-config-test-fixtures > Merge pull request #615 from vrothberg/transports-docs > Merge pull request #564 from SUSE/docker-client-mirror > Merge pull request #616 from vrothberg/update-bolt > Merge pull request #597 from wking/GetAuthentication-for-logged-in > Merge pull request #613 from giuseppe/fix-namespace > Merge pull request #612 from giuseppe/fix-blobinfocache-rootless > Merge pull request #609 from rhatdan/version > Merge pull request #605 from rhatdan/vendor > Merge pull request #598 from wking/drop-redundant-domain-call > Merge pull request #602 from wking/blob-info-boltdb-separation > Merge pull request #603 from vrothberg/progress-bar-messages > Merge pull request #599 from wking/pat-path-typo > Merge pull request #600 from wking/drop-contributing-asterisks > Merge pull request #601 from vrothberg/fix-build > Merge pull request #594 from vrothberg/certs.d-manpage > Merge pull request #589 from wking/move-docker-config-tests > Merge pull request #592 from mtrmac/bic-memory > Merge pull request #591 from vrothberg/bump-version > Merge pull request #590 from vrothberg/sync-bic > Merge pull request #586 from mtrmac/mpg-fixes > Merge pull request #581 from vrothberg/mpb-progress-bars > Merge pull request #585 from nalind/storage-multiple-manifests > Merge pull request #573 from nalind/manifest-match > Merge pull request #584 from markmc/ecr-oci-invalid-json > Merge pull request #577 from arjun024/fix-resp-sigsegv > Merge pull request #579 from mtrmac/ostree-no-pipe > Merge pull request #578 from nalind/ostree-hang > Merge pull request #569 from umohnani8/export > Merge pull request #568 from teozkr/threadsafe-docker-archive-getblob > Merge pull request #565 from SUSE/log-cleanup > Merge pull request #557 from vrothberg/parallel-push > Merge pull request #561 from rhatdan/version > Merge pull request #558 from vrothberg/progress-bar-tty-check > Merge pull request #555 from mtrmac/home-gopath > Merge pull request #556 from vrothberg/add-vrothberg-to-maintainers > Merge pull request #553 from vrothberg/avoid-rc > Merge pull request #551 from vrothberg/progressbaroutput > Merge pull request #549 from vrothberg/lint > Merge pull request #545 from vrothberg/parapull > Merge pull request #543 from giuseppe/use-optimized-gzip > Merge pull request #542 from vrothberg/vendor-ostree > Merge pull request #534 from mtrmac/prefix-match > Merge pull request #541 from vrothberg/govet > Merge pull request #523 from kinvolk/dongsu/upgrade-go-1.11 > Merge pull request #536 from mtrmac/blob-info-caching > Merge pull request #468 from vrothberg/regsv2-docker > Merge pull request #533 from darkowlzz/godoc-fix > Merge pull request #525 from mtrmac/go-selinux > Merge pull request #520 from mtrmac/search-order > Merge pull request #522 from kinvolk/dongsu/fix-test-errors > Merge pull request #519 from nalind/copy-digest > Merge pull request #341 from ashcrow/fix-pkgs-in-readme > Merge pull request #521 from runcom/fix-build-golint > Merge pull request #517 from mtrmac/pingV1-no-error > Merge pull request #516 from mtrmac/http-statustext > Merge pull request #515 from mtrmac/run-containers-error-handling > Merge pull request #512 from nalind/missing-ca-certs > Merge pull request #508 from nalind/ping-error > Merge pull request #505 from nalind/difflock > Merge pull request #506 from nalind/duplicate-layer > Merge pull request #503 from nlewo/fix-pause > Merge pull request #502 from umohnani8/search > Merge pull request #501 from lucab/ups/ostree-commit-errcheck > Merge pull request #500 from rhatdan/vendor > Merge pull request #489 from mtrmac/certs.d-multiple-ca-files > Merge pull request #499 from mtrmac/cleanups > Merge pull request #487 from mtrmac/delete-permissions > Merge pull request #498 from mtrmac/git-validation-epoch > Merge pull request #497 from vrothberg/fix-conf-references > Merge pull request #486 from Silvanoc/master > Merge pull request #493 from vrothberg/rename-docs > Merge pull request #491 from nalind/healthconfig-start-period > Merge pull request #490 from rhatdan/skopeo > Merge pull request #483 from nalind/docker-daemon-stub > Merge pull request #481 from mtrmac/docker-archive-auto-compression > Merge pull request #472 from mtrmac/storage-emptylayers > Merge pull request #480 from SUSE/send-account-field-to-authenticated-docker-registries > Merge pull request #478 from praveenkumar/issue_475 > Merge pull request #474 from rhatdan/contributing > Merge pull request #473 from mtrmac/auth.json-path > Merge pull request #470 from umohnani8/search > Merge pull request #465 from TomSweeneyRedHat/dev/tsweeney/registriesman > Merge pull request #463 from vbatts/platform.2 > Merge pull request #467 from vrothberg/sysregsv2-config-cache > Merge pull request #466 from mtrmac/sysregistriesv2-comments > Merge pull request #464 from vrothberg/cleanup-sysregsv2 > Merge pull request #426 from mtrmac/305-cleanup-2 > Merge pull request #455 from agonzalezro/wrap-nil > Merge pull request #462 from rhatdan/spelling > Merge pull request #461 from giuseppe/pinpoint-ostree-thread > Merge pull request #433 from rhatdan/registries.conf > Merge pull request #446 from mgoltzsche/docker-dependency-update > Merge pull request #459 from mtrmac/c-storage-dep > Merge pull request #458 from umohnani8/transport > Merge pull request #454 from vrothberg/https-fallback > Merge pull request #457 from mtrmac/docker-err-ignored > Merge pull request #456 from nlewo/pr-tempDir > Merge pull request #450 from vrothberg/multitags > Merge pull request #301 from mtrmac/daemon-identity > Merge pull request #441 from novas0x2a/follow-tags > Merge pull request #451 from umohnani8/auth > Merge pull request #429 from vrothberg/mirror-support > Merge pull request #444 from agonzalezro/remove-deps-from-vendor > Merge pull request #439 from umohnani8/transport > Merge pull request #440 from mtrmac/context-cleanups > Merge pull request #431 from novas0x2a/context-everywhere > Merge pull request #435 from rhatdan/policy.json > Merge pull request #436 from rhatdan/vendor > Merge pull request #434 from jlsalmon/docker-daemon-plain-http > Merge pull request #370 from nlewo/pr/docker-archive > Merge pull request #430 from vrothberg/trimright-slashes > Merge pull request #420 from mtrmac/docker-errors > Merge pull request #428 from vrothberg/trim-slashes > Merge pull request #424 from umohnani8/auth > Merge pull request #416 from mtrmac/305-cleanup > Merge pull request #423 from umohnani8/gzip > Merge pull request #408 from vbatts/platform > Merge pull request #419 from umohnani8/dir > Merge pull request #392 from giuseppe/ostree-support-copy-layer-infos > Merge pull request #418 from mtrmac/normalize-before-conversion > Merge pull request #415 from jonboulle/oci_fix > Merge pull request #411 from umohnani8/certs > Merge pull request #406 from umohnani8/search > Merge pull request #410 from ripcurld0/close_409 Submodule src/github.com/containers/libtrust 000000000...14b96171a (new submodule) Submodule src/github.com/containers/ocicrypt 000000000...b87a4a69c (new submodule) Submodule src/github.com/containers/storage 000000000...e21e7107c (new submodule) Submodule src/github.com/docker/distribution 2461543d98..5f6282db7d (rewind): < Merge pull request #2824 from dmcgowan/update-version-file-2.7.1 < Merge pull request #2805 from dmcgowan/release-2.7.1 < Merge pull request #2821 from caervs/ISS-2819 < Merge pull request #2804 from caervs/ISS-2793-2.7 < Merge pull request #2802 from davidswu/2.7-autoredirect < Merge pull request #2775 from caervs/release_notes_2.7 < Merge pull request #2772 from dmcgowan/add-arm-flag < Merge pull request #2711 from davidswu/autoredirect < Merge pull request #2742 from tescherm/configure-bugsnag-logging < Merge pull request #2734 from mgrachev/patch-1 < Merge pull request #2735 from tonistiigi/update-dockerfile < Merge pull request #2729 from liyongxin/master < Merge pull request #2731 from mirake/fix-typos < Merge pull request #2730 from dmcgowan/version-update-2.7.0-rc.0 < Merge pull request #2721 from dmcgowan/release-notes-2.7.0-rc < Merge pull request #2720 from dmcgowan/update-release-process < Merge pull request #2723 from mirake/fix-typos < Merge pull request #2707 from davidswu/go-1.11 < Merge pull request #2712 from liyongxin/master < Merge pull request #2705 from mirake/fix-typo < Merge pull request #2631 from whoshuu/feature/improve-gcs-driver < Merge pull request #2704 from dmcgowan/fix-2703 < Merge pull request #2702 from caervs/fix_path_enumeration < Merge pull request #2698 from cquon/swift_vendor < Merge pull request #2701 from davidswu/metalinter < Merge pull request #2362 from twistlock/populate_htpasswd < Merge pull request #2639 from andrew-leung/manifesteventlayers < Merge pull request #2685 from manishtomar/mani-graceful-shutdown < Merge pull request #2680 from manishtomar/mani-fix-mem-leak < Merge pull request #2474 from vikstrous/disable-v1-master < Merge pull request #2694 from caervs/fix_nginx_spacing < Merge pull request #2328 from stepyu/patch-1 < Merge pull request #2368 from monikakatiyar16/patch-1 < Merge pull request #2689 from dmp42/reviewers < Merge pull request #2690 from dmcgowan/remove-aaron-maintainers < Merge pull request #2651 from mikebrow/manifest-version-test-on-put < Merge pull request #2681 from dmcgowan/update-yaml < Merge pull request #2687 from davidswu/remove-region-validation < Merge pull request #2678 from caervs/bump_aws < Merge pull request #2679 from dmcgowan/revert-bad-vendor-update < Merge pull request #2674 from Parth/patch-1 < Merge pull request #2673 from dmp42/TLS < Merge pull request #2671 from dmp42/goamz < Merge pull request #2632 from whoshuu/feature/improve-s3-driver < Merge pull request #2599 from legionus/storage-inmemory-deadlock < Merge pull request #2602 from dmage/errmessage < Merge pull request #2662 from dmcgowan/enable-structcheck < Merge pull request #2598 from dmcgowan/update-maintainers < Merge pull request #2661 from dmcgowan/enable-goimports < Merge pull request #2648 from manishtomar/tag-deleted-event < Merge pull request #2657 from dmcgowan/add-travis-ci < Merge pull request #2076 from mikebrow/ocitype < Merge pull request #2603 from yuwaMSFT2/newazuresdk < Merge pull request #2582 from thaJeztah/bump-golang-1.10 < Merge pull request #2594 from dmcgowan/update-integration-test-certs < Merge pull request #2556 from justincormack/fewer-gorillas < Merge pull request #2538 from tianon/challanges < Merge pull request #2501 from xiaonancc77/master < Merge pull request #2549 from miminar/redis-delete-key < Merge pull request #2535 from xiaods/master < Merge pull request #2466 from tifayuki/prometheus_go_metrics < Merge pull request #2520 from ywk253100/ignore_path_not_found_error < Merge pull request #2469 from felixbuenemann/letsencrypt-host-whitelist-support < Merge pull request #2471 from GarageDeveloper/master < Merge pull request #2507 from mistyhacks/fix-storage-driver-link < Merge pull request #2455 from sargun/make-walk-faster < Merge pull request #2500 from stevvooe/remove-net-context-vestiges < Merge pull request #2302 from zetaab/oma < Merge pull request #2484 from perlun/patch-1 < Merge pull request #2485 from perlun/patch-2 < Merge pull request #2447 from tifayuki/cloudfront-s3-filter < Merge pull request #2435 from zhouhaibing089/digestset-ut < Merge pull request #2448 from kklin/follow-relative-links < Merge pull request #2431 from mistyhacks/fix-keywords < Revert "Add text about win os version ordering" < Add text about win os version ordering < Merge pull request #2428 from zhouhaibing089/algo-import < Merge pull request #2402 from smarterclayton/order_manifest < Merge pull request #2416 from stevvooe/feed-the-linter < Merge pull request #2377 from stevvooe/healthcheck-storage-ignore-notfound < Merge pull request #2388 from tiborvass/remove-misused-context < Merge pull request #2384 from smarterclayton/client_retry < Merge pull request #2382 from smarterclayton/scope_check Submodule src/github.com/klauspost/compress 0000000000...788b7f06fe (new submodule) Submodule src/github.com/klauspost/pgzip 0000000000...3286875a12 (new submodule) Submodule src/github.com/opencontainers/image-spec 577479e4dc..775207bd45: > Merge pull request #790 from vrothberg/add-zstd-constants > Merge pull request #788 from giuseppe/zstd > Merge pull request #786 from estesp/meeting-ref > Merge pull request #757 from unasuke/fix_typo > Merge pull request #782 from odinuge/lint-issue > Merge pull request #772 from woernfl/patch-1 > merge branch 'pr-771' > Merge pull request #759 from jzelinskie/loosen-mediatypes > merge branch 'pr-768' > merge branch 'pr-765' > merge branch 'pr-762' > Update MAINTAINERS emails > merge branch 'pr-755' > Merge pull request #753 from HaraldNordgren/go_versions > Merge pull request #740 from francescomari/fix-link > Merge pull request #734 from vbatts/release-v1.0.1 > merge branch 'pr-750' > Merge pull request #716 from AkihiroSuda/index-is-required > merge branch 'pr-711' > Merge pull request #739 from mtrmac/id-based-loader Submodule src/github.com/ulikunitz/xz 0000000000...067145b448 (new submodule)
Hi everyone, I’d like to start the discussion again, since we are having problems with using the Firehose Plugin.
For our purpose, this would be an ideal solution. The code change would hopefully be very small change, since a log message is already emitted, just into application logs and not into the executor log. Even if the same cadence (1s) were used, the additional log load should be manageable, even for large foundations. What do you think? |
@jkbschmid Thanks for restarting the conversation. Can you clarify a few questions I had:
Can you please clarify:
I remember seeing another issue in this area that might be related to what you are experiencing. |
We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story. The labels on this github issue will be updated when the story is started. |
Hi @aminjam,
I also had a look at the issue you linked but, since we are already at Doppler's scalability limit, this is unfortunately not an option for us. |
This is quite interesting @jkbschmid! I am very interested in figuring out the spike for Doppler Dropped Per Second.
In our experience, 100 (default) log-lines/second seemed to have allowed the well-behaved applications to operate normally. We were under the impression that if an app emits 5000 log-lines/second, it fits in the noisy-neighbor category. We'd like to learn more about the rates selected by the users, can you share some insights about high rate selection?
It looks like Dopplers are receiving around 150K messages/second. I have heard of users being able to manage up to 15-20 Million/Minute (250K/Second) without an issue. I wonder if there are some tuning we could do to figure out what's causing the drop. Looking at the Ingress rate, I don't see a very big jump during the same timeframe. A few follow up questions:
This is an interesting finding. I don't have a lot of background for the plugin, but I am wondering if we can fix that to report the noisy-neighbor properly. @pianohacker I am curious to get your thoughts on the loggregator side of things. |
Wanted to add a few useful bits of info:
Caveat: All of the above happen downstream from where Diego rate-limits logs, so they can't tell you anything in that circumstance. Let me know if that helps! |
Thanks a lot for your detailed answers, @aminjam and @pianohacker, this was very helpful! First of all, sorry for the misunderstanding: I think the foundations you mentioned with regards to messages/second and Doppler scalability are exactly these two (most likely you have already had some discussions about them with Plamen Doychev or Jochen Ehret on Slack or via mail).
I can’t provide the ingress rate filtered for diego-cells only, but we should see any spikes in the Dashboard for “Metron Agents Overall Sent Messages” under "Doppler overall received" as well, right?
Log Transport Loss Rate = 0 loggregator.doppler.dropped{direction=ingress} / X -> 0
We chose 5000 lines/s as the initial value in order to limit as few applications as possible. Since there are 40k app instances running on our foundation, the usage and log patterns are quite diverse. The sheer number of apps makes it hard for us to check individually whether a certain log usage pattern is actually required by the stakeholder or whether it should be limited. We’d like to identify and continuously monitor the apps whose logs are dropped to
|
...continued, in response to @pianohacker:
With the plugins that have been mentioned (and maybe some scaling/fixes), it seems like it is possible to identify apps being hit by the rate limit.
Thanks again for your help so far! |
@jkbschmid sounds like you would like to see the rep logs print out some details when an app exceeds the log rate limit, I personally think that is reasonable and agree with the advantages you mentioned above |
I am glad that the Log Cache CLI was able to solve the problem after scaling the nozzle properly. If there was to be a log that identified a container-guid reaching the log rate-limit in rep reporting on a set cadence, that would tell us that this instance of the app reached the limit sometime during the interval.
We wouldn't able to tell the exact time when it reached the limit and would still have to look at the Firehose Nozzle for this information
This is true and it will ensure the logs are delivered in rsyslog. If a container-guid is logged (a single instance of an app), would it be useful and easy to use that information to map it back to the app? @jkbschmid If you still feel that using this extra piece of information in the log would help with diagnosing the issue faster, I think it would make sense to log the offending containers on a set cadence. Would you be open to a PR with the above change? Given that you are running a large foundation, I think having a blog post would also be extremely valuable when it comes to using this feature with Log Cache CLI and these potential improvements. Would you be open in creating a post describing the problems you've had and how you were able to figure out and mitigate the issue in your foundation? |
Hi @aminjam, The main problems we need to solve are:
For 2), we would need at least the Organizations, for 3) the org or space users of the identified apps as well. So even if we had a log with an app GUID, we wouldn’t be able to diagnose the issue much faster than with the Log Cache CLI plugin. I could still provide a PR, though, if you deem this feature important. Otherwise I’d be okay with closing this issue and using the plugin. Of course, the blog post idea sounds great! Not sure if this is still blog post material without any change in Diego, but if you’re still interested you can just shoot me a message (via mail or slack) with some details:) Thanks again for your help! |
@jkbschmid Adding organization and space name to the rep logs is not a common pattern that we do in other places or components. Even if we passed in a logger object, those type of information can get outdated when a user runs |
Is your feature request related to a problem?
Right now, when
AppInstanceExceededLogRateLimitCount
is emitted (here), it is emitted as a per-cell tagged counter metric. The loggregator agent adds just the cell-specific tags to the metric, nothing more, resulting in metrics that look like this:As a result, operators must do some additional work to filter the app instances on the cell this metric comes from to identify the actual culprit of the chatty logging. This is not super straightforward and we can do better for operators so that it is much easier to identify problematic apps (and even app instances).
Describe the solution you'd like
In the the log rate limit reporter, we have access to the metric tags set on the app's desired LRP and use them to push our own log line into the app log stream. We should be able to do the same thing with our
AppInstanceExceededLogRateLimitCount
metric in order to tag it accordingly and ensure the value emitted is not a per-cell metric but a per-app instance metric.We can potentially just make a new version of our IncrementCounter method to add tags to the envelope we want to send using this option.
Diego repo
executor
Describe alternatives you've considered
The text was updated successfully, but these errors were encountered: