New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[logging] fix consume rate logging bug to respect 1 minute threshold #11421
[logging] fix consume rate logging bug to respect 1 minute threshold #11421
Conversation
Codecov Report
@@ Coverage Diff @@
## master #11421 +/- ##
============================================
+ Coverage 62.92% 63.05% +0.12%
+ Complexity 1096 207 -889
============================================
Files 2301 2303 +2
Lines 123780 124041 +261
Branches 18831 18895 +64
============================================
+ Hits 77888 78211 +323
+ Misses 40360 40272 -88
- Partials 5532 5558 +26
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 43 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
// Log every minute or 100k events | ||
if (now - prevTime > TimeUnit.MINUTES.toMillis(TIME_THRESHOLD_FOR_LOG_MINUTES) | ||
|| rowsConsumed >= MSG_COUNT_THRESHOLD_FOR_LOG) { | ||
_segmentLogger.info( | ||
"Consumed {} events from (rate:{}/s), currentOffset={}, numRowsConsumedSoFar={}, numRowsIndexedSoFar={}", | ||
// multiply by 1000 to get events/sec. now and prevTime are in milliseconds. | ||
rowsConsumed, (float) (rowsConsumed) * 1000 / (now - prevTime), _currentOffset, _numRowsConsumed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not introduced in this PR, but I guess the intention should be
rowsConsumed, (float) (rowsConsumed) * 1000 / (now - prevTime), _currentOffset, _numRowsConsumed, | |
rowsConsumed, ((float) rowsConsumed) * 1000 / (now - prevTime), _currentOffset, _numRowsConsumed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it your suggestion but also pulled it out into a separate var. I did test before and after with RealtimeQuickstart, and it did seem to be logging correctly before as well. But I figure clearer casting can't hurt
…pache#295) ### Notify cc stripe-private-oss-forks/pinot-reviewers ### Summary Gist for pre merge diffs: https://git.corp.stripe.com/gist/5d1c5798dac53c9ca2cf791e153ca406 Gist for post merge diffs: https://git.corp.stripe.com/gist/d14940e109b4ea6c9552a1775a2872d3 Gist for diff of diffs: https://git.corp.stripe.com/gist/0228792a51c0920b2e59725c44b721de Gist for merge conflict resolution: https://git.corp.stripe.com/gist/15826b47e566a1673db8a1ffdeec2eda For upstream pulls, please list all the PRs of interest and risk I needed to add 1 more commit because a duplicate method didn't show up as a conflict. - apache#11184 upstream commit for Priyen's new tenants endpoint - apache#10770 UI now actually supports multiple options - apache#11301 now there's a metric for broker queries missing segments. but we already fail these queries - apache#11151 MSE now returns the correct type on SUM/MAX/MIN - apache#11128 OSS added an api to rebalance a whole tenant! it looks a bit complicated. likely needs some testing - apache#11273 API to get server pool from the server - apache#11328 this changes some CASE/WHEN semantics during upgrade. We only use case/when in RDP and auth rates queries. **todo** let's check with startree if this is ok or not - apache#11341 new `/query` API uses MSE by default - apache#11349 fixes "query time" in logs when broker response is empty - apache#11363 and apache#11513 better exception formatting in the UI - apache#11380 table shutdowns should generally be safer (can't query/add/remove segments) - apache#11345 pulling in the freshness checker fixes we have already - apache#11421 logging fix we contributed - apache#11369 this upgrades hadoop, but also we shade it. we need to test this on prod sandbox - apache#11443 fix we upstreamed to throttling metrics - apache#11467 SQL between is now inclusive. this was only broken if it was in the select clause like `CASE WHEN X between 0 and 1`. we don't allow that form in decibel, https://livegrep.corp.stripe.com/view/stripe-internal/zoolander/src/scala/com/stripe/decibel/pinot/PinotJobConfig.scala#L115 - apache#11476 this is some fix for changing partition count on kafka topics. we don't allow this internally, but maybe this gets us closer - apache#11532 upstream of handling empty parquet files - apache#11515 upsert snapshot size metric, we don't use snapshots yet for RDP - apache#11506 makes broker response public, might be useful for findata since they use java client - apache#11553 we don't use protos or support nulls yet, but we will ### Motivation - this is the PR the 1.0.0 release is based on - This gets us to < 730 lines of difference between us and OSS! ### Testing For upstream pulls: https://docs.google.com/document/d/1Zj6lBimVlCDBCycBI6IuTEuI5xwtYUYeDDOMpkwNHhk/edit#heading=h.g37jl5jhx4j5 ### Rollout/monitoring/revert plan (Merge-committed by Merge Queue - Original PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/295)
This is a small
fix
in realtime servers to avoid excessive logging. This will now correctly log at most once per minute or 100k events.We should have been logging consume rate + other metadata every minute or every 100k events. But for some reason the code uses
lastConsumeCount
to check if we should useconsumeStartTime
orlastLogTime
as previous time. Since we have partitions with low to no event volume (QA and such),lastConsumeCount
is often 0, so this causes emits a log on every consume loop. Now we always uselastLogTime
asprevTime
when it's not 0.