
[logging] fix consume rate logging bug to respect 1 minute threshold #11421

Merged

Conversation

jadami10 (Contributor)

This is a small fix in realtime servers to avoid excessive logging. With this change we correctly log at most once per minute or once per 100k events.

We should have been logging the consume rate plus other metadata every minute or every 100k events. But the code used lastConsumeCount to decide whether consumeStartTime or lastLogTime should be the previous time. Since we have partitions with low to no event volume (QA and such), lastConsumeCount is often 0, which caused a log to be emitted on every consume loop. Now we always use lastLogTime as prevTime when it is non-zero.
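The corrected threshold logic can be sketched as follows. This is a minimal illustration, not the exact Pinot implementation; the constant and variable names mirror the snippet quoted later in this thread:

```java
import java.util.concurrent.TimeUnit;

// Minimal sketch of the corrected consume-rate logging thresholds.
public class ConsumeRateLogSketch {
  static final long TIME_THRESHOLD_FOR_LOG_MINUTES = 1;
  static final long MSG_COUNT_THRESHOLD_FOR_LOG = 100_000;

  // Before the fix, prevTime was chosen based on lastConsumeCount, so idle
  // partitions (lastConsumeCount == 0) fell back to consumeStartTime and
  // logged on every consume loop. After the fix, lastLogTime is always used
  // once it has been set.
  static long prevTime(long lastLogTime, long consumeStartTime) {
    return lastLogTime != 0 ? lastLogTime : consumeStartTime;
  }

  static boolean shouldLog(long nowMillis, long prevTimeMillis, long rowsConsumed) {
    return nowMillis - prevTimeMillis > TimeUnit.MINUTES.toMillis(TIME_THRESHOLD_FOR_LOG_MINUTES)
        || rowsConsumed >= MSG_COUNT_THRESHOLD_FOR_LOG;
  }

  public static void main(String[] args) {
    // Idle partition: 30 s since last log, 10 rows -> no log emitted.
    System.out.println(shouldLog(35_000, 5_000, 10));     // false
    // 61 s since last log -> time threshold reached.
    System.out.println(shouldLog(66_000, 5_000, 10));     // true
    // 100k rows -> count threshold reached regardless of elapsed time.
    System.out.println(shouldLog(6_000, 5_000, 100_000)); // true
  }
}
```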

codecov-commenter commented Aug 23, 2023

Codecov Report

Merging #11421 (1df775c) into master (b670c55) will increase coverage by 0.12%.
Report is 6 commits behind head on master.
The diff coverage is 66.66%.

@@             Coverage Diff              @@
##             master   #11421      +/-   ##
============================================
+ Coverage     62.92%   63.05%   +0.12%     
+ Complexity     1096      207     -889     
============================================
  Files          2301     2303       +2     
  Lines        123780   124041     +261     
  Branches      18831    18895      +64     
============================================
+ Hits          77888    78211     +323     
+ Misses        40360    40272      -88     
- Partials       5532     5558      +26     
Flag          Coverage          Δ
integration   <0.01% <0.00%>    (-0.01%) ⬇️
integration1  <0.01% <0.00%>    (-0.01%) ⬇️
integration2   0.00% <0.00%>    (ø)
java-11       63.00% <66.66%>   (+0.10%) ⬆️
java-17       62.85% <66.66%>   (+0.05%) ⬆️
java-20       62.84% <66.66%>   (+48.33%) ⬆️
temurin       63.05% <66.66%>   (+0.12%) ⬆️
unittests     63.04% <66.66%>   (+0.12%) ⬆️
unittests1    67.55% <66.66%>   (+0.09%) ⬆️
unittests2    14.52% <0.00%>    (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
...manager/realtime/LLRealtimeSegmentDataManager.java 50.83% <66.66%> (ø)

... and 43 files with indirect coverage changes


@Jackie-Jiang (Contributor) left a comment:

Good catch!

```java
// Log every minute or 100k events
if (now - prevTime > TimeUnit.MINUTES.toMillis(TIME_THRESHOLD_FOR_LOG_MINUTES)
    || rowsConsumed >= MSG_COUNT_THRESHOLD_FOR_LOG) {
  _segmentLogger.info(
      "Consumed {} events from (rate:{}/s), currentOffset={}, numRowsConsumedSoFar={}, numRowsIndexedSoFar={}",
      // multiply by 1000 to get events/sec. now and prevTime are in milliseconds.
      rowsConsumed, (float) (rowsConsumed) * 1000 / (now - prevTime), _currentOffset, _numRowsConsumed,
      // (remaining arguments elided in this excerpt)
```
Contributor

Not introduced in this PR, but I guess the intention should be:

Suggested change:

```diff
- rowsConsumed, (float) (rowsConsumed) * 1000 / (now - prevTime), _currentOffset, _numRowsConsumed,
+ rowsConsumed, ((float) rowsConsumed) * 1000 / (now - prevTime), _currentOffset, _numRowsConsumed,
```

Contributor Author

I changed it to your suggestion, but I also pulled it out into a separate variable. I did test before and after with RealtimeQuickstart, and it seemed to be logging correctly before as well. But I figure clearer casting can't hurt.
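For what it's worth, the two forms in the suggestion are semantically identical in Java: a unary cast binds tighter than `*`, so the cast applies to `rowsConsumed` either way, which is why logging looked correct before the change too. The real pitfall would be omitting the cast entirely. A small standalone demo (variable values are made up for illustration):

```java
public class RateCastDemo {
  public static void main(String[] args) {
    long rowsConsumed = 5;
    long now = 10_000, prevTime = 8_000; // 2 seconds apart, in milliseconds

    // A unary cast binds tighter than '*', so these two forms are equivalent:
    float a = (float) (rowsConsumed) * 1000 / (now - prevTime);
    float b = ((float) rowsConsumed) * 1000 / (now - prevTime);

    // With no cast at all, the expression is pure integer arithmetic and the
    // fractional part of the rate is truncated:
    long c = rowsConsumed * 1000 / (now - prevTime);

    System.out.println(a); // 2.5
    System.out.println(b); // 2.5
    System.out.println(c); // 2
  }
}
```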

@Jackie-Jiang Jackie-Jiang merged commit 2b40362 into apache:master Aug 25, 2023
21 checks passed
aidar-stripe pushed a commit to aidar-stripe/pinot that referenced this pull request Nov 22, 2023
…pache#295)

### Notify
cc stripe-private-oss-forks/pinot-reviewers


### Summary
Gist for pre merge diffs: https://git.corp.stripe.com/gist/5d1c5798dac53c9ca2cf791e153ca406
Gist for post merge diffs: https://git.corp.stripe.com/gist/d14940e109b4ea6c9552a1775a2872d3
Gist for diff of diffs: https://git.corp.stripe.com/gist/0228792a51c0920b2e59725c44b721de
Gist for merge conflict resolution: https://git.corp.stripe.com/gist/15826b47e566a1673db8a1ffdeec2eda
For upstream pulls, please list all the PRs of interest and risk

I needed to add 1 more commit because a duplicate method didn't show up as a conflict.

- apache#11184 upstream commit for Priyen's new tenants endpoint
- apache#10770 UI now actually supports multiple options
- apache#11301 now there's a metric for broker queries missing segments. but we already fail these queries
- apache#11151 MSE now returns the correct type on SUM/MAX/MIN
- apache#11128 OSS added an api to rebalance a whole tenant! it looks a bit complicated. likely needs some testing
- apache#11273 API to get server pool from the server
- apache#11328 this changes some CASE/WHEN semantics during upgrade. We only use case/when in RDP and auth rates queries. **todo** let's check with startree if this is ok or not
- apache#11341 new `/query` API uses MSE by default
- apache#11349 fixes "query time" in logs when broker response is empty
- apache#11363 and apache#11513 better exception formatting in the UI
- apache#11380 table shutdowns should generally be safer (can't query/add/remove segments)
- apache#11345 pulling in the freshness checker fixes we have already
- apache#11421 logging fix we contributed
- apache#11369 this upgrades hadoop, but also we shade it. we need to test this on prod sandbox
- apache#11443 fix we upstreamed to throttling metrics
- apache#11467 SQL between is now inclusive. this was only broken if it was in the select clause like `CASE WHEN X between 0 and 1`. we don't allow that form in decibel, https://livegrep.corp.stripe.com/view/stripe-internal/zoolander/src/scala/com/stripe/decibel/pinot/PinotJobConfig.scala#L115
- apache#11476 this is some fix for changing partition count on kafka topics. we don't allow this internally, but maybe this gets us closer
- apache#11532 upstream of handling empty parquet files
- apache#11515 upsert snapshot size metric, we don't use snapshots yet for RDP
- apache#11506 makes broker response public, might be useful for findata since they use java client
- apache#11553 we don't use protos or support nulls yet, but we will

### Motivation
- this is the PR the 1.0.0 release is based on
- This gets us to < 730 lines of difference between us and OSS!

### Testing
For upstream pulls: https://docs.google.com/document/d/1Zj6lBimVlCDBCycBI6IuTEuI5xwtYUYeDDOMpkwNHhk/edit#heading=h.g37jl5jhx4j5

### Rollout/monitoring/revert plan

(Merge-committed by Merge Queue - Original PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/295)