
[logging] fix consume rate logging bug to respect 1 minute threshold #11421

Merged

Conversation

jadami10 (Contributor)

This is a small fix in realtime servers to avoid excessive logging. With this change we correctly log at most once per minute or once per 100k events.

We should have been logging the consume rate plus other metadata every minute or every 100k events. But the code used lastConsumeCount to decide whether consumeStartTime or lastLogTime should be the previous time. Since we have partitions with low to no event volume (QA and such), lastConsumeCount is often 0, which caused a log to be emitted on every consume loop. Now we always use lastLogTime as prevTime when it is non-zero.
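The corrected threshold logic can be sketched as follows. This is a minimal illustration, not the exact Pinot implementation; the constant and variable names mirror the snippet quoted later in this thread:

```java
import java.util.concurrent.TimeUnit;

// Minimal sketch of the corrected consume-rate logging thresholds.
public class ConsumeRateLogSketch {
  static final long TIME_THRESHOLD_FOR_LOG_MINUTES = 1;
  static final long MSG_COUNT_THRESHOLD_FOR_LOG = 100_000;

  // Before the fix, prevTime was chosen based on lastConsumeCount, so idle
  // partitions (lastConsumeCount == 0) fell back to consumeStartTime and
  // logged on every consume loop. After the fix, lastLogTime is always used
  // once it has been set.
  static long prevTime(long lastLogTime, long consumeStartTime) {
    return lastLogTime != 0 ? lastLogTime : consumeStartTime;
  }

  static boolean shouldLog(long nowMillis, long prevTimeMillis, long rowsConsumed) {
    return nowMillis - prevTimeMillis > TimeUnit.MINUTES.toMillis(TIME_THRESHOLD_FOR_LOG_MINUTES)
        || rowsConsumed >= MSG_COUNT_THRESHOLD_FOR_LOG;
  }

  public static void main(String[] args) {
    // Idle partition: 30 s since last log, 10 rows -> no log emitted.
    System.out.println(shouldLog(35_000, 5_000, 10));     // false
    // 61 s since last log -> time threshold reached.
    System.out.println(shouldLog(66_000, 5_000, 10));     // true
    // 100k rows -> count threshold reached regardless of elapsed time.
    System.out.println(shouldLog(6_000, 5_000, 100_000)); // true
  }
}
```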

codecov-commenter commented Aug 23, 2023

Codecov Report

Merging #11421 (1df775c) into master (b670c55) will increase coverage by 0.12%.
Report is 6 commits behind head on master.
The diff coverage is 66.66%.

@@             Coverage Diff              @@
##             master   #11421      +/-   ##
============================================
+ Coverage     62.92%   63.05%   +0.12%     
+ Complexity     1096      207     -889     
============================================
  Files          2301     2303       +2     
  Lines        123780   124041     +261     
  Branches      18831    18895      +64     
============================================
+ Hits          77888    78211     +323     
+ Misses        40360    40272      -88     
- Partials       5532     5558      +26     
Flag          Coverage          Δ
integration   <0.01% <0.00%>    (-0.01%) ⬇️
integration1  <0.01% <0.00%>    (-0.01%) ⬇️
integration2   0.00% <0.00%>    (ø)
java-11       63.00% <66.66%>   (+0.10%) ⬆️
java-17       62.85% <66.66%>   (+0.05%) ⬆️
java-20       62.84% <66.66%>   (+48.33%) ⬆️
temurin       63.05% <66.66%>   (+0.12%) ⬆️
unittests     63.04% <66.66%>   (+0.12%) ⬆️
unittests1    67.55% <66.66%>   (+0.09%) ⬆️
unittests2    14.52% <0.00%>    (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
...manager/realtime/LLRealtimeSegmentDataManager.java 50.83% <66.66%> (ø)

... and 43 files with indirect coverage changes


@Jackie-Jiang (Contributor) left a comment:

Good catch!

```java
// Log every minute or 100k events
if (now - prevTime > TimeUnit.MINUTES.toMillis(TIME_THRESHOLD_FOR_LOG_MINUTES)
    || rowsConsumed >= MSG_COUNT_THRESHOLD_FOR_LOG) {
  _segmentLogger.info(
      "Consumed {} events from (rate:{}/s), currentOffset={}, numRowsConsumedSoFar={}, numRowsIndexedSoFar={}",
      // multiply by 1000 to get events/sec. now and prevTime are in milliseconds.
      rowsConsumed, (float) (rowsConsumed) * 1000 / (now - prevTime), _currentOffset, _numRowsConsumed,
      // (remaining arguments elided in this excerpt)
```
Contributor

Not introduced in this PR, but I guess the intention should be:

Suggested change:

```diff
- rowsConsumed, (float) (rowsConsumed) * 1000 / (now - prevTime), _currentOffset, _numRowsConsumed,
+ rowsConsumed, ((float) rowsConsumed) * 1000 / (now - prevTime), _currentOffset, _numRowsConsumed,
```

Contributor Author

I changed it to your suggestion, but I also pulled it out into a separate variable. I did test before and after with RealtimeQuickstart, and it seemed to be logging correctly before as well. But I figure clearer casting can't hurt.
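For what it's worth, the two forms in the suggestion are semantically identical in Java: a unary cast binds tighter than `*`, so the cast applies to `rowsConsumed` either way, which is why logging looked correct before the change too. The real pitfall would be omitting the cast entirely. A small standalone demo (variable values are made up for illustration):

```java
public class RateCastDemo {
  public static void main(String[] args) {
    long rowsConsumed = 5;
    long now = 10_000, prevTime = 8_000; // 2 seconds apart, in milliseconds

    // A unary cast binds tighter than '*', so these two forms are equivalent:
    float a = (float) (rowsConsumed) * 1000 / (now - prevTime);
    float b = ((float) rowsConsumed) * 1000 / (now - prevTime);

    // With no cast at all, the expression is pure integer arithmetic and the
    // fractional part of the rate is truncated:
    long c = rowsConsumed * 1000 / (now - prevTime);

    System.out.println(a); // 2.5
    System.out.println(b); // 2.5
    System.out.println(c); // 2
  }
}
```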

@Jackie-Jiang Jackie-Jiang merged commit 2b40362 into apache:master Aug 25, 2023
21 checks passed
aidar-stripe pushed a commit to aidar-stripe/pinot that referenced this pull request Nov 22, 2023
…pache#295)

### Notify
cc stripe-private-oss-forks/pinot-reviewers


### Summary
Gist for pre merge diffs: https://git.corp.stripe.com/gist/5d1c5798dac53c9ca2cf791e153ca406
Gist for post merge diffs: https://git.corp.stripe.com/gist/d14940e109b4ea6c9552a1775a2872d3
Gist for diff of diffs: https://git.corp.stripe.com/gist/0228792a51c0920b2e59725c44b721de
Gist for merge conflict resolution: https://git.corp.stripe.com/gist/15826b47e566a1673db8a1ffdeec2eda
For upstream pulls, please list all the PRs of interest and risk

I needed to add 1 more commit because a duplicate method didn't show up as a conflict.

- apache#11184 upstream commit for Priyen's new tenants endpoint
- apache#10770 UI now actually supports multiple options
- apache#11301 now there's a metric for broker queries missing segments. but we already fail these queries
- apache#11151 MSE now returns the correct type on SUM/MAX/MIN
- apache#11128 OSS added an api to rebalance a whole tenant! it looks a bit complicated. likely needs some testing
- apache#11273 API to get server pool from the server
- apache#11328 this changes some CASE/WHEN semantics during upgrade. We only use case/when in RDP and auth rates queries. **todo** let's check with startree if this is ok or not
- apache#11341 new `/query` API uses MSE by default
- apache#11349 fixes "query time" in logs when broker response is empty
- apache#11363 and apache#11513 better exception formatting in the UI
- apache#11380 table shutdowns should generally be safer (can't query/add/remove segments)
- apache#11345 pulling in the freshness checker fixes we have already
- apache#11421 logging fix we contributed
- apache#11369 this upgrades hadoop, but also we shade it. we need to test this on prod sandbox
- apache#11443 fix we upstreamed to throttling metrics
- apache#11467 SQL between is now inclusive. this was only broken if it was in the select clause like `CASE WHEN X between 0 and 1`. we don't allow that form in decibel, https://livegrep.corp.stripe.com/view/stripe-internal/zoolander/src/scala/com/stripe/decibel/pinot/PinotJobConfig.scala#L115
- apache#11476 this is some fix for changing partition count on kafka topics. we don't allow this internally, but maybe this gets us closer
- apache#11532 upstream of handling empty parquet files
- apache#11515 upsert snapshot size metric, we don't use snapshots yet for RDP
- apache#11506 makes broker response public, might be useful for findata since they use java client
- apache#11553 we don't use protos or support nulls yet, but we will

### Motivation
- this is the PR the 1.0.0 release is based on
- This gets us to < 730 lines of difference between us and OSS!

### Testing
For upstream pulls: https://docs.google.com/document/d/1Zj6lBimVlCDBCycBI6IuTEuI5xwtYUYeDDOMpkwNHhk/edit#heading=h.g37jl5jhx4j5

### Rollout/monitoring/revert plan

(Merge-committed by Merge Queue - Original PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/295)