Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Druid doesn't explicitly support C4 garbage collector #13833

Closed
Holmistr opened this issue Feb 22, 2023 · 4 comments
Closed

Druid doesn't explicitly support C4 garbage collector #13833

Holmistr opened this issue Feb 22, 2023 · 4 comments

Comments

@Holmistr
Copy link

Description

Symptom

Running JvmMonitorTest on Azul Platform Prime (formerly known as Azul Zing, an OpenJDK based JVM) with C4 garbage collector results in timeout:

JvmMonitorTest.testGcCounts:49 TestTimedOut test timed out after 60000 milliseconds

Investigation and root cause

We investigated and additional logging in GcTrackingEmitter showed:

G1 GC:
event to map = {feed=metrics, metric=jvm/gc/cpu, service=test, host=localhost, gcName=[g1], value=0, gcGen=[old], timestamp=2023-02-21T16:56:15.839Z}

Prime:
event to map = {feed=metrics, metric=jvm/gc/count, service=test, host=localhost, gcName=[GPGC New], value=0, gcGen=[GPGC New], timestamp=2023-02-21T16:57:17.964Z}

gcGen set correctly for G1 because of the following method JvmMonitor.java#L223(there are no correct processing of Prime managers)

Due to this switch accepts an invalid string in 'emit' method(JvmMonitorTest.java#L89), GcTrackingEmitter does not update {old|young}GcCount and {old|young}GcSeen never return true.

We further found out that C4 is not in the list of explicitly known garbage collectors by Druid.

Workaround

There are two possible workarounds which make Prime "look like" it's G1. To be precise, they change the strings of memory managers to the values that G1 uses. The workarounds:

  • Adding -XX:+MimicG1GCMemoryManagerNames
  • Adding -XX:GPGCOldGCMemoryManagerName="G1 Old Generation"

Fix

We're happy to contribute the change, it seems rather straightforward. Perhaps some guidance from the Druid's team would be helpful to make sure we don't overlook any place where this needs to be changed. Any thoughts?

@imply-cheddar
Copy link
Contributor

This is really a problem with that test more than anything. The names that C4 is generating are fine and would "normally" flow out through metrics, just that test really wants to have some indication of young vs. old so that it can try to decide when it should stop. The test itself is dubious if you ask me (is it trying to validate that the JVM does GC? Or trying to validate things that the monitor does. Likely, the JvmMonitor itself should be getting its values from an interface (with an overloaded, default implementation that delegates to the MXBeans) and the test should be passing in its own definition of that interface to generate values that the test would validate. The fragility of the test seems to be rooted in trying to validate that the JVM does what's on the tin, which is something that I don't think our tests should be attempting to validate.

@Holmistr
Copy link
Author

@imply-cheddar Thanks for the reply.
How do you think we should proceed? I don't really like the idea that anybody that tries to run Druid on Prime (e.g. as a sanity check) will get a failure. It will look bad for Prime, while the root cause is actually in the test. I know that there won't be really many people like this, but still, the QA guys in me just can't live with it :) Additionally, we're probably going to test Druid internally, so we don't want such false positives either.

I'm happy to contribute the change, but I don't really have that much time for it. If you could provide some pointers, that would be great.

Copy link

github-actions bot commented Feb 7, 2024

This issue has been marked as stale due to 280 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If this issue is still
relevant, please simply write any comment. Even if closed, you can still revive the
issue at any time or discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Feb 7, 2024
Copy link

github-actions bot commented Mar 6, 2024

This issue has been closed due to lack of activity. If you think that
is incorrect, or the issue requires additional review, you can revive the issue at
any time.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants