Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Peons should not report SysMonitor stats since MiddleManager reports them. #12802

Conversation

tejaswini-imply
Copy link
Member

Description

Sysmonitor stats (mem, fs, disk, net, cpu, swap, sys, tcp) are reported by all Druid processes, including Peons that are ephemeral in nature. Since Peons always run on the same host as the MiddleManager that spawned them and is unlikely to change, the SyMonitor metrics emitted by Peon are merely duplicates. This is often not a problem except when machines are super-beefy. Imagine a 64-core machine and 32 workers running on this machine. now you will have each Peon reporting metrics for each core. that's an increase of (32 * 64)x in the number of metrics. This leads to a metric explosion.

This PR updates MetricsModule to check node role running while registering SysMonitor and not to load any existing SysMonitor$Stats.


Key changed/added classes in this PR
  • MetricsModule
  • SysMonitor

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

import org.mockito.ArgumentMatchers;
import org.mockito.Mockito;

public class SysMonitorTest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests are failing right now. is there a way to exclude these tests just for ARM?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A separate annotation can be created to disable these tests on arm64-graviton2 architecture conditionally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how will that work exactly? can you describe in more detail?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this?

@Target({ ElementType.TYPE, ElementType.METHOD })
@Retention(RetentionPolicy.RUNTIME)
@ExtendWith(DisableOnArm64Condition.class)
public @interface DisabledOnArm64 {
}
public class DisableOnArm64Condition implements ExecutionCondition {
   @Override
   public ConditionEvaluationResult evaluateExecutionCondition(ExtensionContext context) {
      String osArch = System.getProperty("os.arch");
      if(osArch.equalsIgnoreCase("aarch64")) {
         return ConditionEvaluationResult.disabled("Test disabled on arm64 architecture");
      } else {
         return ConditionEvaluationResult.enabled("Test enabled");
      }
   }
}

Although @rohangarg pointed out to neat trick in SigarPidDiscovererTest using Assume which would be lot simpler.

Map<String, String[]> dimensions = MonitorsConfig.mapOfDatasourceAndTaskID(
dataSourceTaskIdHolder.getDataSource(),
dataSourceTaskIdHolder.getTaskId()
);
return new SysMonitor(dimensions);
return new SysMonitor(dimensions, isPeonRole(nodeRoles));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of leaking isPeon into the SysMonitor - have you considered introducing a new class NoopSysMonitor that does nothing. Then in this provider if the role is a Peon, return the noop class.

I think that would be a cleaner division of responsibilities and it would make it easier to test as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely agree although I was a little skeptical to use that if any extra functionality might be added in SysMonitor in the future that might want to be used on Peons. But that's a long shot. I'll go with NoopSysMonitor. Thanks.

@capistrant
Copy link
Contributor

This is a great find and fix. Looks like CI is upset about coverage on the new code, but I will be a +1 once that is resolved

@Test
public void testDoMonitor()
{
Assume.assumeFalse("aarch64".equals(CPU_ARCH));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think it should be safe for this test to run on all architectures, but I guess there's no harm in skipping this test on aarch64

Suggested change
Assume.assumeFalse("aarch64".equals(CPU_ARCH));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test fails on aarch64 since SysMonitor is the parent of NoopSysMonitor. Sigar still gets loaded even though there's no need for it. Or should there be another class that these two classes inherit from to avoid this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops I didn't notice that when I skimmed the change. The NoopSysMonitor should just implement Monitor since it is doing nothing. Or you can extend AbstractMonitor and return false in the doMonitor() method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But MetricsModule is getting the instance of SysMonitor through Guice while loading, and getSysMonitor() provides for SysMonitor. In case of peon returning NoopSysMonitor here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok. feel free to ignore my suggestion above.

There is a way to refactor this so that sigar won't initialize on the peons, but it seems like overkill.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of adding logic in getMonitorScheduler during loading for the same and it seemed overkill. If you don't mind could you please share if there is other way for future references.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we wanted to do something like that, I would have made SysMonitor an interface and make the current class an implementation of that interface.

Another way to approach this would be to add a new annotation and inject the Monitor interface that way https://github.com/google/guice/wiki/BindingAnnotations

@suneet-s
Copy link
Contributor

I've retriggered what appears to have been flaky test failures.

+1 from me, I'll leave it to other committers in this thread to approve and merge this change.

Thanks for the contribution @tejaswini-imply !

@abhishekagarwal87 abhishekagarwal87 merged commit 5772dfd into apache:master Jul 23, 2022
@tejaswini-imply
Copy link
Member Author

Thanks @abhishekagarwal87, @suneet-s for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants