Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-7812] Log system resources metrics #4801

Merged
merged 2 commits into from Aug 6, 2018
Merged

Conversation

pnowojski
Copy link
Contributor

@pnowojski pnowojski commented Oct 11, 2017

What is the purpose of the change

This PR adds various system resources metrics, useful for analysing issues on machines/clusters for which there are no detailed external resources logging systems.

Verifying this change

This change was mostly manually tested, since it's difficult to write automated tests for this feature. However there is some trivial SystemResourcesCounterTest to make sure that codes executes without obvious mistakes.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)

It adds ohci dependency.

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@pnowojski pnowojski force-pushed the resources branch 3 times, most recently from 776bb1d to 1dbafb7 Compare October 12, 2017 09:11
@StephanEwen
Copy link
Contributor

Thanks for this addition. Few comments:

  • Please try to follow the common (though not enforced) code style when it comes to empty lines between class declarations, fields, methods, etc.

  • What is the license of the new dependency? Can you send a link so we can verify that?

  • If possible, let's shade that dependency, avoid increasing the dependency footprint.

  • Can we harmonize the config keys between the memory logger and the new system metrics?

@pnowojski
Copy link
Contributor Author

As on dev mailing list discussion, this feature uses https://github.com/oshi/oshi library licensed under EPL 1.0. It seems to be compatible with ours: https://www.apache.org/legal/resolved.html .

It has minimal external dependencies. Question is whether we want to shade everything that we add?

Definitely we could unify config options if we want to do that.

@StephanEwen
Copy link
Contributor

Eclipse Public License is not impossible, but tricky.

I am not a lawyer, but this is what I picked up over the year: EPL is weak copyleft, meaning linking is okay, but modifying not (from Apache License compatibility) . Shading the code in the library (which we do when building the flink dist jar) is a bit of an gray zone. It does not violate the spirit of the license, but a court may see that differently.

Various Flink users that approached us to avoid weak copyleft as much as possible because of that uncertainty, so avoiding this dependency would be desirable.

Making it an optional dependency that users explicitly have to add is possible, because then we do shade it into the Flink distribution jar.

@pnowojski
Copy link
Contributor Author

I have made this dependency optional.

@pnowojski pnowojski force-pushed the resources branch 2 times, most recently from 04a1d30 to 0aa27fb Compare November 28, 2017 09:18
Copy link
Contributor

@StephanEwen StephanEwen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, all in all.

I would ask to change all occurences of "additional logging" to "system resource logging", because "additional logging" is pretty non-descriptive.

There are also various places that are not checkstyle compliant, but not automatically caught, because the directory is currently checkstyle excluded. Please adjust them anyways, to make it easier to adopt more directories for checkstyle in the future.

/**
* Integration tests for proper initialization in the {@link TaskManagerRunner}.
*/
public class TaskManagerRunnerITCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test seems very specific to this logging, but is named as a generic TaskManagerRunner test. Give it a differnet name?

Separate question: Does it have to be an IT case that fully starts the TM, or can it be a unit test that checks config propagation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to SystemResourcesMetricsITCase.

I have added this test after some larger rebasing conflicts, because otherwise I wasn't sure that everything is starting up correctly. I think that there would have to be a handful unit tests replacing this one ITCase and they would be more prone to fail/cause problems during refactoring.

* CPU ticks and network sent/received bytes and then convert those values to CPU usage and
* send/receive byte rates.
*/
@ThreadSafe
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Thread is ThreadSafe? ;-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm :) This annotation is about public getters, but maybe indeed it doesn't make a lot of sense. On the other hand, is it harmful?

}
});
final NetworkBufferPool networkBufferPool = network.getNetworkBufferPool();
metrics.<Long, Gauge<Long>>gauge("TotalMemorySegments", () -> (long) networkBufferPool.getTotalNumberOfMemorySegments());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace with "Integer" Gauge and change to method reference?

@pnowojski pnowojski force-pushed the resources branch 2 times, most recently from 4f791c0 to aae077c Compare December 11, 2017 10:27
@pnowojski pnowojski force-pushed the resources branch 3 times, most recently from 292d6ab to d4f48fa Compare June 29, 2018 12:59
Copy link
Contributor

@aljoscha aljoscha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had one comment about settings naming but I can't really comment on the implementation. @zentol would be best for that

@@ -37,6 +37,16 @@
<td style="word-wrap: break-word;">true</td>
<td>Enable SSL support for the taskmanager data transport. This is applicable only when the global ssl flag security.ssl.enabled is set to true</td>
</tr>
<tr>
<td><h5>taskmanager.debug.additional-logging</h5></td>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about changing these to system-resource-logging, as Stephan suggested?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing is actually being logger as far as i can tell, so I would suggest system-resource-metrics and system-resource-metrics-polling-interval.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed them to: system-resource-metrics and system-resource-metrics-probing-interval. polling didn't seem quite appropriate.

Copy link
Contributor

@zentol zentol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In several places we refer to "logging" additional metrics which seems misleading, as nothing is actually logged.

Why are we restricting this to taskmanagers? I'm reasonably certain that any user that enables this for TMs would like to enable this for all containers for consistency alone.

resources metrics are updated periodically and they present average values for a configured
interval (`taskmanager.debug.additional-logging-interval`).

Logging of system resources require couple of optional dependencies to be present on the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"requires a"

@@ -37,6 +37,16 @@
<td style="word-wrap: break-word;">true</td>
<td>Enable SSL support for the taskmanager data transport. This is applicable only when the global ssl flag security.ssl.enabled is set to true</td>
</tr>
<tr>
<td><h5>taskmanager.debug.additional-logging</h5></td>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing is actually being logger as far as i can tell, so I would suggest system-resource-metrics and system-resource-metrics-polling-interval.

import static org.apache.flink.util.Preconditions.checkState;

/**
* Daemon thread logging system resources.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outdated, nothing is being logged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to Daemon thread probing system resources.

import oshi.hardware.HardwareAbstractionLayer;

/**
* Utility class ot initialize system resources metrics.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"to initialize system resource"

}
catch (NoClassDefFoundError ex) {
LOG.warn(
"Failed to initialize system resources metrics because of missing classes definitions." +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"system resource metrics" "missing class definitions"

@@ -1884,13 +1884,23 @@ object TaskManager {

// if desired, start the logging daemon that periodically logs the
// memory usage information
if (LOG.isInfoEnabled && configuration.getBoolean(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these changes shouldn't be necessary anymore.

@zentol
Copy link
Contributor

zentol commented Aug 1, 2018

Also needs a rebase.

classpath (for example placed in Flink's `lib` directory):

- `com.github.oshi:oshi-core:3.4.0` (licensed under EPL 1.0 license)
- `net.java.dev.jna:jna-platform:jar:4.2.2`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these transitive dependency of oshi-core?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. I have rephrased this paragraph.

}
} catch (InterruptedException e) {
if (running) {
LOG.error("{} has failed", SystemResourcesCounter.class.getSimpleName(), e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log as warning instead

Copy link
Contributor Author

@pnowojski pnowojski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. Applied requested changes.

@@ -37,6 +37,16 @@
<td style="word-wrap: break-word;">true</td>
<td>Enable SSL support for the taskmanager data transport. This is applicable only when the global ssl flag security.ssl.enabled is set to true</td>
</tr>
<tr>
<td><h5>taskmanager.debug.additional-logging</h5></td>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed them to: system-resource-metrics and system-resource-metrics-probing-interval. polling didn't seem quite appropriate.

classpath (for example placed in Flink's `lib` directory):

- `com.github.oshi:oshi-core:3.4.0` (licensed under EPL 1.0 license)
- `net.java.dev.jna:jna-platform:jar:4.2.2`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. I have rephrased this paragraph.

import static org.apache.flink.util.Preconditions.checkState;

/**
* Daemon thread logging system resources.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to Daemon thread probing system resources.

@pnowojski
Copy link
Contributor Author

@zentol rebased and added system resource metrics both to job manager and task manager.

@pnowojski pnowojski force-pushed the resources branch 2 times, most recently from a279b1b to 2ab886d Compare August 2, 2018 13:01
@@ -22,6 +22,16 @@
<td style="word-wrap: break-word;">"child-first"</td>
<td>Defines the class resolution strategy when loading classes from user code, meaning whether to first check the user code jar ("child-first") or the application classpath ("parent-first"). The default settings indicate to load classes first from the user code jar, which means that user code jars can include and load different dependencies than Flink uses (transitively).</td>
</tr>
<tr>
<td><h5>debug.system-resource-metrics</h5></td>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving these to MetricOptions? (they could also have a "metrics" prefix instead)

resources metrics are updated periodically and they present average values for a configured
interval (`taskmanager.debug.additional-logging-interval`).
System resources reporting is disabled by default. When `debug.system-resource-metrics`
is enabled additional metrics listed below will be available on TaskManager and JobManager.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"on Job- and TaskManagers"

@@ -2517,9 +2517,11 @@ object JobManager {
}
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove empty line

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.fail;

/**
* Integration tests for proper initialization of the system resource metrics in the {@link TaskManagerRunner}.
* Integration tests for proper initialization of the system resource metrics.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test should be updated to check that they are reported for both JM and TM.

* limitations under the License.
*/

package org.apache.flink.runtime.taskexecutor.utils;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to "runtime.metrics"

* limitations under the License.
*/

package org.apache.flink.runtime.taskexecutor;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to "runtime.metrics"


@Test
public void startTaskManagerAndCheckForRegisteredSystemMetrics() throws Exception {
assertEquals(1, TestReporter.OPENED_REPORTERS.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should be 2 reporters (it should fail in the current state)

Copy link
Contributor Author

@pnowojski pnowojski Aug 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be one. MiniCluster is starting up both job manager and task manager reusing the same reporter. Their metrics are only prefixed with localhost.jobmanager. and localhost.taskamanger.TASKMANAGER_ID.

Added test coverage to look for both jobmanager and taskmanager metrics.

@zentol
Copy link
Contributor

zentol commented Aug 3, 2018

+1

@pnowojski
Copy link
Contributor Author

Thanks for the review @zentol ! Merging :)

@asfgit asfgit merged commit cd880fb into apache:master Aug 6, 2018
NicoK pushed a commit to dataArtisans/flink that referenced this pull request Aug 8, 2018
NicoK pushed a commit to dataArtisans/flink that referenced this pull request Sep 18, 2018
Clarkkkkk pushed a commit to Clarkkkkk/flink that referenced this pull request Nov 29, 2018
This closes apache#4801.

(cherry picked from commit cd880fb)

(cherry picked from commit 80e6b77)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants