New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-7812] Log system resources metrics #4801
Conversation
776bb1d
to
1dbafb7
Compare
Thanks for this addition. Few comments:
|
As on dev mailing list discussion, this feature uses https://github.com/oshi/oshi library licensed under EPL 1.0. It seems to be compatible with ours: https://www.apache.org/legal/resolved.html . It has minimal external dependencies. Question is whether we want to shade everything that we add? Definitely we could unify config options if we want to do that. |
Eclipse Public License is not impossible, but tricky. I am not a lawyer, but this is what I picked up over the year: EPL is weak copyleft, meaning linking is okay, but modifying not (from Apache License compatibility) . Shading the code in the library (which we do when building the flink dist jar) is a bit of an gray zone. It does not violate the spirit of the license, but a court may see that differently. Various Flink users that approached us to avoid weak copyleft as much as possible because of that uncertainty, so avoiding this dependency would be desirable. Making it an optional dependency that users explicitly have to add is possible, because then we do shade it into the Flink distribution jar. |
I have made this dependency optional. |
04a1d30
to
0aa27fb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, all in all.
I would ask to change all occurences of "additional logging" to "system resource logging", because "additional logging" is pretty non-descriptive.
There are also various places that are not checkstyle compliant, but not automatically caught, because the directory is currently checkstyle excluded. Please adjust them anyways, to make it easier to adopt more directories for checkstyle in the future.
/** | ||
* Integration tests for proper initialization in the {@link TaskManagerRunner}. | ||
*/ | ||
public class TaskManagerRunnerITCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test seems very specific to this logging, but is named as a generic TaskManagerRunner test. Give it a differnet name?
Separate question: Does it have to be an IT case that fully starts the TM, or can it be a unit test that checks config propagation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to SystemResourcesMetricsITCase
.
I have added this test after some larger rebasing conflicts, because otherwise I wasn't sure that everything is starting up correctly. I think that there would have to be a handful unit tests replacing this one ITCase
and they would be more prone to fail/cause problems during refactoring.
* CPU ticks and network sent/received bytes and then convert those values to CPU usage and | ||
* send/receive byte rates. | ||
*/ | ||
@ThreadSafe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A Thread is ThreadSafe? ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm :) This annotation is about public getters, but maybe indeed it doesn't make a lot of sense. On the other hand, is it harmful?
} | ||
}); | ||
final NetworkBufferPool networkBufferPool = network.getNetworkBufferPool(); | ||
metrics.<Long, Gauge<Long>>gauge("TotalMemorySegments", () -> (long) networkBufferPool.getTotalNumberOfMemorySegments()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace with "Integer" Gauge and change to method reference?
4f791c0
to
aae077c
Compare
292d6ab
to
d4f48fa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had one comment about settings naming but I can't really comment on the implementation. @zentol would be best for that
@@ -37,6 +37,16 @@ | |||
<td style="word-wrap: break-word;">true</td> | |||
<td>Enable SSL support for the taskmanager data transport. This is applicable only when the global ssl flag security.ssl.enabled is set to true</td> | |||
</tr> | |||
<tr> | |||
<td><h5>taskmanager.debug.additional-logging</h5></td> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about changing these to system-resource-logging
, as Stephan suggested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing is actually being logger as far as i can tell, so I would suggest system-resource-metrics
and system-resource-metrics-polling-interval
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed them to: system-resource-metrics
and system-resource-metrics-probing-interval
. polling
didn't seem quite appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In several places we refer to "logging" additional metrics which seems misleading, as nothing is actually logged.
Why are we restricting this to taskmanagers? I'm reasonably certain that any user that enables this for TMs would like to enable this for all containers for consistency alone.
docs/monitoring/metrics.md
Outdated
resources metrics are updated periodically and they present average values for a configured | ||
interval (`taskmanager.debug.additional-logging-interval`). | ||
|
||
Logging of system resources require couple of optional dependencies to be present on the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"requires a"
@@ -37,6 +37,16 @@ | |||
<td style="word-wrap: break-word;">true</td> | |||
<td>Enable SSL support for the taskmanager data transport. This is applicable only when the global ssl flag security.ssl.enabled is set to true</td> | |||
</tr> | |||
<tr> | |||
<td><h5>taskmanager.debug.additional-logging</h5></td> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing is actually being logger as far as i can tell, so I would suggest system-resource-metrics
and system-resource-metrics-polling-interval
.
import static org.apache.flink.util.Preconditions.checkState; | ||
|
||
/** | ||
* Daemon thread logging system resources. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
outdated, nothing is being logged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to Daemon thread probing system resources.
import oshi.hardware.HardwareAbstractionLayer; | ||
|
||
/** | ||
* Utility class ot initialize system resources metrics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"to initialize system resource"
} | ||
catch (NoClassDefFoundError ex) { | ||
LOG.warn( | ||
"Failed to initialize system resources metrics because of missing classes definitions." + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"system resource metrics" "missing class definitions"
@@ -1884,13 +1884,23 @@ object TaskManager { | |||
|
|||
// if desired, start the logging daemon that periodically logs the | |||
// memory usage information | |||
if (LOG.isInfoEnabled && configuration.getBoolean( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these changes shouldn't be necessary anymore.
Also needs a rebase. |
docs/monitoring/metrics.md
Outdated
classpath (for example placed in Flink's `lib` directory): | ||
|
||
- `com.github.oshi:oshi-core:3.4.0` (licensed under EPL 1.0 license) | ||
- `net.java.dev.jna:jna-platform:jar:4.2.2` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these transitive dependency of oshi-core
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. I have rephrased this paragraph.
} | ||
} catch (InterruptedException e) { | ||
if (running) { | ||
LOG.error("{} has failed", SystemResourcesCounter.class.getSimpleName(), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log as warning instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review. Applied requested changes.
@@ -37,6 +37,16 @@ | |||
<td style="word-wrap: break-word;">true</td> | |||
<td>Enable SSL support for the taskmanager data transport. This is applicable only when the global ssl flag security.ssl.enabled is set to true</td> | |||
</tr> | |||
<tr> | |||
<td><h5>taskmanager.debug.additional-logging</h5></td> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed them to: system-resource-metrics
and system-resource-metrics-probing-interval
. polling
didn't seem quite appropriate.
docs/monitoring/metrics.md
Outdated
classpath (for example placed in Flink's `lib` directory): | ||
|
||
- `com.github.oshi:oshi-core:3.4.0` (licensed under EPL 1.0 license) | ||
- `net.java.dev.jna:jna-platform:jar:4.2.2` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. I have rephrased this paragraph.
import static org.apache.flink.util.Preconditions.checkState; | ||
|
||
/** | ||
* Daemon thread logging system resources. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to Daemon thread probing system resources.
@zentol rebased and added system resource metrics both to job manager and task manager. |
a279b1b
to
2ab886d
Compare
@@ -22,6 +22,16 @@ | |||
<td style="word-wrap: break-word;">"child-first"</td> | |||
<td>Defines the class resolution strategy when loading classes from user code, meaning whether to first check the user code jar ("child-first") or the application classpath ("parent-first"). The default settings indicate to load classes first from the user code jar, which means that user code jars can include and load different dependencies than Flink uses (transitively).</td> | |||
</tr> | |||
<tr> | |||
<td><h5>debug.system-resource-metrics</h5></td> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about moving these to MetricOptions
? (they could also have a "metrics" prefix instead)
docs/monitoring/metrics.md
Outdated
resources metrics are updated periodically and they present average values for a configured | ||
interval (`taskmanager.debug.additional-logging-interval`). | ||
System resources reporting is disabled by default. When `debug.system-resource-metrics` | ||
is enabled additional metrics listed below will be available on TaskManager and JobManager. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"on Job- and TaskManagers"
@@ -2517,9 +2517,11 @@ object JobManager { | |||
} | |||
} | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove empty line
import static org.junit.Assert.assertEquals; | ||
import static org.junit.Assert.fail; | ||
|
||
/** | ||
* Integration tests for proper initialization of the system resource metrics in the {@link TaskManagerRunner}. | ||
* Integration tests for proper initialization of the system resource metrics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test should be updated to check that they are reported for both JM and TM.
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.flink.runtime.taskexecutor.utils; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to "runtime.metrics"
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.flink.runtime.taskexecutor; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to "runtime.metrics"
|
||
@Test | ||
public void startTaskManagerAndCheckForRegisteredSystemMetrics() throws Exception { | ||
assertEquals(1, TestReporter.OPENED_REPORTERS.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there should be 2 reporters (it should fail in the current state)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be one. MiniCluster
is starting up both job manager and task manager reusing the same reporter. Their metrics are only prefixed with localhost.jobmanager.
and localhost.taskamanger.TASKMANAGER_ID.
Added test coverage to look for both jobmanager and taskmanager metrics.
+1 |
Thanks for the review @zentol ! Merging :) |
This closes apache#4801. (cherry picked from commit cd880fb) (cherry picked from commit 80e6b77)
What is the purpose of the change
This PR adds various system resources metrics, useful for analysing issues on machines/clusters for which there are no detailed external resources logging systems.
Verifying this change
This change was mostly manually tested, since it's difficult to write automated tests for this feature. However there is some trivial
SystemResourcesCounterTest
to make sure that codes executes without obvious mistakes.Does this pull request potentially affect one of the following parts:
It adds
ohci
dependency.Documentation