NIFI-7429 Adding status history for system level metrics #4420

simonbence · 2020-07-22T12:52:49Z

This is a proposal for having historical data about the NiFi node’s status appear in the NiFi UI. The main purpose is to provide a simple tool that makes it possible to check basic performance metrics on the UI.

From an implementation perspective, the solution is based on the existing status history function, which is already applied to components like processors. On the front end, the existing code is reused as much as possible; only some minor extension and duplication were needed. The main differences compared to the existing uses are the different trigger (this is reachable from the global menu) and the lack of some parameters like id or group.

The backend builds on top of VolatileComponentStatusRepository, which is already responsible for such functions. I tried to add it as an integral part of the existing metric collection, so the frequency of the measurements and the way of triggering are not separated. The metrics themselves are distilled from the SystemDiagnostics and the already collected GarbageCollectionStatus.

Creating the snapshots came with three non-trivial cases I would like to highlight:

The GC metrics are not predefined, as the type of GC depends on the running environment and on the actual collectors. This prevents pre-defining the descriptors for them, so they are created during requests.
Also, some GC metrics (time spent, counters) grow monotonically, because the metric collection stores the value accumulated since the start of the instance. In order to provide the increment since the last measurement, the collection of the GC metrics uses the previous snapshot as a baseline.
The processor load average (usually in the form of “2.3” or alike) does not fit into the “long” format used by the functionality without significant information loss. In order to avoid bigger refactors, I introduced a new formatter type named “FRACTION”. By convention, the server side multiplies these metrics by a predefined number (1_000_000 for now), and during visualisation the frontend divides the metric value by the same number. This shifts the relevant digits into the long value range. Of course, this still comes with precision loss, but for visualisation purposes it looks sufficient.
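The baseline handling described for the cumulative GC metrics can be sketched roughly as follows. This is only an illustration, not the code from this PR: the class `GcDeltaSketch` and the method `increment` are hypothetical names, and the sketch assumes that the first measurement (which has no baseline) reports the cumulative value itself and that a counter reset is reported as zero.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: turns cumulative GC counters into per-interval increments. */
final class GcDeltaSketch {

    // Last cumulative value seen per collector name; this is the baseline.
    private final Map<String, Long> previous = new HashMap<>();

    /**
     * Returns the increment since the previous measurement of the given collector.
     * Assumption: without a baseline the cumulative value itself is returned.
     */
    long increment(final String collectorName, final long cumulativeValue) {
        final Long baseline = previous.put(collectorName, cumulativeValue);
        if (baseline == null) {
            return cumulativeValue;
        }
        // Assumption: a smaller value means the counter was reset, so report zero.
        return Math.max(0L, cumulativeValue - baseline);
    }

    public static void main(final String[] args) {
        final GcDeltaSketch sketch = new GcDeltaSketch();
        System.out.println(sketch.increment("G1 Young Generation", 120L)); // prints 120
        System.out.println(sketch.increment("G1 Young Generation", 150L)); // prints 30
        System.out.println(sketch.increment("G1 Old Generation", 5L));     // prints 5
    }
}
```

Keeping one baseline per collector name matters because the set of collectors is only known at runtime, as noted in the first case.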

Thank you for your time and response!

In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:

For all changes:

Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit? Additional commits in response to PR reviewer feedback should be made on this branch and pushed to allow change tracking. Do not squash or use --force when pushing to allow for clean monitoring of changes.

For code changes:

Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
Have you written or updated unit tests to verify your changes?
Have you verified that the full build is successful on JDK 8?
Have you verified that the full build is successful on JDK 11?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions CI for build issues and submit an update to your PR as soon as possible.

simonbence · 2020-07-22T12:54:50Z

@pvillard31 @mcgilman May I ask you to take a look at this? Thank you very much!

pvillard31 · 2020-07-22T16:54:58Z

Playing with it and it's awesome, thanks @simonbence for this pull request! Minor suggestions at the moment:

ordering of the metrics on the UI side: I'd probably group things together, like load stats, heap stats, repo stats, gc stats, file handler, etc. Not sure how easy it'd be.
there is the total number of threads, which is great; it would be nice to also have the number of threads being used from the Timer Driven Thread pool and the number of threads being used from the Event Driven Thread pool. Not sure if we can display two lines on the same graph, but if possible, displaying the size of the pool (basically the maximum the value can reach over time) would be nice. That would be particularly useful to see whether the thread pool is oversized or not.

simonbence · 2020-07-24T15:15:29Z

Thanks for the feedback @pvillard31 !

I fixed the order of the metrics. This was working in the same semi-random way in the other history panels as well. Now they are ordered based on the "ordinal" of the metrics, which looks much more organized.
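A rough sketch of ordinal-based ordering, for illustration only. The `Descriptor` class below is a hypothetical stand-in and not NiFi's `MetricDescriptor`; the labels are made up, and the assumption is simply that each descriptor carries an integer ordinal that defines its position on the panel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: orders metric descriptors by an explicit ordinal. */
final class OrdinalOrderingSketch {

    /** Stand-in for a metric descriptor; not the NiFi MetricDescriptor interface. */
    static final class Descriptor {
        final int ordinal;
        final String label;

        Descriptor(final int ordinal, final String label) {
            this.ordinal = ordinal;
            this.label = label;
        }
    }

    /** Returns the labels sorted by ascending ordinal, leaving the input untouched. */
    static List<String> orderedLabels(final List<Descriptor> descriptors) {
        final List<Descriptor> sorted = new ArrayList<>(descriptors);
        sorted.sort(Comparator.comparingInt(d -> d.ordinal));

        final List<String> labels = new ArrayList<>();
        for (final Descriptor descriptor : sorted) {
            labels.add(descriptor.label);
        }
        return labels;
    }

    public static void main(final String[] args) {
        final List<Descriptor> unordered = Arrays.asList(
                new Descriptor(2, "Heap Utilization"),
                new Descriptor(0, "Free Heap"),
                new Descriptor(1, "Used Heap"));
        // prints [Free Heap, Used Heap, Heap Utilization]
        System.out.println(orderedLabels(unordered));
    }
}
```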

I also added the thread metrics you were asking for. I found no way to add multiple lines to the diagrams without serious refactors, so I hope it will meet your expectations the way it is. Furthermore, I added detailed metrics about the given repositories as well (not only a summary for the given types).

pvillard31 · 2020-07-30T15:42:19Z

I've played with it and it looks good to me. That is an awesome addition to NiFi, thanks @simonbence !
@markap14 - I think it'd be best if you can have a look at the code since you're familiar with this part
@mcgilman - as far as I can tell the UI part looks good to me, do you want to double check?

mcgilman

Thanks for the PR @simonbence! Focusing primarily on the front end changes things look good. Just a minor typo and maybe a clarification about how fractions are treated.

...ework-core/src/main/java/org/apache/nifi/controller/status/history/NodeStatusDescriptor.java

mcgilman · 2020-08-24T15:17:23Z

...mework-bundle/nifi-framework/nifi-web/nifi-web-ui/src/main/webapp/js/nf/nf-status-history.js

@@ -78,6 +79,9 @@
        },
        'DATA_SIZE': function (d) {
            return nfCommon.formatDataSize(d);
+        },
+        'FRACTION': function (d) {
+            return nfCommon.formatFloat(d / 1000000);


I saw there is a metric multiplier that is applied server-side and the comment indicates that this operation is needed before presenting the value to a user. However, I'm not quite following upon first review. Can you elaborate on this a little more and explain why it's needed? Just a little worried that we have a magic number here with no reference to the fraction multiplier being applied server-side.

What I found is that the metrics functionality depends on long as the metric data type, especially on the Java side. Changing this looks like it would come with serious and possibly risky refactoring. In contrast, the processor load is usually a one- or two-digit number with a fractional part. I was playing with some ideas, trying to avoid this method, but in the end they did not work well.
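To make the convention concrete, here is a minimal round-trip sketch in Java. It is not the PR's code: `FractionScalingSketch`, `toStoredValue` and `toDisplayedValue` are hypothetical names, the division stands in for what the frontend formatter does in JavaScript, and rounding to the nearest long is an assumption (the actual server-side code may truncate instead).

```java
/** Hypothetical sketch of the FRACTION convention: scale up to long, scale down for display. */
final class FractionScalingSketch {

    // Mirrors the 1_000_000 multiplier mentioned in the PR description.
    static final long FRACTION_MULTIPLIER = 1_000_000L;

    /** Server-side step: shifts six fractional digits into the long value range. */
    static long toStoredValue(final double value) {
        // Assumption: round to the nearest long rather than truncate.
        return Math.round(value * FRACTION_MULTIPLIER);
    }

    /** Display step: divides by the same multiplier to recover the fractional value. */
    static double toDisplayedValue(final long storedValue) {
        return (double) storedValue / FRACTION_MULTIPLIER;
    }

    public static void main(final String[] args) {
        final long stored = toStoredValue(2.3);
        System.out.println(stored);                   // prints 2300000
        System.out.println(toDisplayedValue(stored)); // prints 2.3

        // Digits beyond the sixth decimal place are lost, which is the precision loss noted above.
        System.out.println(toStoredValue(0.1234567891)); // prints 123457
    }
}
```

Defining the multiplier once as a named constant on each side, with a comment pointing at its counterpart, is one way to address the "magic number" concern raised in the review.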

bbende

Overall looks good, left one comment, plus still a few outstanding comments from Matt.

bbende · 2020-09-01T17:17:02Z

...nifi-framework/nifi-web/nifi-web-api/src/main/java/org/apache/nifi/web/api/FlowResource.java

+            }
+    )
+    public Response getNodeHistory() {
+        authorizeFlow();


I think we probably want to move this end-point to the ControllerResource and authorize against Read - /controller. The reason being that most of the information in the node status history is really controller level information and is similar to what is returned from ControllerResource for /controller/cluster.

Good point, thanks for highlighting! I moved it to the ControllerResource.

tpalfy · 2020-09-02T17:47:06Z

nifi-api/src/main/java/org/apache/nifi/controller/status/StorageStatus.java

+/**
+ * The status of a storage repository.
+ */
+public class StorageStatus {


Shouldn't this be Cloneable?

tpalfy · 2020-09-03T13:14:54Z

nifi-docs/src/main/asciidoc/user-guide.adoc

@@ -3089,6 +3089,10 @@ recent change to the dataflow has caused a problem and needs to be fixed. The DF
 adjust the flow as needed to fix the problem. While NiFi does not have an "undo" feature, the DFM can make new changes to the
 dataflow that will fix the problem.

+Select Node Status History to view instance specific metrics from the last 24 hours or if the instance runs for less time, then
+since it has been started. The status history can help the DFM in troubleshooting performance barriers and provides a general


I would use performance issues instead of performance barriers (which is not used too often in this context).

tpalfy · 2020-09-03T13:28:23Z

...ework-core/src/main/java/org/apache/nifi/controller/status/history/NodeStatusDescriptor.java

+            new ValueReducer<StatusSnapshot, Long>() {
+                @Override
+                public Long reduce(final List<StatusSnapshot> values) {
+                    long sumUtilization = 0L;
+                    int invocations = 0;
+
+                    for (final StatusSnapshot snapshot : values) {
+                        final Long utilization = snapshot.getStatusMetric(HEAP_UTILIZATION.getDescriptor());
+                        if (utilization != null) {
+                            sumUtilization += utilization.longValue();
+                            invocations++;
+                        }
+                    }
+
+                    if (invocations == 0) {
+                        return 0L;
+                    }
+
+                    return sumUtilization / invocations;
+                }
+            }),


Suggested change

            new ValueReducer<StatusSnapshot, Long>() {
                @Override
                public Long reduce(final List<StatusSnapshot> values) {
                    long sumUtilization = 0L;
                    int invocations = 0;

                    for (final StatusSnapshot snapshot : values) {
                        final Long utilization = snapshot.getStatusMetric(HEAP_UTILIZATION.getDescriptor());
                        if (utilization != null) {
                            sumUtilization += utilization.longValue();
                            invocations++;
                        }
                    }

                    if (invocations == 0) {
                        return 0L;
                    }

                    return sumUtilization / invocations;
                }
            }),

            new ValueReducer<StatusSnapshot, Long>() {
                @Override
                public Long reduce(final List<StatusSnapshot> values) {
                    return (long) values.stream()
                            .map(snapshot -> snapshot.getStatusMetric(HEAP_UTILIZATION.getDescriptor()))
                            .filter(Objects::nonNull)
                            .mapToLong(value -> value)
                            .average()
                            .orElse(0L);
                }
            }),
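For comparison, the two variants can be exercised side by side on plain values. This is a simplified sketch, not NiFi code: `AverageReducerSketch` is a hypothetical class, and a `List<Long>` with `null` entries stands in for snapshots whose metric is missing.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

/** Hypothetical sketch comparing the loop-based and the stream-based average. */
final class AverageReducerSketch {

    /** Loop variant: sums the non-null values and uses integer division. */
    static long averageWithLoop(final List<Long> values) {
        long sum = 0L;
        int count = 0;
        for (final Long value : values) {
            if (value != null) {
                sum += value;
                count++;
            }
        }
        return count == 0 ? 0L : sum / count;
    }

    /** Stream variant: averages as a double, then truncates by casting to long. */
    static long averageWithStream(final List<Long> values) {
        return (long) values.stream()
                .filter(Objects::nonNull)
                .mapToLong(Long::longValue)
                .average()
                .orElse(0.0);
    }

    public static void main(final String[] args) {
        final List<Long> utilization = Arrays.asList(40L, null, 55L, 60L);
        System.out.println(averageWithLoop(utilization));   // prints 51
        System.out.println(averageWithStream(utilization)); // prints 51
        System.out.println(averageWithStream(Collections.<Long>emptyList())); // prints 0
    }
}
```

Both variants truncate toward zero, so they agree for utilization-sized values; the stream variant goes through a `double`, which could only differ for sums too large to be represented exactly.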

tpalfy · 2020-09-03T13:31:12Z

...ework-core/src/main/java/org/apache/nifi/controller/status/history/NodeStatusDescriptor.java

+    USED_NON_HEAP(
+            "usedNonHeap",
+            "Used Non Heap",
+            "The current memory usage of non-heap memory that is used by the Java virtual machine.",


Suggested change

"The current memory usage of non-heap memory that is used by the Java virtual machine.",

"The current usage of non-heap memory that is used by the Java virtual machine.",

tpalfy · 2020-09-03T13:36:03Z

...ework-core/src/main/java/org/apache/nifi/controller/status/history/NodeStatusDescriptor.java

+    OPEN_FILE_HANDLERS(
+            "openFileHandlers",
+            "Open File Handlers",
+            "The current number of open file descriptors used by the Java virtual machine.",


file handle is a more accurate term than file handler

Suggested change

"The current number of open file descriptors used by the Java virtual machine.",

"The current number of open file handles used by the Java virtual machine.",

tpalfy · 2020-09-03T15:27:34Z

...ramework/nifi-web/nifi-web-api/src/main/java/org/apache/nifi/web/api/ControllerResource.java

+    @Produces(MediaType.APPLICATION_JSON)
+    @Path("status/history")
+    @ApiOperation(
+            value = "Gets configuration history for the node",


Suggested change

value = "Gets configuration history for the node",

value = "Gets status history for the node",

tpalfy · 2020-09-03T15:30:29Z

...ramework/nifi-web/nifi-web-api/src/main/java/org/apache/nifi/web/api/ControllerResource.java

+                    @ApiResponse(code = 409, message = "The request was valid but NiFi was not in the appropriate state to process it. Retrying the same request later may be successful.")
+            }
+    )
+    public Response getNodeHistory() {


Suggested change

public Response getNodeHistory() {

public Response getNodeStatusHistory() {

tpalfy · 2020-09-03T16:36:42Z

nifi-api/src/main/java/org/apache/nifi/controller/status/NodeStatus.java

+
+    private long totalThreads;
+    private long eventDrivenThreads;
+    private long timeDrivenThreads;


Suggested change

private long timeDrivenThreads;

private long timerDrivenThreads;

tpalfy · 2020-09-03T16:45:47Z

nifi-api/src/main/java/org/apache/nifi/controller/status/NodeStatus.java

+/**
+ * The status of a NiFi node.
+ */
+public class NodeStatus implements Cloneable {


Is there a reason for the new DTO classes? Couldn't we use the original SystemDiagnostics and StorageStatus instead?

As for SystemDiagnostics, NodeStatus contains only a part of it and also contains information from other sources. As for StorageStatus, I will check whether we can spare that.

I checked StorageStatus versus StorageUsage and now I remember: StorageUsage (the original) is from the nifi-framework-core module, but the places where we intend to use StorageStatus are in nifi-api (as these instances are exposed, together with other metrics-related DTOs), so I needed to add these.

tpalfy · 2020-09-03T17:12:55Z

...c/main/java/org/apache/nifi/controller/status/history/VolatileComponentStatusRepository.java

@@ -164,6 +183,182 @@ public StatusHistory getRemoteProcessGroupStatusHistory(final String remoteGroup
        return getStatusHistory(remoteGroupId, true, DEFAULT_RPG_METRICS, start, end, preferredDataPoints);
    }

+    @Override
+    public StatusHistory getNodeStatusHistory() {


Could this be covered with unit tests?

It is covered in VolatileComponentStatusRepositoryTest#testNodeHistory (which should be renamed, however).

tpalfy · 2020-09-04T13:14:25Z

...c/main/java/org/apache/nifi/controller/status/history/VolatileComponentStatusRepository.java

+        final List<MetricDescriptor<NodeStatus>> contentStorageStatusDescriptors = new LinkedList<>();
+        final List<MetricDescriptor<NodeStatus>> provenanceStorageStatusDescriptors = new LinkedList<>();
+
+        int ordinal = DEFAULT_NODE_METRICS.size() - 1;


Instead of calculating the counter,
final AtomicInteger index = new AtomicInteger(DEFAULT_NODE_METRICS.size()); could be used with index.getAndIncrement() in every new StandardMetricDescriptor
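The suggestion could look roughly like the sketch below. The `Descriptor` class is a hypothetical stand-in for `StandardMetricDescriptor`, and the field names and the two metrics per repository are made up for illustration; only the `AtomicInteger` with `getAndIncrement()` is taken from the comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: assigns consecutive indexes with AtomicInteger.getAndIncrement(). */
final class DescriptorIndexSketch {

    /** Stand-in for a metric descriptor; not NiFi's StandardMetricDescriptor. */
    static final class Descriptor {
        final int index;
        final String field;

        Descriptor(final int index, final String field) {
            this.index = index;
            this.field = field;
        }
    }

    /** Creates two descriptors per repository, numbering them from firstFreeIndex upwards. */
    static List<Descriptor> storageDescriptors(final int firstFreeIndex, final List<String> repositories) {
        // The counter hands out the next index on every call, so no offset arithmetic is needed.
        final AtomicInteger index = new AtomicInteger(firstFreeIndex);
        final List<Descriptor> descriptors = new ArrayList<>();

        for (final String repository : repositories) {
            descriptors.add(new Descriptor(index.getAndIncrement(), repository + "FreeSpace"));
            descriptors.add(new Descriptor(index.getAndIncrement(), repository + "UsedSpace"));
        }
        return descriptors;
    }

    public static void main(final String[] args) {
        for (final Descriptor d : storageDescriptors(10, Arrays.asList("content0", "content1"))) {
            System.out.println(d.index + " " + d.field);
        }
        // prints indexes 10 to 13, one descriptor per line
    }
}
```

An `AtomicInteger` also stays usable inside lambdas, where a plain `int` local could not be incremented.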

Thanks, it's simplified a bit!

tpalfy · 2020-09-04T13:15:41Z

...c/main/java/org/apache/nifi/controller/status/history/VolatileComponentStatusRepository.java

+                final int storageNumber = i;
+                final int counter = metricDescriptors.size() - 1 + NUMBER_OF_STORAGE_METRICS * contentStorageNumber;
+
+                contentStorageStatusDescriptors.add(new StandardMetricDescriptor<>(


Could use

Suggested change

contentStorageStatusDescriptors.add(new StandardMetricDescriptor<>(

metricDescriptors.add(new StandardMetricDescriptor<NodeStatus>(

With this approach we could get rid of all the intermediary lists.

I reworked this part a bit. Mainly simplification. Also I extracted the descriptor creation parts to make the flow easier to read.

tpalfy

LGTM +1

pvillard31 · 2020-09-10T14:20:12Z

Merged to main, thanks for this awesome improvement !

Signed-off-by: Pierre Villard <pierre.villard.fr@gmail.com> This closes apache#4420.

NIFI-7429 Adding status history for system level metrics

225d3d9

simonbence force-pushed the NIFI-7429 branch from f87b800 to 225d3d9 Compare July 22, 2020 15:03

NIFI-7429 Refining functinality based on review comments

0b0531d

mcgilman reviewed Aug 24, 2020

bbende reviewed Sep 1, 2020

NIFI-7429 Code review changes

1d103b6

tpalfy requested changes Sep 3, 2020

tpalfy requested changes Sep 4, 2020

simonbence added 2 commits September 7, 2020 14:09

NIFI-7429 Code review changes & removing misleading uptime from history panel

234fdd8

NIFI-7429 Code review suggestion

2ff2209

tpalfy approved these changes Sep 9, 2020

pvillard31 approved these changes Sep 10, 2020

asfgit closed this in 0dff3bc Sep 10, 2020
