NIFI-7429 Adding status history for system level metrics #4420

Closed
wants to merge 5 commits into from

@simonbence simonbence commented Jul 22, 2020

NIFI-7429

This is a proposal for showing historical data about the NiFi node’s status in the NiFi UI. The main purpose is to provide a simple tool that makes it possible to check basic performance metrics in the UI.

From an implementation perspective, the solution is based on the existing status history function, which is already used for components such as processors. On the front end, the existing code is reused as much as possible; only some minor extension and duplication were needed. The main differences from the existing uses are the trigger (this is reachable from the global menu) and the lack of some parameters such as id or group.

The backend builds on top of VolatileComponentStatusRepository, which is already responsible for such functions. I tried to add it as an integral part of the existing metric collection, so the frequency of the measurements and the way they are triggered are not separate. The metrics themselves are distilled from SystemDiagnostics and the already collected GarbageCollectionStatus.

Creating the snapshots came with three non-trivial cases I would like to highlight:

  1. The GC metrics are not predefined, as the type of GC depends on the running environment and on the actual collectors. This prevented pre-defining the descriptors for them, so they are created during requests.
  2. Also, some GC metrics (time spent, counters) grow monotonically, because the metric collection stores the value accumulated since the start of the instance. To provide the increment since the last measurement, the collection of the GC metrics uses the previous snapshot as a baseline.
  3. The processor load average (usually in the form of “2.3” or similar) does not fit into the “long” format used by the functionality without significant information loss. To avoid bigger refactors, I introduced a new formatter type named “FRACTION”. By convention, the server side multiplies these metrics by a predefined number (1_000_000 for now), and during visualisation the frontend divides the metric value by the same number. This shifts the relevant digits into the long value range. Of course, this still comes with precision loss, but for visualisation purposes it looks sufficient.
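
As an illustration of the second case, here is a minimal sketch of turning an accumulated GC counter into per-measurement increments by keeping the previous value as a baseline. This is not the code from this PR: the class name `GcDeltaTracker`, returning 0 for the first observation, and clamping at 0 when a counter appears to reset are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: converts cumulative GC counters into per-interval increments. */
final class GcDeltaTracker {
    // Last cumulative value seen for each collector name (the names depend on the running JVM).
    private final Map<String, Long> previous = new HashMap<>();

    /**
     * Returns the increase since the previous observation of the given collector.
     * The first observation has no baseline, so it yields 0 (an assumption of this sketch).
     */
    long increment(final String collectorName, final long cumulativeValue) {
        final Long baseline = previous.put(collectorName, cumulativeValue);
        if (baseline == null) {
            return 0L;
        }
        // Clamp at 0 in case the counter went backwards, e.g. after a restart (assumption).
        return Math.max(0L, cumulativeValue - baseline);
    }
}
```

Keeping the baseline per collector name means the sketch does not need to know in advance which collectors exist, which matches the first case above.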

Thank you for your time and response!

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit? Additional commits in response to PR reviewer feedback should be made on this branch and pushed to allow change tracking. Do not squash or use --force when pushing to allow for clean monitoring of changes.

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • Have you verified that the full build is successful on JDK 8?
  • Have you verified that the full build is successful on JDK 11?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions CI for build issues and submit an update to your PR as soon as possible.

@simonbence
Contributor Author

simonbence commented Jul 22, 2020

@pvillard31 @mcgilman May I ask you to take a look at this? Thank you very much!

@pvillard31
Contributor

Playing with it and it's awesome, thanks @simonbence for this pull request! Minor suggestions at the moment:

  • ordering of the metrics on the UI side: I'd probably group things together like load stats, heap stats, repo stats, gc stats, file handler, etc. Not sure how easy it'd be.
  • there is the total number of threads which is great, it would be nice to also have the number of threads being used from the Timer Driven Thread pool, the number of threads being used from the Event Driven Thread pool. That would be great. Not sure if we display two lines on the same graph, but if possible, displaying the size value of the pool (basically the maximum the value can be over time) would be nice. That would be particularly useful to see if the thread pool is oversized or not.

@simonbence
Contributor Author

Thanks for the feedback @pvillard31 !

I fixed the order of the metrics. This was working in the same semi-random way in the other history panels as well. Now it orders based on the "ordinal" of the metrics. Looks much more organized.

I also added the thread metrics you were asking for. I found no way to add multiple lines to the diagrams without serious refactoring, so I hope it will meet your expectations the way it is. Furthermore, I added detailed metrics about the individual repositories as well (not only a summary for each type).

@pvillard31
Contributor

I've played with it and it looks good to me. That is an awesome addition to NiFi, thanks @simonbence !
@markap14 - I think it'd be best if you could have a look at the code, since you're familiar with this part
@mcgilman - as far as I can tell the UI part looks good to me, do you want to double check?

Contributor

@mcgilman mcgilman left a comment

Thanks for the PR @simonbence! Focusing primarily on the front end changes things look good. Just a minor typo and maybe a clarification about how fractions are treated.

@@ -78,6 +79,9 @@
},
'DATA_SIZE': function (d) {
return nfCommon.formatDataSize(d);
},
'FRACTION': function (d) {
return nfCommon.formatFloat(d / 1000000);
Contributor

I saw there is a metric multiplier that is applied server-side and the comment indicates that this operation is needed before presenting the value to a user. However, I'm not quite following upon first review. Can you elaborate on this a little more and explain why it's needed? Just a little worried that we have a magic number here with no reference to the fraction multiplier being applied server-side.

Contributor Author

What I found is that the metrics functionality depends on long as the metric data type, especially on the Java side. Changing this looks like it would come with serious and possibly risky refactoring. In contrast, the processor load is usually a one- or two-digit number with a fractional part. I was playing with some ideas, trying to avoid this method, but in the end they did not work well.
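
To make the convention concrete, the following is a minimal sketch of the round trip, assuming the 1_000_000 multiplier mentioned in the PR description. The class and method names are hypothetical, and the use of `Math.round` on the server side is an assumption; the actual code may truncate instead.

```java
/** Hypothetical illustration of the FRACTION convention: scale up to store, scale down to display. */
final class FractionScaling {
    // Assumed to be the same multiplier on both sides (1_000_000 per the PR description).
    static final long FRACTION_MULTIPLIER = 1_000_000L;

    /** Storing side: shift the fractional digits into the long range. */
    static long toStoredValue(final double loadAverage) {
        // Rounding to the nearest long is an assumption of this sketch.
        return Math.round(loadAverage * FRACTION_MULTIPLIER);
    }

    /** Display side: undo the shift before formatting the value. */
    static double toDisplayValue(final long storedValue) {
        return (double) storedValue / FRACTION_MULTIPLIER;
    }
}
```

With a multiplier of 1_000_000, six fractional digits survive the conversion; anything beyond that is lost, which is the precision loss the description refers to.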

Contributor

@bbende bbende left a comment

Overall looks good, left one comment, plus still a few outstanding comments from Matt.

}
)
public Response getNodeHistory() {
authorizeFlow();
Contributor

I think we probably want to move this end-point to the ControllerResource and authorize against Read - /controller. The reason being that most of the information in the node status history is really controller level information and is similar to what is returned from ControllerResource for /controller/cluster.

Contributor Author

Good point, thanks for highlighting! I moved it to the ControllerResource.

/**
* The status of a storage repository.
*/
public class StorageStatus {
Contributor

Shouldn't this be Cloneable?

@@ -3089,6 +3089,10 @@ recent change to the dataflow has caused a problem and needs to be fixed. The DF
adjust the flow as needed to fix the problem. While NiFi does not have an "undo" feature, the DFM can make new changes to the
dataflow that will fix the problem.

Select Node Status History to view instance specific metrics from the last 24 hours or if the instance runs for less time, then
since it has been started. The status history can help the DFM in troubleshooting performance barriers and provides a general
Contributor

I would use performance issues instead of performance barriers (which is not used too often in this context).

Comment on lines 42 to 62
new ValueReducer<StatusSnapshot, Long>() {
@Override
public Long reduce(final List<StatusSnapshot> values) {
long sumUtilization = 0L;
int invocations = 0;

for (final StatusSnapshot snapshot : values) {
final Long utilization = snapshot.getStatusMetric(HEAP_UTILIZATION.getDescriptor());
if (utilization != null) {
sumUtilization += utilization.longValue();
invocations++;
}
}

if (invocations == 0) {
return 0L;
}

return sumUtilization / invocations;
}
}),
Contributor

Suggested change
- new ValueReducer<StatusSnapshot, Long>() {
- @Override
- public Long reduce(final List<StatusSnapshot> values) {
- long sumUtilization = 0L;
- int invocations = 0;
- for (final StatusSnapshot snapshot : values) {
- final Long utilization = snapshot.getStatusMetric(HEAP_UTILIZATION.getDescriptor());
- if (utilization != null) {
- sumUtilization += utilization.longValue();
- invocations++;
- }
- }
- if (invocations == 0) {
- return 0L;
- }
- return sumUtilization / invocations;
- }
- }),
+ new ValueReducer<StatusSnapshot, Long>() {
+ @Override
+ public Long reduce(final List<StatusSnapshot> values) {
+ return (long) values.stream()
+ .map(snapshot -> snapshot.getStatusMetric(HEAP_UTILIZATION.getDescriptor()))
+ .filter(Objects::nonNull)
+ .mapToLong(value -> value)
+ .average()
+ .orElse(0L);
+ }
+ }),
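
For comparison, here is a self-contained sketch of the two averaging styles over a plain `List<Long>`, so the behaviour can be checked without the NiFi types. The class `AverageReducers` and its method names are hypothetical. Note that `LongStream.average()` returns an `OptionalDouble`, so the stream form averages in floating point and then casts back to `long`; for the non-negative values of a utilization metric both forms truncate the same way, though very large sums could differ slightly because of double precision.

```java
import java.util.List;
import java.util.Objects;

/** Hypothetical comparison of the loop-based and stream-based average. */
final class AverageReducers {
    /** Loop form: integer division of the sum by the number of non-null values. */
    static long loopAverage(final List<Long> values) {
        long sum = 0L;
        int count = 0;
        for (final Long value : values) {
            if (value != null) {
                sum += value;
                count++;
            }
        }
        return count == 0 ? 0L : sum / count;
    }

    /** Stream form: average() yields an OptionalDouble, which is cast back to long. */
    static long streamAverage(final List<Long> values) {
        return (long) values.stream()
                .filter(Objects::nonNull)
                .mapToLong(Long::longValue)
                .average()
                .orElse(0.0);
    }
}
```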

USED_NON_HEAP(
"usedNonHeap",
"Used Non Heap",
"The current memory usage of non-heap memory that is used by the Java virtual machine.",
Contributor

Suggested change
- "The current memory usage of non-heap memory that is used by the Java virtual machine.",
+ "The current usage of non-heap memory that is used by the Java virtual machine.",

OPEN_FILE_HANDLERS(
"openFileHandlers",
"Open File Handlers",
"The current number of open file descriptors used by the Java virtual machine.",
Contributor

file handle is a more accurate term than file handler

Suggested change
- "The current number of open file descriptors used by the Java virtual machine.",
+ "The current number of open file handles used by the Java virtual machine.",

@Produces(MediaType.APPLICATION_JSON)
@Path("status/history")
@ApiOperation(
value = "Gets configuration history for the node",
Contributor

Suggested change
- value = "Gets configuration history for the node",
+ value = "Gets status history for the node",

@ApiResponse(code = 409, message = "The request was valid but NiFi was not in the appropriate state to process it. Retrying the same request later may be successful.")
}
)
public Response getNodeHistory() {
Contributor

Suggested change
- public Response getNodeHistory() {
+ public Response getNodeStatusHistory() {


private long totalThreads;
private long eventDrivenThreads;
private long timeDrivenThreads;
Contributor

Suggested change
- private long timeDrivenThreads;
+ private long timerDrivenThreads;

/**
* The status of a NiFi node.
*/
public class NodeStatus implements Cloneable {
Contributor

Is there a reason for the new DTO classes? Couldn't we use the original SystemDiagnostics and StorageStatus instead?

Contributor Author

As for SystemDiagnostics, NodeStatus contains only a part of it and also contains information from other sources. As for StorageStatus, I will check whether we can spare that.

Contributor Author

I checked StorageStatus versus StorageUsage and now I remember: StorageUsage (the original) is from the nifi-framework-core module, but the places where we intend to use StorageStatus are in nifi-api (as these instances are exposed, together with other metrics-related DTOs), so I needed to add these.

@@ -164,6 +183,182 @@ public StatusHistory getRemoteProcessGroupStatusHistory(final String remoteGroup
return getStatusHistory(remoteGroupId, true, DEFAULT_RPG_METRICS, start, end, preferredDataPoints);
}

@Override
public StatusHistory getNodeStatusHistory() {
Contributor

Could this be covered with unit tests?

Contributor Author

It is covered in VolatileComponentStatusRepositoryTest#testNodeHistory (which should be renamed, however).

final List<MetricDescriptor<NodeStatus>> contentStorageStatusDescriptors = new LinkedList<>();
final List<MetricDescriptor<NodeStatus>> provenanceStorageStatusDescriptors = new LinkedList<>();

int ordinal = DEFAULT_NODE_METRICS.size() - 1;
Contributor

@tpalfy tpalfy Sep 4, 2020

Instead of calculating the counter,
final AtomicInteger index = new AtomicInteger(DEFAULT_NODE_METRICS.size()); could be used with index.getAndIncrement() in every new StandardMetricDescriptor
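
A minimal sketch of that idea, with a hypothetical `Descriptor` class standing in for `StandardMetricDescriptor` and made-up metric labels: the counter starts at the number of default metrics, and every new descriptor takes the next value, so no offset has to be calculated per repository.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of assigning consecutive ordinals with an AtomicInteger. */
final class OrdinalAssignment {
    /** Stand-in for a metric descriptor: only an ordinal and a label. */
    static final class Descriptor {
        final int ordinal;
        final String label;

        Descriptor(final int ordinal, final String label) {
            this.ordinal = ordinal;
            this.label = label;
        }
    }

    /** Creates two descriptors per repository, numbered after the default metrics. */
    static List<Descriptor> describe(final int defaultMetricCount, final List<String> repositoryNames) {
        // Starts right after the last default metric; each call hands out the next ordinal.
        final AtomicInteger index = new AtomicInteger(defaultMetricCount);
        final List<Descriptor> descriptors = new ArrayList<>();
        for (final String name : repositoryNames) {
            descriptors.add(new Descriptor(index.getAndIncrement(), name + " Free Space"));
            descriptors.add(new Descriptor(index.getAndIncrement(), name + " Used Space"));
        }
        return descriptors;
    }
}
```

An `AtomicInteger` is convenient here mainly because it is effectively final and can therefore also be incremented from inside lambdas or anonymous classes, where a plain `int` local could not be modified.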

Contributor Author

Thanks, it's simplified a bit!

final int storageNumber = i;
final int counter = metricDescriptors.size() - 1 + NUMBER_OF_STORAGE_METRICS * contentStorageNumber;

contentStorageStatusDescriptors.add(new StandardMetricDescriptor<>(
Contributor

Could use

Suggested change
- contentStorageStatusDescriptors.add(new StandardMetricDescriptor<>(
+ metricDescriptors.add(new StandardMetricDescriptor<NodeStatus>(

With this approach we could get rid of all the intermediary lists.

Contributor Author

I reworked this part a bit, mainly to simplify it. I also extracted the descriptor creation parts to make the flow easier to read.

Contributor

@tpalfy tpalfy left a comment

LGTM +1

@asfgit asfgit closed this in 0dff3bc Sep 10, 2020
@pvillard31
Contributor

Merged to main, thanks for this awesome improvement !

driesva pushed a commit to driesva/nifi that referenced this pull request Mar 19, 2021
Signed-off-by: Pierre Villard <pierre.villard.fr@gmail.com>

This closes apache#4420.
adenes pushed a commit to adenes/nifi that referenced this pull request Jul 5, 2021
Signed-off-by: Pierre Villard <pierre.villard.fr@gmail.com>

This closes apache#4420.
krisztina-zsihovszki pushed a commit to krisztina-zsihovszki/nifi that referenced this pull request Jun 28, 2022
Signed-off-by: Pierre Villard <pierre.villard.fr@gmail.com>

This closes apache#4420.