-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NIFI-7429 Adding status history for system level metrics #4420
Conversation
@pvillard31 @mcgilman May I ask you to take a look on this? Thank you very much! |
Playing with it and it's awesome, thanks @simonbence for this pull request! Minor suggestions at the moment:
|
Thanks for the feedback @pvillard31 ! I fixed the order of the metrics. This is something was working the same semi-random way with other history panels as well. Now it orders based on "ordinal" of the metrics. Looks much more organized I also added the thread metrics you were asking for. I found no way to add multiple lines for the diagrams without serious refactors, so I hope it will meet your expectations the way it is. Furthermore I added detailed metrics about the given repositories as well (not only a summary for the given types) |
I've played with it and it looks good to me. That is an awesome addition to NiFi, thanks @simonbence ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @simonbence! Focusing primarily on the front end changes things look good. Just a minor typo and maybe a clarification about how fractions are treated.
...ework-core/src/main/java/org/apache/nifi/controller/status/history/NodeStatusDescriptor.java
Outdated
Show resolved
Hide resolved
@@ -78,6 +79,9 @@ | |||
}, | |||
'DATA_SIZE': function (d) { | |||
return nfCommon.formatDataSize(d); | |||
}, | |||
'FRACTION': function (d) { | |||
return nfCommon.formatFloat(d / 1000000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw there is a metric multiplier that is applied server-side and the comment indicates that this operation is needed before presenting the value to a user. However, I'm not quite following upon first review. Can you elaborate on this a little more and explain why it's needed? Just a little worried that we have a magic number here with no reference to the fraction multiplier being applied server-side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I did find is the metrics functionality depends on long data type as metric data type, especially on Java side. Changing this looks as it would come with serious and possibly risky refactoring. Contrary, the processor load is usually in a range one or two digit number with fractions. I was playing with with some ideas, trying to avoid this method, but in the end they did not work well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, left one comment, plus still a few outstanding comments from Matt.
} | ||
) | ||
public Response getNodeHistory() { | ||
authorizeFlow(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we probably want to move this end-point to the ControllerResource
and authorize against Read - /controller
. The reason being that most of the information in the node status history is really controller level information and is similar to what is returned from ControllerResource
for /controller/cluster
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, thanks for highlighting! I moved it to the ControllerResource.
/** | ||
* The status of a storage repository. | ||
*/ | ||
public class StorageStatus { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be Cloneable
?
@@ -3089,6 +3089,10 @@ recent change to the dataflow has caused a problem and needs to be fixed. The DF | |||
adjust the flow as needed to fix the problem. While NiFi does not have an "undo" feature, the DFM can make new changes to the | |||
dataflow that will fix the problem. | |||
|
|||
Select Node Status History to view instance specific metrics from the last 24 hours or if the instance runs for less time, then | |||
since it has been started. The status history can help the DFM in troubleshooting performance barriers and provides a general |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use performance issues
instead of performance barriers
(which is not used too often in this context).
new ValueReducer<StatusSnapshot, Long>() { | ||
@Override | ||
public Long reduce(final List<StatusSnapshot> values) { | ||
long sumUtilization = 0L; | ||
int invocations = 0; | ||
|
||
for (final StatusSnapshot snapshot : values) { | ||
final Long utilization = snapshot.getStatusMetric(HEAP_UTILIZATION.getDescriptor()); | ||
if (utilization != null) { | ||
sumUtilization += utilization.longValue(); | ||
invocations++; | ||
} | ||
} | ||
|
||
if (invocations == 0) { | ||
return 0L; | ||
} | ||
|
||
return sumUtilization / invocations; | ||
} | ||
}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new ValueReducer<StatusSnapshot, Long>() { | |
@Override | |
public Long reduce(final List<StatusSnapshot> values) { | |
long sumUtilization = 0L; | |
int invocations = 0; | |
for (final StatusSnapshot snapshot : values) { | |
final Long utilization = snapshot.getStatusMetric(HEAP_UTILIZATION.getDescriptor()); | |
if (utilization != null) { | |
sumUtilization += utilization.longValue(); | |
invocations++; | |
} | |
} | |
if (invocations == 0) { | |
return 0L; | |
} | |
return sumUtilization / invocations; | |
} | |
}), | |
new ValueReducer<StatusSnapshot, Long>() { | |
@Override | |
public Long reduce(final List<StatusSnapshot> values) { | |
return (long) values.stream() | |
.map(snapshot -> snapshot.getStatusMetric(HEAP_UTILIZATION.getDescriptor())) | |
.filter(Objects::nonNull) | |
.mapToLong(value -> value) | |
.average() | |
.orElse(0L); | |
} | |
}), |
USED_NON_HEAP( | ||
"usedNonHeap", | ||
"Used Non Heap", | ||
"The current memory usage of non-heap memory that is used by the Java virtual machine.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The current memory usage of non-heap memory that is used by the Java virtual machine.", | |
"The current usage of non-heap memory that is used by the Java virtual machine.", |
OPEN_FILE_HANDLERS( | ||
"openFileHandlers", | ||
"Open File Handlers", | ||
"The current number of open file descriptors used by the Java virtual machine.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file handle
is a more accurate term than file handler
"The current number of open file descriptors used by the Java virtual machine.", | |
"The current number of open file handles used by the Java virtual machine.", |
@Produces(MediaType.APPLICATION_JSON) | ||
@Path("status/history") | ||
@ApiOperation( | ||
value = "Gets configuration history for the node", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value = "Gets configuration history for the node", | |
value = "Gets status history for the node", |
@ApiResponse(code = 409, message = "The request was valid but NiFi was not in the appropriate state to process it. Retrying the same request later may be successful.") | ||
} | ||
) | ||
public Response getNodeHistory() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public Response getNodeHistory() { | |
public Response getNodeStatusHistory() { |
|
||
private long totalThreads; | ||
private long eventDrivenThreads; | ||
private long timeDrivenThreads; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private long timeDrivenThreads; | |
private long timerDrivenThreads; |
/** | ||
* The status of a NiFi node. | ||
*/ | ||
public class NodeStatus implements Cloneable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason for the new DTO classes? Couldn't we use the original SystemDiagnostics
and StorageStatus
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for SystemDiagnostics, NodeStatus consists only a part of it and also consists information from other source. As for StorageStatus, I do check on if we can spare that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked for StorageStatus versus StorageUsage and now I remember: StorageUsage (the original) is from the nifi-framework-core
module, but the places we intend to use StorageStatus is in the nifi-api (as these instances are exposed, together with other metrics related DTOs), so I needed to add these.
@@ -164,6 +183,182 @@ public StatusHistory getRemoteProcessGroupStatusHistory(final String remoteGroup | |||
return getStatusHistory(remoteGroupId, true, DEFAULT_RPG_METRICS, start, end, preferredDataPoints); | |||
} | |||
|
|||
@Override | |||
public StatusHistory getNodeStatusHistory() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be covered with unit tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is covered in VolatileComponentStatusRepositoryTest#testNodeHistory (which should be renamed however)
final List<MetricDescriptor<NodeStatus>> contentStorageStatusDescriptors = new LinkedList<>(); | ||
final List<MetricDescriptor<NodeStatus>> provenanceStorageStatusDescriptors = new LinkedList<>(); | ||
|
||
int ordinal = DEFAULT_NODE_METRICS.size() - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of calculating the counter
,
final AtomicInteger index = new AtomicInteger(DEFAULT_NODE_METRICS.size());
could be used with index.getAndIncrement()
in every new StandardMetricDescriptor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, it's simplified a bit!
final int storageNumber = i; | ||
final int counter = metricDescriptors.size() - 1 + NUMBER_OF_STORAGE_METRICS * contentStorageNumber; | ||
|
||
contentStorageStatusDescriptors.add(new StandardMetricDescriptor<>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use
contentStorageStatusDescriptors.add(new StandardMetricDescriptor<>( | |
metricDescriptors.add(new StandardMetricDescriptor<NodeStatus>( |
With this approach we could get rid of all the intermediary lists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reworked this part a bit. Mainly simplification. Also I extracted the descriptor creation parts to make the flow easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
Merged to main, thanks for this awesome improvement ! |
Signed-off-by: Pierre Villard <pierre.villard.fr@gmail.com> This closes apache#4420.
Signed-off-by: Pierre Villard <pierre.villard.fr@gmail.com> This closes apache#4420.
Signed-off-by: Pierre Villard <pierre.villard.fr@gmail.com> This closes apache#4420.
NIFI-7429
This is a proposal for having historical data about the NiFi node’s status appearing in the NiFi UI. The main purpose was to provide a simple tool makes it possible to check basic performance metrics on the UI.
From implementation perspective the solution is based on the existing status history function which was applied for components like processors. In front end side, the existing code is reused as much as possible, only some minor extension and duplication were needed. The main differences compared to the existing uses were the different trigger (this is reachable from the global menu) and the lack of some parameters like id or group.
The backend side builds on top of VolatileComponentStatusRepository which already responsible for such functions. I tried to add is as an integral part of the existing metric collection, so the frequency of the measurements and the way of triggering is not separated. The metrics themselves are distilled from the SystemDiagnostics and the already collected GarbageCollectionStatus.
Creating the snapshots came with three non-trivial cases I would like to highlight:
Thank you for your time and response!
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically
main
)?Is your initial contribution a single, squashed commit? Additional commits in response to PR reviewer feedback should be made on this branch and pushed to allow change tracking. Do not
squash
or use--force
when pushing to allow for clean monitoring of changes.For code changes:
mvn -Pcontrib-check clean install
at the rootnifi
folder?LICENSE
file, including the mainLICENSE
file undernifi-assembly
?NOTICE
file, including the mainNOTICE
file found undernifi-assembly
?.displayName
in addition to .name (programmatic access) for each of the new properties?For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check GitHub Actions CI for build issues and submit an update to your PR as soon as possible.