@MikeThomsen
Contributor

@MikeThomsen MikeThomsen commented May 24, 2018

Thank you for submitting a contribution to Apache NiFi.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

@ottobackwards
Contributor

@MikeThomsen this is cool.

The only thing it makes me wonder is whether this kind of data could be generated automatically and sent to a repository, almost like a new (or actual) reporting task.

This seems like it lends itself to time series analysis like other things.

NiFi doesn't necessarily have to provide that repo.

@MikeThomsen
Contributor Author

I'm not sure how that'd work out because you need to actually read the flowfiles and calculate the stats.

@ottobackwards
Contributor

Probably the reader and writer would have to get some context passed in where they can track state or increment stats, and then be configured with a 'reporting' task to send the stats from a given context to

@MikeThomsen
Contributor Author

MikeThomsen commented May 24, 2018

Not sure if I like that approach, because it could get pretty complicated to make the hand-off not impact the processing to any meaningful degree. The beauty of what we did is that it just puts the data into the provenance repository, and there aren't that many flowfiles to track: maybe a few hundred thousand over the entire data set if we use appropriately sized batches from GetMongo. NiFi handles that like a champ, and s2s prov reporting has no problem rapid-fire sending it over to our tracking instance of NiFi.

@ottobackwards
Contributor

That makes sense, just thinking it through, and obviously I don't understand everything as well ;)
I guess I never thought of provenance as including perf and stats stuff, so putting it there seems like using provenance just because it is the thing that is available, co-opting it so to speak.

@MikeThomsen
Contributor Author

I'm not too familiar with the deep internals of the framework either. What we've seen is that with the records API it just makes sense to leverage the provenance system because it already tracks the attributes in a clean way you can leverage for stuff like giving managers a nice little ELK dashboard for the warm fuzzies.

Contributor

@ottobackwards ottobackwards left a comment

@MikeThomsen thanks for the contribution, just a couple of nits.


protected Map<String, String> getStats(FlowFile input, Map<String, RecordPath> paths, ProcessContext context, ProcessSession session) {
    try (InputStream is = session.read(input)) {
        RecordReaderFactory factory = context.getProperty(RECORD_READER).asControllerService(RecordReaderFactory.class);
Contributor

Using input and is as variable names is confusing here; can we name them closer to what they are to keep them straight?
flowFile and inputStream?
There is only one FlowFile to track in this processor, so no need to get fancy ;)

Contributor Author

Done.

for (Map.Entry<String, RecordPath> entry : paths.entrySet()) {
    RecordPathResult result = entry.getValue().evaluate(record);
    Optional<FieldValue> value = result.getSelectedFields().findFirst();
    if (value.isPresent() && value.get().getValue() != null) {
Contributor

Why is fieldValue needed? There are a lot of *values in this loop.

Contributor Author

I guess we can drop that.

    }

    recordCount++;
}
Contributor

Magic Strings

Contributor Author

?

Contributor

"record_count" => constant field maybe?

Contributor

not a biggie

Contributor Author

Done.

@MikeThomsen
Contributor Author

@ijokarumawak can you review?

Member

@ijokarumawak ijokarumawak left a comment

@MikeThomsen Thanks for adding this, it looks really useful! I haven't run it yet, but I looked through the code and added a few comments as the first review cycle. Please check those out. Thanks!

.name("record-stats-reader")
.displayName("Record Reader")
.description("A record reader to use for reading the records.")
.addValidator(Validator.VALID)
Member

No validator is required for a ControllerService property.

Contributor Author

Done.

String approxValue = value.get().getValue().toString();
String key = String.format("%s.%s", entry.getKey(), approxValue);
Integer stat = retVal.containsKey(key) ? retVal.get(key) : 0;
Integer baseStat = retVal.containsKey(entry.getKey()) ? retVal.get(entry.getKey()) : 0;
Member

Trivial: the Map.getOrDefault method can make these statements simpler.

Contributor Author

Done.
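
As a rough illustration of the getOrDefault suggestion, the counting step could look something like the sketch below. The class and method names (ValueCountSketch, count) are hypothetical and not taken from the PR; only the "<field>" and "<field>.<value>" key format comes from the snippet under review.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not the processor's actual code.
final class ValueCountSketch {

    // Counts occurrences of each value for one named field.
    // Produces a "<field>" total and one "<field>.<value>" entry per distinct value,
    // mirroring the key format in the reviewed snippet.
    static Map<String, Integer> count(String field, List<String> values) {
        Map<String, Integer> stats = new LinkedHashMap<>();
        for (String value : values) {
            if (value == null) {
                continue; // the reviewed code also skips null field values
            }
            String key = String.format("%s.%s", field, value);
            // getOrDefault replaces the containsKey(...) ? get(...) : 0 ternaries.
            stats.put(key, stats.getOrDefault(key, 0) + 1);
            stats.put(field, stats.getOrDefault(field, 0) + 1);
        }
        return stats;
    }

    public static void main(String[] args) {
        System.out.println(count("sport", Arrays.asList("Soccer", "Football", "Soccer")));
    }
}
```

Map.merge(key, 1, Integer::sum) would be an equally valid way to write the increment.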

RecordPathResult result = entry.getValue().evaluate(record);
Optional<FieldValue> value = result.getSelectedFields().findFirst();
if (value.isPresent() && value.get().getValue() != null) {
    String approxValue = value.get().getValue().toString();
Member

Idea for improvement: if the RecordPath result is a number, would providing counts per value be useful? Rather, in addition to the count (the current baseStat), I'd like to see min, max, sum and optionally the sum of squared x.

Contributor Author

Let's do that as a separate ticket if you don't mind.

Member

Yes, we can add this later.

I had hoped it would be included when this processor is released, since changing it afterwards would require adding a new configuration property. Well, counting per value makes sense in some cases, too. E.g. if a number field represents some category information, or acts as an enum, then the user would expect the number of occurrences per number value.
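
For reference, the numeric stats discussed in this thread (count, min, max, sum and sum of squares) can be gathered in a single pass. The sketch below is only an assumption about how a follow-up might do it: the class name, the method name and the "sum" and "sumOfSquares" attribute suffixes are hypothetical, while the recordStats.<field>.min and .max naming follows the suggestion made elsewhere in this review.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of this PR.
final class NumericStatsSketch {

    // Computes count, min, max, sum and sum of squares in one pass and
    // returns them as string attributes under "recordStats.<field>.".
    static Map<String, String> summarize(String field, List<Double> values) {
        long count = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        double sumOfSquares = 0.0;

        for (Double value : values) {
            if (value == null) {
                continue; // ignore missing values
            }
            count++;
            min = Math.min(min, value);
            max = Math.max(max, value);
            sum += value;
            sumOfSquares += value * value;
        }

        String prefix = "recordStats." + field + ".";
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put(prefix + "count", Long.toString(count));
        if (count > 0) {
            // min/max/sum are only meaningful when at least one value was seen
            attributes.put(prefix + "min", Double.toString(min));
            attributes.put(prefix + "max", Double.toString(max));
            attributes.put(prefix + "sum", Double.toString(sum));
            attributes.put(prefix + "sumOfSquares", Double.toString(sumOfSquares));
        }
        return attributes;
    }

    public static void main(String[] args) {
        System.out.println(summarize("age", Arrays.asList(21.0, 35.0, 42.0)));
    }
}
```

Keeping the sum of squares alongside the sum and count lets a downstream consumer derive the variance without re-reading the records.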

return retVal.entrySet().stream()
        .collect(Collectors.toMap(
                e -> e.getKey(),
                e -> e.getValue().toString()
Member

For non-number values, such as the 'sport' dataset, the list can be huge, on the order of thousands or more. That can be too big to store as FlowFile attributes. At least we need a fixed limit on the number of attributes this processor can add. It would be more helpful if we could limit the number of values (i.e. soccer, football, basketball, etc.) to N and report only the N most frequent ones, to give better stats with a long-tail dataset.

Contributor Author

Ok. We can add a limit there.
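
One possible way to apply such a limit is to sort the per-value counts in descending order and keep only the first N entries. This is a minimal sketch under that assumption, not the implementation that was merged; TopNSketch, topN and the limit parameter are hypothetical names.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not the merged implementation.
final class TopNSketch {

    // Keeps the `limit` entries with the highest counts.
    // Ties are broken by key so the result is deterministic.
    static Map<String, Integer> topN(Map<String, Integer> counts, int limit) {
        Comparator<Map.Entry<String, Integer>> byCountDescThenKey = (a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        };
        return counts.entrySet().stream()
                .sorted(byCountDescThenKey)
                .limit(limit)
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (first, second) -> first, // keys are unique, so never invoked
                        LinkedHashMap::new));     // preserve the sorted order
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("Soccer", 3);
        counts.put("Football", 2);
        counts.put("Curling", 1);
        System.out.println(topN(counts, 2));
    }
}
```

With a cap like this, the number of attributes written per FlowFile stays bounded even for long-tail data.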

@ijokarumawak
Member

@MikeThomsen Thanks for the updates. I will continue with a closer review of this when I have time, probably tomorrow.

@MikeThomsen
Contributor Author

@ijokarumawak Anything to add now or can we close this out?

Member

@ijokarumawak ijokarumawak left a comment

@MikeThomsen I have built and tested it locally. I posted several comments, but I believe this will be the final review cycle before getting this merged. Thanks!

@WritesAttributes({
    @WritesAttribute(attribute = RecordStats.RECORD_COUNT_ATTR, description = "A count of the records in the record set in the flowfile.")
})
public class RecordStats extends AbstractProcessor {
Member

Since NiFi uses a naming convention in which processor names start with a verb, this processor should be named something like 'CalculateRecordStats'.

Contributor Author

Done.

@WritesAttributes({
    @WritesAttribute(attribute = RecordStats.RECORD_COUNT_ATTR, description = "A count of the records in the record set in the flowfile.")
})
public class RecordStats extends AbstractProcessor {
Member

The org.apache.nifi.processor.Processor file has not been updated, so this new processor cannot be used from a NiFi flow.

Contributor Author

Done.

.displayName("Record Reader")
.description("A record reader to use for reading the records.")
.identifiesControllerService(RecordReaderFactory.class)
.build();
Member

'Record Reader' should be required.

Contributor Author

Done.

"user-defined criteria on subsets of the record set.")
@InputRequirement(InputRequirement.Requirement.INPUT_REQUIRED)
@WritesAttributes({
    @WritesAttribute(attribute = RecordStats.RECORD_COUNT_ATTR, description = "A count of the records in the record set in the flowfile.")
Member

Please add recordStats.<User Defined Property Name>.count and recordStats.<User Defined Property Name>.count.<value>

Contributor Author

Changed the names, but done.

<li>record_count: 5</li>
<li>sport: 5</li>
<li>sport.Soccer: 3</li>
<li>sport.Football: 2</li>
Member

These property names should be more self-descriptive and should not overlap with other property namespaces.
I suggest the following names:

current           suggestion
record_count      recordStats.count
sport             recordStats.sport.count
sport.Soccer      recordStats.sport.count.Soccer
sport.Football    recordStats.sport.count.Football

Then we can add more stats later, such as recordStats.age.min or recordStats.age.max ... etc

Contributor Author

Done.
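
To make the suggested naming scheme concrete, the attribute names could be built by small helpers like the ones below. This sketch follows the names in the reviewer's suggestion; the names that were finally merged may differ, and the class and method names here are hypothetical.

```java
// Hypothetical helper for illustration only; follows the names suggested in this review.
final class AttributeNameSketch {

    static final String PREFIX = "recordStats";

    // Total number of records: recordStats.count
    static String recordCount() {
        return PREFIX + ".count";
    }

    // Records with a non-null value for a user-defined property: recordStats.<property>.count
    static String fieldCount(String property) {
        return PREFIX + "." + property + ".count";
    }

    // Records with a specific value: recordStats.<property>.count.<value>
    static String valueCount(String property, String value) {
        return fieldCount(property) + "." + value;
    }

    public static void main(String[] args) {
        System.out.println(recordCount());
        System.out.println(fieldCount("sport"));
        System.out.println(valueCount("sport", "Soccer"));
    }
}
```

Keeping everything under one prefix leaves room for later additions such as recordStats.age.min without clashing with other attributes.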


protected Map<String, RecordPath> getRecordPaths(ProcessContext context) {
    return context.getProperties().keySet()
            .stream().filter(p -> p.isDynamic() && !p.getName().contains(RECORD_READER.getName()))
Member

I think the !p.getName().contains(RECORD_READER.getName()) part is not necessary.

Contributor Author

Done.

.collect(Collectors.toMap(
        e -> e.getName(),
        e -> {
            String val = context.getProperty(e).getValue();
Member

I'd expect dynamic properties to support EL evaluated against the FlowFile.

Contributor Author

Done.

@ijokarumawak
Member

@MikeThomsen Thanks for the updates. LGTM, +1! Merging.

@asfgit asfgit closed this in 1803c15 Jun 11, 2018
@ijokarumawak
Member

For future work, I've submitted this to add stats for numerical values.
https://issues.apache.org/jira/browse/NIFI-5291

@MikeThomsen MikeThomsen deleted the NIFI-5231 branch August 14, 2024 21:14