-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NIFI-1705 - Adding AttributesToCSV processor #2711
Conversation
commit 9c31e45 Author: Joe Trite <joetrite@gmail.com> Date: Mon Mar 13 07:55:19 2017 -0400 NIFI-1705 Adding AttributesToCSV processor commit 5e9afa9 Merge: 3177eb1 74cbfc4 Author: Joe Trite <joetrite@gmail.com> Date: Sat Mar 4 08:12:39 2017 -0500 Merge remote-tracking branch 'origin/master' # Conflicts: # nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ScanAttribute.java commit 3177eb1 Author: Joe Trite <joetrite@gmail.com> Date: Sat Mar 4 08:09:06 2017 -0500 NIFI-3497 Converted properties to use displayName, changed validator on demarcator property and created a method for duplicate code. commit 74cbfc4 Merge: a974f78 f8cad0f Author: Joe Trite <joetrite@gmail.com> Date: Sat Mar 4 07:47:46 2017 -0500 Merge branch 'master' into master commit a974f78 Author: Joe Trite <joetrite@gmail.com> Date: Sat Mar 4 07:43:02 2017 -0500 NIFI-3497 Converted properties to use displayName, changed validator on demarcator property and created a method for duplicate code. commit 1bfaef8 Merge: 65ed46d 89ec68d Author: Joe Trite <joetrite@gmail.com> Date: Fri Mar 3 08:01:59 2017 -0500 Merge branch 'master' of https://github.com/joetrite/nifi commit 65ed46d Author: Joe Trite <joetrite@gmail.com> Date: Fri Feb 24 18:09:36 2017 -0500 NIFI-3497 - fixing Pcontrib issues commit c5d52cf Author: Joe Trite <joetrite@gmail.com> Date: Thu Feb 23 10:19:01 2017 -0500 NIFI-3497 - excluding test files Adding new test data files to exclude list. commit b195934 Author: Joe Trite <joetrite@gmail.com> Date: Wed Feb 22 16:48:10 2017 -0500 NIFI-3497 - New dictionary files for test Adding new dictionary files to support metadata dictionary option. commit e296268 Author: Joe Trite <joetrite@gmail.com> Date: Wed Feb 22 16:46:13 2017 -0500 NIFI-3497 test cases for metadata updates Adding test cases to support metadata option update. commit de7e348 Author: Joe Trite <joetrite@gmail.com> Date: Wed Feb 22 16:36:08 2017 -0500 NIFI-3497 - Added metadata option Added optional to post additional metadata as new attributed if a match is found in the dictionary. commit 89ec68d Author: Joe Trite <joetrite@gmail.com> Date: Fri Feb 24 18:09:36 2017 -0500 NIFI-3497 - fixing Pcontrib issues commit d714260 Author: Joe Trite <joetrite@gmail.com> Date: Thu Feb 23 10:19:01 2017 -0500 NIFI-3497 - excluding test files Adding new test data files to exclude list. commit a7a7b6a Author: Joe Trite <joetrite@gmail.com> Date: Wed Feb 22 16:48:10 2017 -0500 NIFI-3497 - New dictionary files for test Adding new dictionary files to support metadata dictionary option. commit 8eb54a5 Author: Joe Trite <joetrite@gmail.com> Date: Wed Feb 22 16:46:13 2017 -0500 NIFI-3497 test cases for metadata updates Adding test cases to support metadata option update. commit f52e1f2 Author: Joe Trite <joetrite@gmail.com> Date: Wed Feb 22 16:36:08 2017 -0500 NIFI-3497 - Added metadata option Added optional to post additional metadata as new attributed if a match is found in the dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looking solid. Needs a little cleanup.
.build(); | ||
|
||
public static final PropertyDescriptor NULL_VALUE_FOR_EMPTY_STRING = new PropertyDescriptor.Builder() | ||
.name(("null-value")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have an extra set of parenthesis here.
"string will be placed in the CSV") | ||
.required(true) | ||
.allowableValues("true", "false") | ||
.defaultValue("false") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add StandardValidators.BOOLEAN_VALIDATOR here.
"contained in every FlowFile, should be included in the final CSV value generated. The Attribute List property " + | ||
"overrides this setting.") | ||
.required(true) | ||
.allowableValues("true", "false") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add StandardValidators.BOOLEAN_VALIDATOR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
if (ats != null) { | ||
for (String str : ats) { | ||
String trim = str.trim(); | ||
result.add(trim); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra indentation.
testRunner.setProperty(AttributesToCSV.NULL_VALUE_FOR_EMPTY_STRING, "false"); | ||
|
||
|
||
ProcessSession session = testRunner.getProcessSessionFactory().createSession(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need this. You can just do this:
Map<String, String> attrs = new HashMap<String, String>(){{
put("x", "1");
put("y", "2");
}};
testRunner.enqueue("some-data", attrs);
testRunner.setProperty(AttributesToCSV.NULL_VALUE_FOR_EMPTY_STRING, "false"); | ||
|
||
ProcessSession session = testRunner.getProcessSessionFactory().createSession(); | ||
FlowFile ff = session.create(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See other comment about this. You can wipe out these references and use the cleaner map-driven approach.
.addValidator(StandardValidators.NON_EMPTY_VALIDATOR) | ||
.expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES) | ||
.build(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if destination is the name that should be used here. It almost always is use with relationships and connections.
I know there are other processors that do attributes -or- flow file output, I just can't find one right now. But the name here should be similar to that.
@MikeThomsen do you know of one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied the AttributesToJSON code so i just inherited these names. I'll change it to whatever you guys recommend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You know what, I am wrong, we are consistent in using it inconsistently, please disregard. Sorry for the noise.
.displayName("Attribute List") | ||
.description("Comma separated list of attributes to be included in the resulting CSV. If this value " + | ||
"is left empty then all existing Attributes will be included. This list of attributes is " + | ||
"case sensitive and does not support attribute names that contain commas. If an attribute specified in the list is not found it will be emitted " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use http://commons.apache.org/proper/commons-lang/javadocs/api-release/org/apache/commons/lang3/StringEscapeUtils.html as you do with values for the attributes and headers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do, thanks for the feedback.
@joetrite , I would like to mention it under the current PR - there was a discussion about having "record-aware" output. This will need to be addressed, but I believe is not part of this PR, as there are use cases when having schema will limit capabilities of this processor. I.e. when list of attributes is empty, it is expected to get all flow file attributes on output, but the list is unknown, so a schema cannot be pre-defined. @mattyb149 , as alternative, we can issue enhancement jira for AttributeToRecord to address schema-specific extracts. |
@joetrite , while AttributesToJSON has regex support for attributes, new AttributesToCSV doesn't have it - is there a reason for not keeping consistency? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some java related questions/comments
"escaped with double quotes. Any double quote characters in the attribute value are escaped with " + | ||
"another double quote. If the attribute value does not contain a comma, newline or double quote, then the " + | ||
"attribute value is returned unchanged.") | ||
@WritesAttribute(attribute = "CSVAttributes", description = "CSV representation of Attributes") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case a destination is 'content', CSVAttributes won't be written at all (neither null nor empty string). We could mention that in description.
"and the 'Include Core Attributes' property is false, the core attribute will be included. The attribute list " + | ||
"ALWAYS wins.") | ||
.required(false) | ||
.addValidator(StandardValidators.NON_EMPTY_VALIDATOR) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the attribute supports EL, can we use NON_EMPTY_EL_VALIDATOR?
"or written in the flowfile content. Writing to flowfile content will overwrite any " + | ||
"existing flowfile content.") | ||
.required(true) | ||
.allowableValues(OUTPUT_NEW_ATTRIBUTE, OUTPUT_OVERWRITE_CONTENT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use AllowableValue Class - it's more flexible than just a string. So, on line 91 for description of "destination" you don't need to put complex explanation, instead - add description for each allowable value separately.
.build(); | ||
|
||
public static final PropertyDescriptor INCLUDE_CORE_ATTRIBUTES = new PropertyDescriptor.Builder() | ||
.name("include-core-userSpecifiedAttributes") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a name of the property suggests 'userSpecifiedAttributes', which is confusing in conjunction with prefix 'include-core-'. IMO should be just 'include-core-attributes'
|
||
private List<PropertyDescriptor> properties; | ||
private Set<Relationship> relationships; | ||
private volatile Set<String> userSpecifiedAttributes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable isn't being reused as object level variable. Also, you reset it every time in onTrigger, and pass as param to other methods. IMO it's better to keep it as local variable in onTrigger method.
String[] ats = StringUtils.split(attributeList, OUTPUT_SEPARATOR); | ||
if (ats != null) { | ||
for (String str : ats) { | ||
String trim = str.trim(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no point to define local var, since it's not being reused. You can just have result.add(str.trim());
|
||
private Set<String> attributeListStringToSet(String attributeList) { | ||
//take the user specified attribute list string and convert to list of strings. | ||
Set<String> result = new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By using HashSet you will ignore order of provided attributes, so you won't have control on CSV structure. If you change it to LinkedHashSet - you'll honor the order.
Map<String, String> result; | ||
if (!userSpecifiedAttributes.isEmpty()) { | ||
//the user gave a list of attributes | ||
result = new HashMap<>(userSpecifiedAttributes.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By using HashMap you will ignore order of provided attributes, so you won't have control on CSV structure. If you change it to LinkedHashMap - you'll honor the order. Should come along with a change for HashSet (commented below)
for (String coreAttribute : coreAttributes) { | ||
//make sure this coreAttribute is applicable to this flowfile. | ||
String val = ff.getAttribute(coreAttribute); | ||
if(ff.getAttributes().containsKey(coreAttribute)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since ff.getAttributes() creates new object every time you call it, it's better to call it once outside the loop (before line 186).
final int atrListSize = atrList.values().size() -1; | ||
final StringBuilder sb = new StringBuilder(); | ||
for (final String val : atrList.values()) { | ||
sb.append(StringEscapeUtils.escapeCsv(val)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like StringEscapeUtils class is deprecated. It is suggested to use org.apache.commons.text.StringEscapeUtils instead of lang3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@bdesert it looks like the regex support was added to AttributesToJSON after copied the code. I'll add regex support to this processor and check if anything else has changed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Test suite has been updated to test regex based attribute list. I'll build and test it later tonight on local nifi, and will update with results.
return; | ||
} | ||
if(context.getProperty(ATTRIBUTES_REGEX).isSet()) { | ||
pattern = Pattern.compile(context.getProperty(ATTRIBUTES_REGEX).evaluateAttributeExpressions(original).getValue()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Local usage of variable only. Should not be defined as field/instance member (same as attributeList).
@joetrite , @MikeThomsen , |
@bdesert good catch, we should add the option to push the header to attribute or content based on the selection. The header should also be csv format so that when the header and data are in the output file they will have the same schema. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add test cases to cover core attributes in schema details attribute/content (both new test cases have only regex coverage)
public class AttributesToCSV extends AbstractProcessor { | ||
private static final String OUTPUT_ATTRIBUTE_NAME = "CSVAttributes"; | ||
private static final String DATA_ATTRIBUTE_NAME = "CSVData"; | ||
private static final String SCHEMA_ATTRIBUTE_NAME = "CSVSchema"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This attribute should be declared @ WritesAttribute(....)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, there are style-check errors related to "if" one-line statement.
@joetrite , no more comments from my side, thanks for addressing code review comments! |
"or written in the flowfile content.") | ||
.required(true) | ||
.allowableValues(OUTPUT_NEW_ATTRIBUTE, OUTPUT_OVERWRITE_CONTENT) | ||
.defaultValue(OUTPUT_NEW_ATTRIBUTE.getDisplayName()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "write to attribute" is a cool concept, I thought it might also be beneficial in other areas (including UpdateAttribute), so I wrote up https://issues.apache.org/jira/browse/NIFI-5276 to cover it (and will soon have a PR). Were it included, would it preclude you from adding such a property to this processor? If not then no worries, just wanted to mention it, in case the other Jira would take care of that use case instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind about NIFI-5276, turns out you can already do this with allMatchingAttributes. Since that is the case, do you think it is still useful to have the same type of feature here, or should it just support flowfile content as a destination? The latter would make it a little easier to configure the processor as there wouldn't be a Destination property, and the associated doc would be unnecessary as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are writting two attributes, the CSVSchema(attribute names) and the CSVData(attribute values). it looks like allMatchingAttributes will take care of the CSVData attribute nicely, but I don't see any option to handle the schema attribute. I don't see anything like allMatchingAttributeNames, which would take care of the CSVSchema part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. It would be nice to have something in EL that matches on attribute names and returns them instead of evaluating them to their values, I'll take a look at what that would entail. In the meantime it's cool to have that here, running some final tests but things are looking good :)
+1 LGTM, thanks all for the reviews and @joetrite for the new feature! I ran unit tests and tried various combinations of options on a live NiFi, everything looks good, merging to master. |
New pull request replacing
#1589