Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-1705 - Adding AttributesToCSV processor #2711

Closed
wants to merge 4 commits into from

Conversation

joetrite
Copy link
Contributor

New pull request replacing
#1589

commit 9c31e45
Author: Joe Trite <joetrite@gmail.com>
Date:   Mon Mar 13 07:55:19 2017 -0400

    NIFI-1705 Adding AttributesToCSV processor

commit 5e9afa9
Merge: 3177eb1 74cbfc4
Author: Joe Trite <joetrite@gmail.com>
Date:   Sat Mar 4 08:12:39 2017 -0500

    Merge remote-tracking branch 'origin/master'

    # Conflicts:
    #	nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ScanAttribute.java

commit 3177eb1
Author: Joe Trite <joetrite@gmail.com>
Date:   Sat Mar 4 08:09:06 2017 -0500

    NIFI-3497 Converted properties to use displayName, changed validator on demarcator property and created a method for duplicate code.

commit 74cbfc4
Merge: a974f78 f8cad0f
Author: Joe Trite <joetrite@gmail.com>
Date:   Sat Mar 4 07:47:46 2017 -0500

    Merge branch 'master' into master

commit a974f78
Author: Joe Trite <joetrite@gmail.com>
Date:   Sat Mar 4 07:43:02 2017 -0500

    NIFI-3497 Converted properties to use displayName, changed validator on demarcator property and created a method for duplicate code.

commit 1bfaef8
Merge: 65ed46d 89ec68d
Author: Joe Trite <joetrite@gmail.com>
Date:   Fri Mar 3 08:01:59 2017 -0500

    Merge branch 'master' of https://github.com/joetrite/nifi

commit 65ed46d
Author: Joe Trite <joetrite@gmail.com>
Date:   Fri Feb 24 18:09:36 2017 -0500

    NIFI-3497 - fixing Pcontrib issues

commit c5d52cf
Author: Joe Trite <joetrite@gmail.com>
Date:   Thu Feb 23 10:19:01 2017 -0500

    NIFI-3497 - excluding test files

    Adding new test data files to exclude list.

commit b195934
Author: Joe Trite <joetrite@gmail.com>
Date:   Wed Feb 22 16:48:10 2017 -0500

    NIFI-3497 - New dictionary files for test

    Adding new dictionary files to support metadata dictionary option.

commit e296268
Author: Joe Trite <joetrite@gmail.com>
Date:   Wed Feb 22 16:46:13 2017 -0500

    NIFI-3497 test cases for metadata updates

    Adding test cases to support metadata option update.

commit de7e348
Author: Joe Trite <joetrite@gmail.com>
Date:   Wed Feb 22 16:36:08 2017 -0500

    NIFI-3497 - Added metadata option

    Added optional to post additional metadata as new attributed if a match is found in the dictionary.

commit 89ec68d
Author: Joe Trite <joetrite@gmail.com>
Date:   Fri Feb 24 18:09:36 2017 -0500

    NIFI-3497 - fixing Pcontrib issues

commit d714260
Author: Joe Trite <joetrite@gmail.com>
Date:   Thu Feb 23 10:19:01 2017 -0500

    NIFI-3497 - excluding test files

    Adding new test data files to exclude list.

commit a7a7b6a
Author: Joe Trite <joetrite@gmail.com>
Date:   Wed Feb 22 16:48:10 2017 -0500

    NIFI-3497 - New dictionary files for test

    Adding new dictionary files to support metadata dictionary option.

commit 8eb54a5
Author: Joe Trite <joetrite@gmail.com>
Date:   Wed Feb 22 16:46:13 2017 -0500

    NIFI-3497 test cases for metadata updates

    Adding test cases to support metadata option update.

commit f52e1f2
Author: Joe Trite <joetrite@gmail.com>
Date:   Wed Feb 22 16:36:08 2017 -0500

    NIFI-3497 - Added metadata option

    Added optional to post additional metadata as new attributed if a match is found in the dictionary.
Copy link
Contributor

@MikeThomsen MikeThomsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looking solid. Needs a little cleanup.

.build();

public static final PropertyDescriptor NULL_VALUE_FOR_EMPTY_STRING = new PropertyDescriptor.Builder()
.name(("null-value"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have an extra set of parenthesis here.

"string will be placed in the CSV")
.required(true)
.allowableValues("true", "false")
.defaultValue("false")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add StandardValidators.BOOLEAN_VALIDATOR here.

"contained in every FlowFile, should be included in the final CSV value generated. The Attribute List property " +
"overrides this setting.")
.required(true)
.allowableValues("true", "false")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add StandardValidators.BOOLEAN_VALIDATOR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if (ats != null) {
for (String str : ats) {
String trim = str.trim();
result.add(trim);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra indentation.

testRunner.setProperty(AttributesToCSV.NULL_VALUE_FOR_EMPTY_STRING, "false");


ProcessSession session = testRunner.getProcessSessionFactory().createSession();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need this. You can just do this:

Map<String, String> attrs = new HashMap<String, String>(){{
    put("x", "1");
    put("y", "2");
}};

testRunner.enqueue("some-data", attrs);

testRunner.setProperty(AttributesToCSV.NULL_VALUE_FOR_EMPTY_STRING, "false");

ProcessSession session = testRunner.getProcessSessionFactory().createSession();
FlowFile ff = session.create();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See other comment about this. You can wipe out these references and use the cleaner map-driven approach.

.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
.expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
.build();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if destination is the name that should be used here. It almost always is use with relationships and connections.

I know there are other processors that do attributes -or- flow file output, I just can't find one right now. But the name here should be similar to that.
@MikeThomsen do you know of one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied the AttributesToJSON code so i just inherited these names. I'll change it to whatever you guys recommend.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know what, I am wrong, we are consistent in using it inconsistently, please disregard. Sorry for the noise.

.displayName("Attribute List")
.description("Comma separated list of attributes to be included in the resulting CSV. If this value " +
"is left empty then all existing Attributes will be included. This list of attributes is " +
"case sensitive and does not support attribute names that contain commas. If an attribute specified in the list is not found it will be emitted " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do, thanks for the feedback.

@bdesert
Copy link
Contributor

bdesert commented May 23, 2018

@joetrite , I would like to mention it under the current PR - there was a discussion about having "record-aware" output. This will need to be addressed, but I believe is not part of this PR, as there are use cases when having schema will limit capabilities of this processor. I.e. when list of attributes is empty, it is expected to get all flow file attributes on output, but the list is unknown, so a schema cannot be pre-defined.

@mattyb149 , as alternative, we can issue enhancement jira for AttributeToRecord to address schema-specific extracts.

@bdesert
Copy link
Contributor

bdesert commented May 23, 2018

@joetrite , while AttributesToJSON has regex support for attributes, new AttributesToCSV doesn't have it - is there a reason for not keeping consistency?

Copy link
Contributor

@bdesert bdesert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some java related questions/comments

"escaped with double quotes. Any double quote characters in the attribute value are escaped with " +
"another double quote. If the attribute value does not contain a comma, newline or double quote, then the " +
"attribute value is returned unchanged.")
@WritesAttribute(attribute = "CSVAttributes", description = "CSV representation of Attributes")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case a destination is 'content', CSVAttributes won't be written at all (neither null nor empty string). We could mention that in description.

"and the 'Include Core Attributes' property is false, the core attribute will be included. The attribute list " +
"ALWAYS wins.")
.required(false)
.addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the attribute supports EL, can we use NON_EMPTY_EL_VALIDATOR?

"or written in the flowfile content. Writing to flowfile content will overwrite any " +
"existing flowfile content.")
.required(true)
.allowableValues(OUTPUT_NEW_ATTRIBUTE, OUTPUT_OVERWRITE_CONTENT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use AllowableValue Class - it's more flexible than just a string. So, on line 91 for description of "destination" you don't need to put complex explanation, instead - add description for each allowable value separately.

.build();

public static final PropertyDescriptor INCLUDE_CORE_ATTRIBUTES = new PropertyDescriptor.Builder()
.name("include-core-userSpecifiedAttributes")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a name of the property suggests 'userSpecifiedAttributes', which is confusing in conjunction with prefix 'include-core-'. IMO should be just 'include-core-attributes'


private List<PropertyDescriptor> properties;
private Set<Relationship> relationships;
private volatile Set<String> userSpecifiedAttributes;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable isn't being reused as object level variable. Also, you reset it every time in onTrigger, and pass as param to other methods. IMO it's better to keep it as local variable in onTrigger method.

String[] ats = StringUtils.split(attributeList, OUTPUT_SEPARATOR);
if (ats != null) {
for (String str : ats) {
String trim = str.trim();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no point to define local var, since it's not being reused. You can just have result.add(str.trim());


private Set<String> attributeListStringToSet(String attributeList) {
//take the user specified attribute list string and convert to list of strings.
Set<String> result = new HashSet<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By using HashSet you will ignore order of provided attributes, so you won't have control on CSV structure. If you change it to LinkedHashSet - you'll honor the order.

Map<String, String> result;
if (!userSpecifiedAttributes.isEmpty()) {
//the user gave a list of attributes
result = new HashMap<>(userSpecifiedAttributes.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By using HashMap you will ignore order of provided attributes, so you won't have control on CSV structure. If you change it to LinkedHashMap - you'll honor the order. Should come along with a change for HashSet (commented below)

for (String coreAttribute : coreAttributes) {
//make sure this coreAttribute is applicable to this flowfile.
String val = ff.getAttribute(coreAttribute);
if(ff.getAttributes().containsKey(coreAttribute)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since ff.getAttributes() creates new object every time you call it, it's better to call it once outside the loop (before line 186).

final int atrListSize = atrList.values().size() -1;
final StringBuilder sb = new StringBuilder();
for (final String val : atrList.values()) {
sb.append(StringEscapeUtils.escapeCsv(val));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like StringEscapeUtils class is deprecated. It is suggested to use org.apache.commons.text.StringEscapeUtils instead of lang3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@joetrite
Copy link
Contributor Author

@bdesert it looks like the regex support was added to AttributesToJSON after copied the code. I'll add regex support to this processor and check if anything else has changed.

Copy link
Contributor

@bdesert bdesert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Test suite has been updated to test regex based attribute list. I'll build and test it later tonight on local nifi, and will update with results.

return;
}
if(context.getProperty(ATTRIBUTES_REGEX).isSet()) {
pattern = Pattern.compile(context.getProperty(ATTRIBUTES_REGEX).evaluateAttributeExpressions(original).getValue());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Local usage of variable only. Should not be defined as field/instance member (same as attributeList).

@bdesert
Copy link
Contributor

bdesert commented May 25, 2018

@joetrite , @MikeThomsen ,
So, I tested the processor. It works OK.
But I've got a question.
In JSON it's easy - json structure will include both attribute name and attribute value. in CSV - there will be only values on the output. Don't you think it would be useful to add a header, or attribute with avro-like generated schema? I agree that specified attributes will be in provided order, but if regex is used - then user won't be able to relate a value to a name. The same is about core attributes - since we add only existing not empty.
If we decide to add avro-like schema, we gonna have problems - attribute names could be non Avro-safe.

@joetrite
Copy link
Contributor Author

@bdesert good catch, we should add the option to push the header to attribute or content based on the selection. The header should also be csv format so that when the header and data are in the output file they will have the same schema.

Copy link
Contributor

@bdesert bdesert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add test cases to cover core attributes in schema details attribute/content (both new test cases have only regex coverage)

public class AttributesToCSV extends AbstractProcessor {
private static final String OUTPUT_ATTRIBUTE_NAME = "CSVAttributes";
private static final String DATA_ATTRIBUTE_NAME = "CSVData";
private static final String SCHEMA_ATTRIBUTE_NAME = "CSVSchema";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This attribute should be declared @ WritesAttribute(....)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, there are style-check errors related to "if" one-line statement.

@bdesert
Copy link
Contributor

bdesert commented May 28, 2018

@joetrite , no more comments from my side, thanks for addressing code review comments!
@MikeThomsen , LGTM. Could you please give a second look on it now? I believe it's ready, but there are issues with build, which seems not related to this PR.

"or written in the flowfile content.")
.required(true)
.allowableValues(OUTPUT_NEW_ATTRIBUTE, OUTPUT_OVERWRITE_CONTENT)
.defaultValue(OUTPUT_NEW_ATTRIBUTE.getDisplayName())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "write to attribute" is a cool concept, I thought it might also be beneficial in other areas (including UpdateAttribute), so I wrote up https://issues.apache.org/jira/browse/NIFI-5276 to cover it (and will soon have a PR). Were it included, would it preclude you from adding such a property to this processor? If not then no worries, just wanted to mention it, in case the other Jira would take care of that use case instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind about NIFI-5276, turns out you can already do this with allMatchingAttributes. Since that is the case, do you think it is still useful to have the same type of feature here, or should it just support flowfile content as a destination? The latter would make it a little easier to configure the processor as there wouldn't be a Destination property, and the associated doc would be unnecessary as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are writting two attributes, the CSVSchema(attribute names) and the CSVData(attribute values). it looks like allMatchingAttributes will take care of the CSVData attribute nicely, but I don't see any option to handle the schema attribute. I don't see anything like allMatchingAttributeNames, which would take care of the CSVSchema part.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. It would be nice to have something in EL that matches on attribute names and returns them instead of evaluating them to their values, I'll take a look at what that would entail. In the meantime it's cool to have that here, running some final tests but things are looking good :)

@mattyb149
Copy link
Contributor

+1 LGTM, thanks all for the reviews and @joetrite for the new feature! I ran unit tests and tried various combinations of options on a live NiFi, everything looks good, merging to master.

@asfgit asfgit closed this in f536487 Jun 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants