Skip to content

NIFI-2613 - Apache POI processor to convert Excel documents to CSV#929

Closed
jdye64 wants to merge 10 commits intoapache:masterfrom
jdye64:NIFI-2613
Closed

NIFI-2613 - Apache POI processor to convert Excel documents to CSV#929
jdye64 wants to merge 10 commits intoapache:masterfrom
jdye64:NIFI-2613

Conversation

@jdye64
Copy link
Contributor

@jdye64 jdye64 commented Aug 24, 2016

No description provided.

@jvwing
Copy link
Contributor

jvwing commented Oct 13, 2016

Reviewing

@jvwing
Copy link
Contributor

jvwing commented Oct 13, 2016

@jdye64 I have a couple of requests from a quick scan. Would it be possible to...

  • Rebase on the latest master?
  • Update the POM versions, using a consistent "1.1.0-SNAPSHOT" format (looks like there is a mix of "1.0-SNAPSHOT" and "1.0.0-SNAPSHOT")?
  • Reference the nifi-poi-nar in the nifi-assembly dependency artifacts?
  • Include Poi in the nifi-assembly NOTICE file?

@jvwing
Copy link
Contributor

jvwing commented Oct 18, 2016

@jdye64 I still had trouble building and running NiFi with this PR.

Would you please try running mvn clean install -Pcontrib-check, verifying that the latest nifi-poi-nar is included in the lib output, and that NiFi starts OK?

@jvwing
Copy link
Contributor

jvwing commented Oct 18, 2016

I have a few more comments from running the processor and reviewing the code:

Processor Annotations

  • The CapabilityDescription says this processor transfers one flowfile for each sheet in the Excel document, but that was not my experience. It aggregated all content into one output flowfile. Is that the intended behavior?
  • Most NiFi attribute names are lowercase, "sheetname" or "sheet.name" instead of "SheetName". I don't believe there is a hard and fast rule there, but I recommend going lowercase if you do not have a strong preference.

Properties

  • PropertyDescriptors should have name as a machine readable key like "extract-sheets" and displayName for the user "Sheets to Extract".

Exception Handling and Logging
Most NiFi components use getLogger().error(... as you do later in the code, rather than printStackTrace(). I haven't tested these error cases, but they appear to be informational rather than stopping or failing the flow. Would getLogger().info(... or getLogger().debug(... be a better fit here? Do users need the full stack trace, or would just the message be helpful enough?

} catch (InvalidFormatException e) {
    e.printStackTrace();
} catch (OpenXML4JException e) {
    e.printStackTrace();
} catch (SAXException e) {
    e.printStackTrace();
}

@joewitt
Copy link
Contributor

joewitt commented Feb 14, 2017

@jdye64 will you have time to address @jvwing findings? Would like to get this across the line. We're trying to get the list of stale PRs burned down.

@jdye64
Copy link
Contributor Author

jdye64 commented Feb 14, 2017

@jvwing wow sorry its been a long time for my response. I've tried to go through and make all of the cleanup suggestions you have made.

Yes, the intention is a single output per sheet in the excel file. Keep in mind that if the sheet is not .xlsx and rather .xls format you would see the behavior you are experiencing. Can you attach the excel doc you were testing with?

I made the change to case of the attribute

Also adjusted the error handling to prevent the end user from being pummeled with a full stack trace.

@joewitt
Copy link
Contributor

joewitt commented Feb 14, 2017

@jvwing you ok continuing to run with this one for the review?

@jvwing
Copy link
Contributor

jvwing commented Feb 14, 2017

@joewitt Yes, thanks for getting us moving, I'll continue reviewing.

@jdye64 Thanks for the updates, I'll take another look shortly.

Apache Kafka
Copyright 2012 The Apache Software Foundation.

(ASLv2) Apache POI
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the notice info.

# See the License for the specific language governing permissions and
# limitations under the License.
org.apache.nifi.processors.kite.StoreInKiteDataset
org.apache.nifi.processors.kite.ConvertAvroToParquet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is unnecessary or part of a different feature? Can this be removed from the PR?

<type>nar</type>
</dependency>
<dependency>
<groupId>org.apache.nifi</groupId>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be 1.2.0-SNAPSHOT?

@CapabilityDescription("Consumes a Microsoft Excel document and converts each worksheet to csv. Each sheet from the incoming Excel " +
"document will generate a new Flowfile that will be output from this processor. Each output Flowfile's contents will be formatted as a csv file " +
"where the each row from the excel sheet is output as a newline in the csv file.")
@WritesAttributes({@WritesAttribute(attribute="sheetname", description="The name of the Excel sheet that this particular row of data came from in the Excel document"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this particular row of data came from should be "this flowfile came from"?

"document will generate a new Flowfile that will be output from this processor. Each output Flowfile's contents will be formatted as a csv file " +
"where the each row from the excel sheet is output as a newline in the csv file.")
@WritesAttributes({@WritesAttribute(attribute="sheetname", description="The name of the Excel sheet that this particular row of data came from in the Excel document"),
@WritesAttribute(attribute="numrows", description="The number of rows in this Excel Sheet"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we clarify if this is the number of rows in the input spreadsheet, or output rows in the flowfile?

}

@Test
public void testProcessASpecificSheetThatDoesNotExist() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be testProcessASpecificSheetThatDoesExist? The spreadsheet does have a "Scorecard" sheet, and the test asserts that rows are processed.

}

/**
* We do want to allow this to be a success relationship because if arbitrary Excel
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a cliffhanger comment, I would like to read the ending :)

@jvwing
Copy link
Contributor

jvwing commented Feb 16, 2017

@jdye64 I'm still confused by the disparity between the annotation docs and the behavior of the processor, especially with respect to multiple sheets in the incoming spreadsheet. From the annotations:

Each sheet from the incoming Excel document will generate a new Flowfile that will be output from this processor

This was not my experience, and I believe the single call to session.clone on line 143, not performed in a loop, guarantees only a single flowfile to the success relationship.

@WritesAttribute(attribute="sheetname", description="The name of the Excel sheet that this particular row of data came from in the Excel document")

But it has always been UNKNOWN for my tests.


I created a unit test to help clarify how this is supposed to work. From the docs, I believe the test below should pass, although it currently fails. Am I misunderstanding?

Also see branch
TwoSheets.xlsx

@Test
public void testMultipleSheetsGeneratesMultipleFlowFiles() throws Exception {

    testRunner.enqueue(new File("src/test/resources/TwoSheets.xlsx").toPath());
    testRunner.run();

    testRunner.assertTransferCount(ConvertExcelToCSVProcessor.SUCCESS, 2);
    testRunner.assertTransferCount(ConvertExcelToCSVProcessor.ORIGINAL, 1);
    testRunner.assertTransferCount(ConvertExcelToCSVProcessor.FAILURE, 0);

    MockFlowFile ffSheetA = testRunner.getFlowFilesForRelationship(ConvertExcelToCSVProcessor.SUCCESS).get(0);
    Long rowsSheetA = new Long(ffSheetA.getAttribute(ConvertExcelToCSVProcessor.ROW_NUM));
    assertTrue(rowsSheetA == 4l);
    assertTrue(ffSheetA.getAttribute(ConvertExcelToCSVProcessor.SHEET_NAME).equalsIgnoreCase("TestSheetA"));

    MockFlowFile ffSheetB = testRunner.getFlowFilesForRelationship(ConvertExcelToCSVProcessor.SUCCESS).get(1);
    Long rowsSheetB = new Long(ffSheetB.getAttribute(ConvertExcelToCSVProcessor.ROW_NUM));
    assertTrue(rowsSheetB == 3l);
    assertTrue(ffSheetB.getAttribute(ConvertExcelToCSVProcessor.SHEET_NAME).equalsIgnoreCase("TestSheetB"));
}

@jdye64
Copy link
Contributor Author

jdye64 commented Feb 16, 2017

@jvwing thank you for pointing this out. Let me look it over right now and I'll make another commit shortly. I had this in another project and I think during the copy/paste I have made an error.

@jdye64
Copy link
Contributor Author

jdye64 commented Feb 24, 2017

@jvwing I indeed had used a much earlier version of the processor. I couldn't find the other version so just ended up doing a rewrite for the most part. Thanks for hanging with me on this.

@jvwing
Copy link
Contributor

jvwing commented Feb 25, 2017

Thanks, @jdye64, this is clearly a big improvement. I will continue reviewing the functionality. Some of the issues identified above are still in the latest commit, and should still be addressed:

  • Root POM references nifi-poi-bundle version 1.0-SNAPSHOT
  • Entry for ConvertAvroToParquet in nifi-kite-bundle
  • Incomplete comment at ConvertExcelToCSVProcessorTest.java:91

@jdye64
Copy link
Contributor Author

jdye64 commented Feb 27, 2017

@jvwing the ConvertAvroToParquet had slipped my mind thanks for the reminder there. The POM references had been fixed in a previous fix at 474a08e#diff-e7171c7f07ed58f51e4643dc340432f8R22

I'm unsure what your preference for reviewing is but let me know if this is getting to the point where you would like for me to squash the commits or not.

@jvwing
Copy link
Contributor

jvwing commented Feb 27, 2017

@jdye64 Thanks for the update. I'll look into why I'm seeing the POM issue. Don't worry about squashing, I'll take care of that when we're ready to merge.

@jvwing
Copy link
Contributor

jvwing commented Feb 27, 2017

@jdye64, I am still having trouble with the root pom.xml file. The version number is still 1.0.0-SNAPSHOT on line 1006. The commit you referenced above did correct version numbers in the nifi-poi-bundle project, but not the root NiFi project POM.

How are you testing the build and the build output? I have experienced two scenarios:

1.) Clean System - Build fails at nifi-assembly, because the root pom.xml specifies nifi-poi-nar 1.0.0-SNAPSHOT, which is not in the local Maven repository:

[ERROR] Failed to execute goal on project nifi-assembly: Could not resolve dependencies for project org.apache.nifi:nifi-assembly:pom:1.2.0-SNAPSHOT: Could not find artifact org.apache.nifi:nifi-poi-nar:nar:1.0.0-SNAPSHOT in apache.snapshots (http://repository.apache.org/snapshots)

2.) Reused System - Build succeeds because the older nifi-poi-nar-1.0.0-SNAPSHOT.nar still exists in the local Maven repository, but the output assembly files contain nifi-poi-nar-1.0.0-SNAPSHOT.nar.

In neither case do I get the lastest NAR output. Would you please try deleting the contents of ~/.m2/repository/org/apache/nifi/nifi-poi* and running a full mvn clean install?

earlier and thought you were referencing /nifi-nar-bundles/pom.xml
rather than /pom.xml. And your previous comment was def. correct as my
build was not failing because of the previous Maven cache. I have
corrected the version issue now.

PS - Your a patient man =)
@jvwing
Copy link
Contributor

jvwing commented Feb 28, 2017

Thanks, @jdye64, that got it. I was able to build just fine with contrib-check and everything. On to new topics:

  • License/Notice - We need a separate LICENSE and NOTICE files for nifi-poi-nar. I was reminded by the recent LICENSE/NOTICE discussion on the mailing list about what to include in NAR bundles, and it applies here. Good news, you've already put together most of the text for the nifi-assembly LICENSE/NOTICE, and there are many examples of the file layout in other NARs. I think a reasonably comparable NAR referencing another Apache project is nifi-ignite-nar.
  • Sheet in Filename - Output filenames all have the UNKNOWN sheet name, like 35200238025098_UNKNOWN.csv. How about adding a sheet name check to one of the unit tests?
  • Error Cases - What will make this fail?

@jdye64
Copy link
Contributor Author

jdye64 commented Mar 15, 2017

@jvwing In regards to error cases here are a list of things which could make this fail.

  • Input data is incorrect. AKA not .xlsx file
  • Empty document. I guess this isn't an error but I think you get what I'm saying there just wouldn't be anything to do.
  • User could specify a list of "desired-sheets" that are not present in the document. Once again not really failure but of course no success relationships would be triggered on the original relationship.
  • The processor will currently only process strings (this includes numerical values, just the POI naming convention) and will ignore things like graphs, images, binary data, etc.

Copy link
Contributor

@jvwing jvwing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend including comments with the exception logging to provide some context for the exception.

try {
outputStream.write(currentContent.getBytes());
} catch (IOException e) {
e.printStackTrace();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change to logging

try {
outputStream.write(("," + currentContent).getBytes());
} catch (IOException e) {
e.printStackTrace();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change to logging

rowCount++;
outputStream.write("\n".getBytes());
} catch (IOException e) {
e.printStackTrace();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change to logging

try {
sheetInputStream.close();
} catch (IOException ioe) {
//nothing further can be done...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to catch it here if we can't do anything? How about not catching the exception and letting the handlers in onTrigger get it?

session.transfer(ff, SUCCESS);

} catch (SAXException saxE) {
getLogger().error(saxE.getMessage());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include a comment and the full exception trace, like getLogger().error("Attempt to do X failed", ex).

parser.parse(sheetSource);
sheetInputStream.close();
} catch (Exception ex) {
getLogger().error(ex.getMessage());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include a comment and the full exception trace, like getLogger().error("Attempt to do X failed", ex).

});

} catch (Exception ex) {
getLogger().error(ex.getMessage());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include a comment and the full exception trace, like getLogger().error("Attempt to do X failed", ex).

}
}
} catch (Exception ex) {
getLogger().error(ex.getMessage());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include a comment and the full exception trace, like getLogger().error("Attempt to do X failed", ex).

@jvwing
Copy link
Contributor

jvwing commented Mar 16, 2017

@jdye64 Thanks, I looked into some of these error cases:

  • Unsupported .xls - Throws exception, routes to 'original'. The error bulletin seems a bit terse:

ConvertExcelToCSVProcessor[id=43426b41-015a-1000-b06e-3a9be79162d1] Package should contain a content type part [M1.13]

  • Blank Sheet - Succeeds, one empty flowfile to 'success' and the original to 'original'. No Errors. Seems correct to me.
  • Empty Workbook - I wasn't able to create an empty workbook manually in Excel.
  • Unmatched Sheet Names - Covered by testNonExistantSpecifiedSheetName, seems correct to me.
  • Diverse Content - I tried a number of bizarre things Excel lets you put in spreadsheets -- images, tables, pivot tables, formulas, hidden sheets, etc. Where appropriate, the processor returned content (cells in tables and pivot tables, last computed formula value), but did not run into errors. Images were ignored.
  • CSV-Breaking Content - The processor does not escape text to enforce a CSV structure. Sheets containing multiline text in a cell, cells with commas, etc. resulted in improperly formed CSV files. I do not object, I understand that's a significant scope increase.

I recommend that you do the following:

  1. Clearly document that the processor only supports .xlsx and NOT .xls files. I know you've already answered questions about this on the dev list, so somebody's going to try it.
  2. For the .xls case, would it be possible to catch the org.apache.poi.openxml4j.exceptions.InvalidFormatException, repackage that with a helpful error message suggesting that only well-formed XLSX files are accepted, and route to failure?
  3. Document that the processor does not escape invalid CSV content.
  4. Logging changes from the code review post above.

@jvwing
Copy link
Contributor

jvwing commented Mar 17, 2017

Thanks for those improvements, @jdye64, I especially like the updated usage doc. Two things on the latest code:

  1. Did you try an .xls file? There is a problem when the flowfile attribute is added in the catch block on line ~195. The NiFi framework throws an exception of it's own, because we can't do session.putAttribute inside an InputStreamCallback for the same flowfile:

java.lang.IllegalStateException: StandardFlowFileRecord[uuid=be192381-9475-4c6d-a6ca-43735e5df271,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1489713904793-1, container=default, section=1], offset=0, length=26112],offset=0,name=./conf/test-xls.xls,size=26112] already in use for an active callback or InputStream created by ProcessSession.read(FlowFile) has not been closed

Something similar happens with the session.putAttribute on ~209. As a result of these exceptions, the session is rolled back and the flowfile is returned to the input queue. I think we can throw an exception, though. So if we caught and rethrew with a different error message, it should work out.

  1. In the failure case, we're routing the flowfile to both 'failure' and 'original'. I didn't realize it earlier, but I now believe this to be unusual in NiFi. Most processors treat failure as an exclusive route, and 'original' as part of the successful happy path. SplitAvro, SplitJson, SplitText, and UnpackContent were some examples I looked at. I doubt that's written in stone. What do you think?

I made a sample code fork with a unit test for .xls and a suggested approach to solving the IllegalStateExceptions, and the failure routing. I did not get the logging to cooperate the way I think it should, but we're not too far off.

@jvwing
Copy link
Contributor

jvwing commented Mar 17, 2017

@jdye64, it looks good! I was just about to merge, giving the files a last once-over before I push the button... and I noticed we have a json.org dependency in the nifi-poi-processors POM:

    <dependency>
        <groupId>org.json</groupId>
        <artifactId>json</artifactId>
        <version>20160810</version>
    </dependency>

Are we using that? json.org is Category X these days, so we can't go with it. But it doesn't appear to be used, and when I deleted it I was able to build, test, and run the processor without obvious issues.

Do you know of any reason I can't just pull that out?

else I was planning on adding and instead removed and must have missed
the dependency.
@asfgit asfgit closed this in d05727b Mar 17, 2017
@jvwing
Copy link
Contributor

jvwing commented Mar 17, 2017

Thanks again for putting this together, @jdye64.

josephxsxn pushed a commit to josephxsxn/nifi that referenced this pull request Mar 23, 2017
Signed-off-by: James Wing <jvwing@gmail.com>

This closes apache#929.
aperepel pushed a commit to aperepel/nifi that referenced this pull request Mar 29, 2017
Signed-off-by: James Wing <jvwing@gmail.com>

This closes apache#929.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants