NIFI-2613 - Apache POI processor to convert Excel documents to CSV by jdye64 · Pull Request #929 · apache/nifi

jdye64 · 2016-08-24T15:53:05Z

No description provided.

jvwing · 2016-10-13T17:04:56Z

Reviewing

jvwing · 2016-10-13T20:45:05Z

@jdye64 I have a couple of requests from a quick scan. Would it be possible to...

Rebase on the latest master?
Update the POM versions, using a consistent "1.1.0-SNAPSHOT" format (looks like there is a mix of "1.0-SNAPSHOT" and "1.0.0-SNAPSHOT")?
Reference the nifi-poi-nar in the nifi-assembly dependency artifacts?
Include Poi in the nifi-assembly NOTICE file?

jvwing · 2016-10-18T20:06:18Z

@jdye64 I still had trouble building and running NiFi with this PR.

root POM still references nifi-poi-bundle version "1.0-SNAPSHOT"
nifi-poi-nar still has a dependency on nifi-poi-processors version "1.0-SNAPSHOT"
NiFi fails to start processing the entry for org.apache.nifi.processors.kite.ConvertAvroToParquet in nifi-nar-bundles/nifi-kite-bundle/nifi-kite-processors/src/main/resources/META-INF/services/org.apache.nifi.processor.Processor (unintentionally included in commit?)

Would you please try running mvn clean install -Pcontrib-check, verifying that the latest nifi-poi-nar is included in the lib output, and that NiFi starts OK?

jvwing · 2016-10-18T20:23:18Z

I have a few more comments from running the processor and reviewing the code:

Processor Annotations

The CapabilityDescription says this processor transfers one flowfile for each sheet in the Excel document, but that was not my experience. It aggregated all content into one output flowfile. Is that the intended behavior?
Most NiFi attribute names are lowercase, "sheetname" or "sheet.name" instead of "SheetName". I don't believe there is a hard and fast rule there, but I recommend going lowercase if you do not have a strong preference.

Properties

PropertyDescriptors should have name as a machine readable key like "extract-sheets" and displayName for the user "Sheets to Extract".

Exception Handling and Logging
Most NiFi components use getLogger().error(... as you do later in the code, rather than printStackTrace(). I haven't tested these error cases, but they appear to be informational rather than stopping or failing the flow. Would getLogger().info(... or getLogger().debug(... be a better fit here? Do users need the full stack trace, or would just the message be helpful enough?

} catch (InvalidFormatException e) {
    e.printStackTrace();
} catch (OpenXML4JException e) {
    e.printStackTrace();
} catch (SAXException e) {
    e.printStackTrace();
}

nidi-assembly, and added ASLv2 notice to Apache POI

joewitt · 2017-02-14T19:52:39Z

@jdye64 will you have time to address @jvwing findings? Would like to get this across the line. We're trying to get the list of stale PRs burned down.

jdye64 · 2017-02-14T19:58:06Z

@jvwing wow sorry its been a long time for my response. I've tried to go through and make all of the cleanup suggestions you have made.

Yes, the intention is a single output per sheet in the excel file. Keep in mind that if the sheet is not .xlsx and rather .xls format you would see the behavior you are experiencing. Can you attach the excel doc you were testing with?

I made the change to case of the attribute

Also adjusted the error handling to prevent the end user from being pummeled with a full stack trace.

joewitt · 2017-02-14T20:01:54Z

@jvwing you ok continuing to run with this one for the review?

jvwing · 2017-02-14T21:07:51Z

@joewitt Yes, thanks for getting us moving, I'll continue reviewing.

@jdye64 Thanks for the updates, I'll take another look shortly.

jvwing · 2017-02-15T21:49:21Z

nifi-assembly/NOTICE

      Apache Kafka
      Copyright 2012 The Apache Software Foundation.

+  (ASLv2) Apache POI


Thanks for adding the notice info.

jvwing · 2017-02-15T21:50:24Z

...ifi-kite-processors/src/main/resources/META-INF/services/org.apache.nifi.processor.Processor

 # See the License for the specific language governing permissions and
 # limitations under the License.
 org.apache.nifi.processors.kite.StoreInKiteDataset
+org.apache.nifi.processors.kite.ConvertAvroToParquet


I believe this is unnecessary or part of a different feature? Can this be removed from the PR?

jvwing · 2017-02-15T21:51:41Z

pom.xml

                <type>nar</type>
            </dependency>
+            <dependency>
+                <groupId>org.apache.nifi</groupId>


Should be 1.2.0-SNAPSHOT?

jvwing · 2017-02-15T21:55:45Z

...-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java

+@CapabilityDescription("Consumes a Microsoft Excel document and converts each worksheet to csv. Each sheet from the incoming Excel " +
+        "document will generate a new Flowfile that will be output from this processor. Each output Flowfile's contents will be formatted as a csv file " +
+        "where the each row from the excel sheet is output as a newline in the csv file.")
+@WritesAttributes({@WritesAttribute(attribute="sheetname", description="The name of the Excel sheet that this particular row of data came from in the Excel document"),


this particular row of data came from should be "this flowfile came from"?

jvwing · 2017-02-15T21:56:33Z

...-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java

+        "document will generate a new Flowfile that will be output from this processor. Each output Flowfile's contents will be formatted as a csv file " +
+        "where the each row from the excel sheet is output as a newline in the csv file.")
+@WritesAttributes({@WritesAttribute(attribute="sheetname", description="The name of the Excel sheet that this particular row of data came from in the Excel document"),
+        @WritesAttribute(attribute="numrows", description="The number of rows in this Excel Sheet"),


Can we clarify if this is the number of rows in the input spreadsheet, or output rows in the flowfile?

jvwing · 2017-02-16T00:06:06Z

...-processors/src/test/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessorTest.java

+    }
+
+    @Test
+    public void testProcessASpecificSheetThatDoesNotExist() throws Exception {


Should this be testProcessASpecificSheetThatDoesExist? The spreadsheet does have a "Scorecard" sheet, and the test asserts that rows are processed.

jvwing · 2017-02-16T00:06:55Z

...-processors/src/test/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessorTest.java

+    }
+
+    /**
+     * We do want to allow this to be a success relationship because if arbitrary Excel


Seems like a cliffhanger comment, I would like to read the ending :)

jvwing · 2017-02-16T00:34:29Z

@jdye64 I'm still confused by the disparity between the annotation docs and the behavior of the processor, especially with respect to multiple sheets in the incoming spreadsheet. From the annotations:

Each sheet from the incoming Excel document will generate a new Flowfile that will be output from this processor

This was not my experience, and I believe the single call to session.clone on line 143, not performed in a loop, guarantees only a single flowfile to the success relationship.

@WritesAttribute(attribute="sheetname", description="The name of the Excel sheet that this particular row of data came from in the Excel document")

But it has always been UNKNOWN for my tests.

I created a unit test to help clarify how this is supposed to work. From the docs, I believe the test below should pass, although it currently fails. Am I misunderstanding?

Also see branch
TwoSheets.xlsx

@Test
public void testMultipleSheetsGeneratesMultipleFlowFiles() throws Exception {

    testRunner.enqueue(new File("src/test/resources/TwoSheets.xlsx").toPath());
    testRunner.run();

    testRunner.assertTransferCount(ConvertExcelToCSVProcessor.SUCCESS, 2);
    testRunner.assertTransferCount(ConvertExcelToCSVProcessor.ORIGINAL, 1);
    testRunner.assertTransferCount(ConvertExcelToCSVProcessor.FAILURE, 0);

    MockFlowFile ffSheetA = testRunner.getFlowFilesForRelationship(ConvertExcelToCSVProcessor.SUCCESS).get(0);
    Long rowsSheetA = new Long(ffSheetA.getAttribute(ConvertExcelToCSVProcessor.ROW_NUM));
    assertTrue(rowsSheetA == 4l);
    assertTrue(ffSheetA.getAttribute(ConvertExcelToCSVProcessor.SHEET_NAME).equalsIgnoreCase("TestSheetA"));

    MockFlowFile ffSheetB = testRunner.getFlowFilesForRelationship(ConvertExcelToCSVProcessor.SUCCESS).get(1);
    Long rowsSheetB = new Long(ffSheetB.getAttribute(ConvertExcelToCSVProcessor.ROW_NUM));
    assertTrue(rowsSheetB == 3l);
    assertTrue(ffSheetB.getAttribute(ConvertExcelToCSVProcessor.SHEET_NAME).equalsIgnoreCase("TestSheetB"));
}

jdye64 · 2017-02-16T13:44:46Z

@jvwing thank you for pointing this out. Let me look it over right now and I'll make another commit shortly. I had this in another project and I think during the copy/paste I have made an error.

jdye64 · 2017-02-24T14:46:03Z

@jvwing I indeed had used a much earlier version of the processor. I couldn't find the other version so just ended up doing a rewrite for the most part. Thanks for hanging with me on this.

jvwing · 2017-02-25T19:55:49Z

Thanks, @jdye64, this is clearly a big improvement. I will continue reviewing the functionality. Some of the issues identified above are still in the latest commit, and should still be addressed:

Root POM references nifi-poi-bundle version 1.0-SNAPSHOT
Entry for ConvertAvroToParquet in nifi-kite-bundle
Incomplete comment at ConvertExcelToCSVProcessorTest.java:91

jdye64 · 2017-02-27T15:09:39Z

@jvwing the ConvertAvroToParquet had slipped my mind thanks for the reminder there. The POM references had been fixed in a previous fix at 474a08e#diff-e7171c7f07ed58f51e4643dc340432f8R22

I'm unsure what your preference for reviewing is but let me know if this is getting to the point where you would like for me to squash the commits or not.

jvwing · 2017-02-27T16:50:19Z

@jdye64 Thanks for the update. I'll look into why I'm seeing the POM issue. Don't worry about squashing, I'll take care of that when we're ready to merge.

jvwing · 2017-02-27T18:25:48Z

@jdye64, I am still having trouble with the root pom.xml file. The version number is still 1.0.0-SNAPSHOT on line 1006. The commit you referenced above did correct version numbers in the nifi-poi-bundle project, but not the root NiFi project POM.

How are you testing the build and the build output? I have experienced two scenarios:

1.) Clean System - Build fails at nifi-assembly, because the root pom.xml specifies nifi-poi-nar 1.0.0-SNAPSHOT, which is not in the local Maven repository:

[ERROR] Failed to execute goal on project nifi-assembly: Could not resolve dependencies for project org.apache.nifi:nifi-assembly:pom:1.2.0-SNAPSHOT: Could not find artifact org.apache.nifi:nifi-poi-nar:nar:1.0.0-SNAPSHOT in apache.snapshots (http://repository.apache.org/snapshots)

2.) Reused System - Build succeeds because the older nifi-poi-nar-1.0.0-SNAPSHOT.nar still exists in the local Maven repository, but the output assembly files contain nifi-poi-nar-1.0.0-SNAPSHOT.nar.

In neither case do I get the lastest NAR output. Would you please try deleting the contents of ~/.m2/repository/org/apache/nifi/nifi-poi* and running a full mvn clean install?

earlier and thought you were referencing /nifi-nar-bundles/pom.xml rather than /pom.xml. And your previous comment was def. correct as my build was not failing because of the previous Maven cache. I have corrected the version issue now. PS - Your a patient man =)

jvwing · 2017-02-28T05:05:19Z

Thanks, @jdye64, that got it. I was able to build just fine with contrib-check and everything. On to new topics:

License/Notice - We need a separate LICENSE and NOTICE files for nifi-poi-nar. I was reminded by the recent LICENSE/NOTICE discussion on the mailing list about what to include in NAR bundles, and it applies here. Good news, you've already put together most of the text for the nifi-assembly LICENSE/NOTICE, and there are many examples of the file layout in other NARs. I think a reasonably comparable NAR referencing another Apache project is nifi-ignite-nar.
Sheet in Filename - Output filenames all have the UNKNOWN sheet name, like 35200238025098_UNKNOWN.csv. How about adding a sheet name check to one of the unit tests?
Error Cases - What will make this fail?

jdye64 · 2017-03-15T14:14:20Z

@jvwing In regards to error cases here are a list of things which could make this fail.

Input data is incorrect. AKA not .xlsx file
Empty document. I guess this isn't an error but I think you get what I'm saying there just wouldn't be anything to do.
User could specify a list of "desired-sheets" that are not present in the document. Once again not really failure but of course no success relationships would be triggered on the original relationship.
The processor will currently only process strings (this includes numerical values, just the POI naming convention) and will ignore things like graphs, images, binary data, etc.

jvwing

I recommend including comments with the exception logging to provide some context for the exception.

jvwing · 2017-03-16T23:12:09Z

...-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java

+                    try {
+                        outputStream.write(currentContent.getBytes());
+                    } catch (IOException e) {
+                        e.printStackTrace();


Please change to logging

jvwing · 2017-03-16T23:12:56Z

...-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java

+                    try {
+                        outputStream.write(("," + currentContent).getBytes());
+                    } catch (IOException e) {
+                        e.printStackTrace();


Please change to logging

jvwing · 2017-03-16T23:13:11Z

...-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java

+                        rowCount++;
+                        outputStream.write("\n".getBytes());
+                    } catch (IOException e) {
+                        e.printStackTrace();


Please change to logging

jvwing · 2017-03-16T23:16:09Z

...-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java

+            try {
+                sheetInputStream.close();
+            } catch (IOException ioe) {
+                //nothing further can be done...


Does it make sense to catch it here if we can't do anything? How about not catching the exception and letting the handlers in onTrigger get it?

jvwing · 2017-03-16T23:17:50Z

...-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java

+            session.transfer(ff, SUCCESS);
+
+        } catch (SAXException saxE) {
+            getLogger().error(saxE.getMessage());


Please include a comment and the full exception trace, like getLogger().error("Attempt to do X failed", ex).

jvwing · 2017-03-16T23:18:06Z

...-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java

+                        parser.parse(sheetSource);
+                        sheetInputStream.close();
+                    } catch (Exception ex) {
+                        getLogger().error(ex.getMessage());


Please include a comment and the full exception trace, like getLogger().error("Attempt to do X failed", ex).

jvwing · 2017-03-16T23:18:21Z

...-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java

+            });
+
+        } catch (Exception ex) {
+            getLogger().error(ex.getMessage());


Please include a comment and the full exception trace, like getLogger().error("Attempt to do X failed", ex).

jvwing · 2017-03-16T23:18:28Z

...-poi-processors/src/main/java/org/apache/nifi/processors/poi/ConvertExcelToCSVProcessor.java

+                            }
+                        }
+                    } catch (Exception ex) {
+                        getLogger().error(ex.getMessage());


Please include a comment and the full exception trace, like getLogger().error("Attempt to do X failed", ex).

jvwing · 2017-03-16T23:33:46Z

@jdye64 Thanks, I looked into some of these error cases:

Unsupported .xls - Throws exception, routes to 'original'. The error bulletin seems a bit terse:

ConvertExcelToCSVProcessor[id=43426b41-015a-1000-b06e-3a9be79162d1] Package should contain a content type part [M1.13]

Blank Sheet - Succeeds, one empty flowfile to 'success' and the original to 'original'. No Errors. Seems correct to me.
Empty Workbook - I wasn't able to create an empty workbook manually in Excel.
Unmatched Sheet Names - Covered by testNonExistantSpecifiedSheetName, seems correct to me.
Diverse Content - I tried a number of bizarre things Excel lets you put in spreadsheets -- images, tables, pivot tables, formulas, hidden sheets, etc. Where appropriate, the processor returned content (cells in tables and pivot tables, last computed formula value), but did not run into errors. Images were ignored.
CSV-Breaking Content - The processor does not escape text to enforce a CSV structure. Sheets containing multiline text in a cell, cells with commas, etc. resulted in improperly formed CSV files. I do not object, I understand that's a significant scope increase.

I recommend that you do the following:

Clearly document that the processor only supports .xlsx and NOT .xls files. I know you've already answered questions about this on the dev list, so somebody's going to try it.
For the .xls case, would it be possible to catch the org.apache.poi.openxml4j.exceptions.InvalidFormatException, repackage that with a helpful error message suggesting that only well-formed XLSX files are accepted, and route to failure?
Document that the processor does not escape invalid CSV content.
Logging changes from the code review post above.

jvwing · 2017-03-17T04:54:13Z

Thanks for those improvements, @jdye64, I especially like the updated usage doc. Two things on the latest code:

Did you try an .xls file? There is a problem when the flowfile attribute is added in the catch block on line ~195. The NiFi framework throws an exception of it's own, because we can't do session.putAttribute inside an InputStreamCallback for the same flowfile:

java.lang.IllegalStateException: StandardFlowFileRecord[uuid=be192381-9475-4c6d-a6ca-43735e5df271,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1489713904793-1, container=default, section=1], offset=0, length=26112],offset=0,name=./conf/test-xls.xls,size=26112] already in use for an active callback or InputStream created by ProcessSession.read(FlowFile) has not been closed

Something similar happens with the session.putAttribute on ~209. As a result of these exceptions, the session is rolled back and the flowfile is returned to the input queue. I think we can throw an exception, though. So if we caught and rethrew with a different error message, it should work out.

In the failure case, we're routing the flowfile to both 'failure' and 'original'. I didn't realize it earlier, but I now believe this to be unusual in NiFi. Most processors treat failure as an exclusive route, and 'original' as part of the successful happy path. SplitAvro, SplitJson, SplitText, and UnpackContent were some examples I looked at. I doubt that's written in stone. What do you think?

I made a sample code fork with a unit test for .xls and a suggested approach to solving the IllegalStateExceptions, and the failure routing. I did not get the logging to cooperate the way I think it should, but we're not too far off.

jvwing · 2017-03-17T17:28:03Z

@jdye64, it looks good! I was just about to merge, giving the files a last once-over before I push the button... and I noticed we have a json.org dependency in the nifi-poi-processors POM:

    <dependency>
        <groupId>org.json</groupId>
        <artifactId>json</artifactId>
        <version>20160810</version>
    </dependency>

Are we using that? json.org is Category X these days, so we can't go with it. But it doesn't appear to be used, and when I deleted it I was able to build, test, and run the processor without obvious issues.

Do you know of any reason I can't just pull that out?

else I was planning on adding and instead removed and must have missed the dependency.

jvwing · 2017-03-17T17:53:24Z

Thanks again for putting this together, @jdye64.

Signed-off-by: James Wing <jvwing@gmail.com> This closes apache#929.

jdye64 added 2 commits February 14, 2017 14:28

NIFI-2613 - Apache POI processor to convert Excel documents to CSV

f463f55

Normalized poi pom version to 1.1.0-SNAPSHOT, added poi-nar to

b95b595

nidi-assembly, and added ASLv2 notice to Apache POI

updates per @jvwing suggestions

474a08e

jvwing reviewed Feb 16, 2017

View reviewed changes

Resolution for several erroneous things

e00c14c

Updates

998b0e1

Updates per @jvwing suggestions

0cf7167

jvwing reviewed Mar 16, 2017

View reviewed changes

Logging and documentation updates suggested by @jvwing

fb64799

more changes per @jvwing recommendations

29f98ea

@jvwing org.json is not required. It was left in there from something

95cd79d

else I was planning on adding and instead removed and must have missed the dependency.

asfgit closed this in d05727b Mar 17, 2017

josephxsxn pushed a commit to josephxsxn/nifi that referenced this pull request Mar 23, 2017

NIFI-2613 Apache POI processor for Excel to CSV

d5bf684

Signed-off-by: James Wing <jvwing@gmail.com> This closes apache#929.

aperepel pushed a commit to aperepel/nifi that referenced this pull request Mar 29, 2017

NIFI-2613 Apache POI processor for Excel to CSV

00a6e32

Signed-off-by: James Wing <jvwing@gmail.com> This closes apache#929.

Conversation

jdye64 commented Aug 24, 2016

Uh oh!

jvwing commented Oct 13, 2016

Uh oh!

jvwing commented Oct 13, 2016

Uh oh!

jvwing commented Oct 18, 2016

Uh oh!

jvwing commented Oct 18, 2016

Uh oh!

joewitt commented Feb 14, 2017

Uh oh!

jdye64 commented Feb 14, 2017

Uh oh!

joewitt commented Feb 14, 2017

Uh oh!

jvwing commented Feb 14, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jvwing commented Feb 16, 2017

Uh oh!

jdye64 commented Feb 16, 2017

Uh oh!

jdye64 commented Feb 24, 2017

Uh oh!

jvwing commented Feb 25, 2017

Uh oh!

jdye64 commented Feb 27, 2017

Uh oh!

jvwing commented Feb 27, 2017

Uh oh!

jvwing commented Feb 27, 2017

Uh oh!

jvwing commented Feb 28, 2017

Uh oh!

jdye64 commented Mar 15, 2017

Uh oh!

jvwing left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jvwing commented Mar 16, 2017

Uh oh!

jvwing commented Mar 17, 2017

Uh oh!

jvwing commented Mar 17, 2017

Uh oh!

jvwing commented Mar 17, 2017

Uh oh!

Reviewers

Assignees