NIFI-2613 - Apache POI processor to convert Excel documents to CSV #929
jdye64 wants to merge 10 commits into apache:master from jdye64:NIFI-2613
Conversation
|
Reviewing |
|
@jdye64 I have a couple of requests from a quick scan. Would it be possible to...
|
|
@jdye64 I still had trouble building and running NiFi with this PR.
Would you please try running |
|
I have a few more comments from running the processor and reviewing the code: Processor Annotations
Properties
Exception Handling and Logging |
nifi-assembly, and added ASLv2 notice to Apache POI
|
@jvwing wow, sorry it's been such a long time since my last response. I've tried to go through and make all of the cleanup suggestions you made. Yes, the intention is a single output per sheet in the Excel file. Keep in mind that if the spreadsheet is in .xls rather than .xlsx format, you would see the behavior you are experiencing. Can you attach the Excel doc you were testing with? I made the change to the case of the attribute. I also adjusted the error handling to prevent the end user from being pummeled with a full stack trace. |
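To illustrate the .xls versus .xlsx distinction mentioned above, here is a minimal sketch that guesses the container format from the first bytes of a file. It relies only on two well-known signatures: legacy .xls files are OLE2 compound documents, and .xlsx files are ZIP archives. The class `ExcelFormatSniffer` and its `Format` enum are hypothetical names for this example and are not part of the PR. Note that a ZIP signature only indicates "some ZIP file", not necessarily a valid .xlsx workbook.

```java
import java.util.Arrays;

/** Hypothetical helper (not from the PR): guesses the Excel container format from leading bytes. */
final class ExcelFormatSniffer {

    enum Format { XLS_OLE2, XLSX_ZIP, UNKNOWN }

    // OLE2 compound document signature, used by legacy .xls files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature; .xlsx files are ZIP archives.
    private static final byte[] ZIP_MAGIC = {'P', 'K', 0x03, 0x04};

    /** Returns the format suggested by the first bytes of the file content. */
    static Format detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Format.XLS_OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Format.XLSX_ZIP;
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data != null
            && data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A processor that only supports .xlsx could use a check along these lines to route other input to a failure relationship early, rather than failing later inside the parser.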
|
@jvwing you ok continuing to run with this one for the review? |
| Apache Kafka | ||
| Copyright 2012 The Apache Software Foundation. | ||
|
|
||
| (ASLv2) Apache POI |
Thanks for adding the notice info.
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| org.apache.nifi.processors.kite.StoreInKiteDataset | ||
| org.apache.nifi.processors.kite.ConvertAvroToParquet |
I believe this is unnecessary or part of a different feature? Can this be removed from the PR?
| <type>nar</type> | ||
| </dependency> | ||
| <dependency> | ||
| <groupId>org.apache.nifi</groupId> |
| @CapabilityDescription("Consumes a Microsoft Excel document and converts each worksheet to csv. Each sheet from the incoming Excel " + | ||
| "document will generate a new Flowfile that will be output from this processor. Each output Flowfile's contents will be formatted as a csv file " + | ||
| "where the each row from the excel sheet is output as a newline in the csv file.") | ||
| @WritesAttributes({@WritesAttribute(attribute="sheetname", description="The name of the Excel sheet that this particular row of data came from in the Excel document"), |
"this particular row of data came from" should be "this flowfile came from"?
| "document will generate a new Flowfile that will be output from this processor. Each output Flowfile's contents will be formatted as a csv file " + | ||
| "where the each row from the excel sheet is output as a newline in the csv file.") | ||
| @WritesAttributes({@WritesAttribute(attribute="sheetname", description="The name of the Excel sheet that this particular row of data came from in the Excel document"), | ||
| @WritesAttribute(attribute="numrows", description="The number of rows in this Excel Sheet"), |
Can we clarify if this is the number of rows in the input spreadsheet, or output rows in the flowfile?
| } | ||
|
|
||
| @Test | ||
| public void testProcessASpecificSheetThatDoesNotExist() throws Exception { |
Should this be testProcessASpecificSheetThatDoesExist? The spreadsheet does have a "Scorecard" sheet, and the test asserts that rows are processed.
| } | ||
|
|
||
| /** | ||
| * We do want to allow this to be a success relationship because if arbitrary Excel |
Seems like a cliffhanger comment, I would like to read the ending :)
|
@jdye64 I'm still confused by the disparity between the annotation docs and the behavior of the processor, especially with respect to multiple sheets in the incoming spreadsheet. From the annotations:
This was not my experience, and I believe the single call to
But it has always been I created a unit test to help clarify how this is supposed to work. From the docs, I believe the test below should pass, although it currently fails. Am I misunderstanding? Also see branch |
|
@jvwing thank you for pointing this out. Let me look it over right now and I'll make another commit shortly. I had this in another project and I think during the copy/paste I have made an error. |
|
@jvwing I indeed had used a much earlier version of the processor. I couldn't find the other version so just ended up doing a rewrite for the most part. Thanks for hanging with me on this. |
|
Thanks, @jdye64, this is clearly a big improvement. I will continue reviewing the functionality. Some of the issues identified above are still in the latest commit, and should still be addressed:
|
|
@jvwing the I'm unsure what your preference for reviewing is but let me know if this is getting to the point where you would like for me to squash the commits or not. |
|
@jdye64 Thanks for the update. I'll look into why I'm seeing the POM issue. Don't worry about squashing, I'll take care of that when we're ready to merge. |
|
@jdye64, I am still having trouble with the root How are you testing the build and the build output? I have experienced two scenarios: 1.) Clean System - Build fails at nifi-assembly, because the root pom.xml specifies nifi-poi-nar 1.0.0-SNAPSHOT, which is not in the local Maven repository:
2.) Reused System - Build succeeds because the older nifi-poi-nar-1.0.0-SNAPSHOT.nar still exists in the local Maven repository, but the output assembly files contain nifi-poi-nar-1.0.0-SNAPSHOT.nar. In neither case do I get the latest NAR output. Would you please try deleting the contents of |
earlier and thought you were referencing /nifi-nar-bundles/pom.xml rather than /pom.xml. And your previous comment was def. correct, as my build was not failing because of the previous Maven cache. I have corrected the version issue now. PS - You're a patient man =)
|
Thanks, @jdye64, that got it. I was able to build just fine with contrib-check and everything. On to new topics:
|
|
@jvwing In regard to error cases, here is a list of things that could make this fail.
|
jvwing left a comment:
I recommend including comments with the exception logging to provide some context for the exception.
| try { | ||
| outputStream.write(currentContent.getBytes()); | ||
| } catch (IOException e) { | ||
| e.printStackTrace(); |
| try { | ||
| outputStream.write(("," + currentContent).getBytes()); | ||
| } catch (IOException e) { | ||
| e.printStackTrace(); |
| rowCount++; | ||
| outputStream.write("\n".getBytes()); | ||
| } catch (IOException e) { | ||
| e.printStackTrace(); |
| try { | ||
| sheetInputStream.close(); | ||
| } catch (IOException ioe) { | ||
| //nothing further can be done... |
Does it make sense to catch it here if we can't do anything? How about not catching the exception and letting the handlers in onTrigger get it?
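As a minimal sketch of that suggestion, the example below closes the stream with try-with-resources and declares `throws IOException` instead of swallowing the exception, so a single handler in the caller deals with it. The class `SheetCloseExample` and the methods `countBytes` and `process` are hypothetical stand-ins; `process` plays the role of the handlers in `onTrigger`, and the returned strings stand in for routing to relationships.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch (not from the PR): let close() failures propagate to one handler. */
final class SheetCloseExample {

    /**
     * Reads the stream to the end and returns the number of bytes read.
     * try-with-resources closes the stream; any IOException from read() or
     * close() propagates to the caller instead of being swallowed here.
     */
    static int countBytes(InputStream sheetInputStream) throws IOException {
        try (InputStream in = sheetInputStream) {
            int count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        }
    }

    /** Stand-in for the single handler in onTrigger: one place decides what failure means. */
    static String process(InputStream sheetInputStream) {
        try {
            countBytes(sheetInputStream);
            return "success";
        } catch (IOException e) {
            return "failure: " + e.getMessage();
        }
    }
}
```

With this shape, the empty `catch` block around `close()` is no longer needed, and a failed close is reported the same way as any other I/O failure.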
| session.transfer(ff, SUCCESS); | ||
|
|
||
| } catch (SAXException saxE) { | ||
| getLogger().error(saxE.getMessage()); |
Please include a comment and the full exception trace, like getLogger().error("Attempt to do X failed", ex).
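The pattern requested here can be sketched with `java.util.logging` as a stand-in for the processor's logger, so that the example runs without NiFi on the classpath. The class `ContextualLogging`, the method `parseSheet`, and the logger name are hypothetical. The point is only the shape of the call: a message that says what was being attempted, plus the exception object itself so the stack trace is preserved.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch (not from the PR): log a contextual message together with the exception. */
final class ContextualLogging {

    // Stand-in for the processor's logger; the name is an assumption for this example.
    private static final Logger LOG = Logger.getLogger("ExcelToCsvExample");

    /** Runs the given parse step and logs any runtime failure with context and the full throwable. */
    static void parseSheet(String sheetName, Runnable parseStep) {
        try {
            parseStep.run();
        } catch (RuntimeException ex) {
            // Logging only ex.getMessage() would lose the stack trace and may even be null.
            // Passing the exception as the last argument keeps the full trace.
            LOG.log(Level.SEVERE, "Failed to parse sheet '" + sheetName + "'", ex);
        }
    }
}
```

The same idea applies to each of the `getLogger().error(ex.getMessage())` calls flagged in this review: replace the bare message with a description of the failed step and pass the exception along.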
| parser.parse(sheetSource); | ||
| sheetInputStream.close(); | ||
| } catch (Exception ex) { | ||
| getLogger().error(ex.getMessage()); |
Please include a comment and the full exception trace, like getLogger().error("Attempt to do X failed", ex).
| }); | ||
|
|
||
| } catch (Exception ex) { | ||
| getLogger().error(ex.getMessage()); |
Please include a comment and the full exception trace, like getLogger().error("Attempt to do X failed", ex).
| } | ||
| } | ||
| } catch (Exception ex) { | ||
| getLogger().error(ex.getMessage()); |
Please include a comment and the full exception trace, like getLogger().error("Attempt to do X failed", ex).
|
@jdye64 Thanks, I looked into some of these error cases:
I recommend that you do the following:
|
|
Thanks for those improvements, @jdye64, I especially like the updated usage doc. Two things on the latest code:
Something similar happens with the session.putAttribute on ~209. As a result of these exceptions, the session is rolled back and the flowfile is returned to the input queue. I think we can throw an exception, though. So if we caught and rethrew with a different error message, it should work out.
I made a sample code fork with a unit test for .xls and a suggested approach to solving the IllegalStateExceptions, and the failure routing. I did not get the logging to cooperate the way I think it should, but we're not too far off. |
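The "catch and rethrow with a different error message" idea can be sketched as follows. This is not the code from the sample fork; `RethrowWithContext`, `SheetProcessingException`, and `writeRow` are hypothetical names, and the `Runnable` stands in for a call such as `session.putAttribute`. The original exception is kept as the cause so that nothing is lost when the outer handler logs it.

```java
/** Hypothetical sketch (not from the PR or the fork): wrap a low-level failure with context. */
final class RethrowWithContext {

    /** Hypothetical unchecked exception type carrying a clearer message and the original cause. */
    static final class SheetProcessingException extends RuntimeException {
        SheetProcessingException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Runs a write step; an IllegalStateException is rethrown with sheet-level context. */
    static void writeRow(String sheetName, Runnable writeStep) {
        try {
            writeStep.run();
        } catch (IllegalStateException ise) {
            throw new SheetProcessingException(
                "Could not write output for sheet '" + sheetName + "'", ise);
        }
    }
}
```

In the processor itself, NiFi's `ProcessException` would probably be the natural type to throw in place of the hypothetical `SheetProcessingException`, with the outer handler deciding whether to route the flowfile to failure.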
|
@jdye64, it looks good! I was just about to merge, giving the files a last once-over before I push the button... and I noticed we have a json.org dependency in the nifi-poi-processors POM: Are we using that? json.org is Category X these days, so we can't go with it. But it doesn't appear to be used, and when I deleted it I was able to build, test, and run the processor without obvious issues. Do you know of any reason I can't just pull that out? |
else I was planning on adding and instead removed and must have missed the dependency.
|
Thanks again for putting this together, @jdye64. |
Signed-off-by: James Wing <jvwing@gmail.com> This closes apache#929.