fix for TIKA-1840 contributed by zetisam#72
asfgit merged 2 commits into apache:master from zetisam:TIKA-1840
|
Looks to be some slightly odd indents there; any chance you could review http://tika.apache.org/contribute.html#Code_Formatting and fix? Secondly, won't this patch cause us to get all the notes twice: once with minimal stuff by the slide, and again later on when the full notes extraction runs? Might be good to review how the XSLF (.pptx) one does it, and crib from that?
|
Hi, I will fix the indentation. My editor was still set up to use tabs instead of spaces for another project. This will indeed cause the notes to appear twice: once with the slide (as is currently also the case for PPTX), and once at the bottom. Some people may have projects that rely on the slide notes being at the end of the output, and I don't want to break functionality for them. I don't know what the project's stance is on this. Additionally (but that's maybe another issue), the PPT output has each slide in a separate
|
I don't think that including the text of the notes twice is good from the backwards compatibility standpoint either: it will mess up some people's rendering, along with text frequency stuff. I think we should decide on the "right" set of markup for identifying slides and their associated notes, then fix both PPT and PPTX to follow it, and log the change in the changelog to alert existing users. If you could review the HTML output of the PPT and PPTX parsers, and then make a suggestion for what seems sensible, that'd be great! Please post it to the dev list or TIKA-1840 so it gets enough visibility.
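To illustrate the text frequency concern, here is a minimal sketch in plain Java. It does not use Tika. The class name `NotesDuplicationDemo`, the helper `termFrequencies`, and the sample slide and notes strings are all hypothetical, made up only to show how emitting the notes both by the slide and again at the end would double the count of every word in the notes.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical demo class; not part of Tika.
class NotesDuplicationDemo {

    // Count how often each lower-cased word occurs in the extracted text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample content standing in for parser output.
        String slide = "Quarterly results";
        String notes = "Mention the budget overrun";

        // Notes emitted once, next to their slide.
        String notesOnce = slide + "\n" + notes;
        // Notes emitted by the slide and again at the end of the document.
        String notesTwice = slide + "\n" + notes + "\n" + notes;

        System.out.println(termFrequencies(notesOnce).get("budget"));  // prints 1
        System.out.println(termFrequencies(notesTwice).get("budget")); // prints 2
    }
}
```

Any consumer that ranks or weights terms from the extracted text would see the notes counted double in the second case, while the slide text itself is unaffected.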
|
Hi Nick, You're absolutely right. I didn't think of that scenario. I already made a separate issue on JIRA for the XML structure differences: TIKA-1841
No description provided.