fix for TIKA-1840 contributed by zetisam #72

Merged
asfgit merged 2 commits into apache:master from zetisam:TIKA-1840
Jan 24, 2016

@zetisam
Contributor

@zetisam zetisam commented Jan 22, 2016

No description provided.

@Gagravarr
Contributor

There look to be some slightly odd indents there. Any chance you could review http://tika.apache.org/contribute.html#Code_Formatting and fix them?

Secondly, won't this patch cause us to output all the notes twice: once with minimal content alongside the slide, and again later on when the full notes extraction runs?

Might be good to review how the XSLF (.pptx) one does it, and crib from that?

@zetisam
Contributor Author

zetisam commented Jan 22, 2016

Hi, I will fix the indentation. My editor was still set up to use tabs instead of spaces for another project.

This will indeed cause the notes to appear twice: once with the slide (as is currently also the case for PPTX), and once at the bottom. Some people may have projects that rely on the slide notes appearing at the end of the output, and I don't want to break that for them. I don't know what the project's stance is on this.

Additionally (though that may be a separate issue), the PPT output has each slide in a separate <div class="slide"> block, while the PPTX output does not. This could also be unified, but again, I don't want to break existing behavior.

@asfgit asfgit merged commit 7d43bd7 into apache:master Jan 24, 2016
asfgit pushed a commit that referenced this pull request Jan 24, 2016
@Gagravarr
Contributor

I don't think that including the text of the notes twice is good from a backwards-compatibility standpoint either: it will break some people's rendering, and it will skew anything based on text frequency.
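The duplication concern can be checked mechanically. The following is only a minimal sketch, assuming the parser output is already available as a plain String; `NotesDuplicationCheck` and `countOccurrences` are hypothetical names, not part of Tika, and the sample text is made up for illustration.

```java
// Hypothetical helper (not part of Tika): counts how often a note's text
// appears in already-extracted output, to detect duplicated notes.
final class NotesDuplicationCheck {

    /** Counts non-overlapping occurrences of needle in text. */
    static int countOccurrences(String text, String needle) {
        if (needle.isEmpty()) {
            return 0; // an empty needle would otherwise match everywhere
        }
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length(); // skip past this match
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up output: the note is emitted with its slide and again at the end.
        String extracted = String.join("\n",
                "Slide 1 title",
                "Remember to mention the budget",
                "Slide 2 title",
                "Remember to mention the budget");
        int hits = countOccurrences(extracted, "Remember to mention the budget");
        System.out.println("note occurrences: " + hits);
    }
}
```

A count above one for a note that exists only once in the presentation would indicate that the output repeats it, which is what inflates term frequencies downstream.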

I think we should decide on the "right" set of markup for identifying slides and their associated notes, then fix both PPT and PPTX to follow this, and log the change in the changelog to alert existing users. If you could review the HTML output of the PPT and PPTX parsers, and then make a suggestion for what seems sensible, that'd be great! Please post it to the dev list or TIKA-1840 so it gets enough visibility.
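As a starting point for that discussion, one possible shape is to nest each slide's notes inside that slide's own block, so the notes appear exactly once. This is a sketch only: the `<div class="slide">` wrapper comes from the existing PPT output described above, while the `slide-notes` class name, the `SlideMarkupSketch` class, and its methods are assumptions made for illustration and do not describe what either parser currently emits.

```java
// Illustrative sketch (not Tika code): renders one slide with its notes
// nested inside the slide block, so each note is emitted exactly once.
final class SlideMarkupSketch {

    // Escapes the characters that matter inside XHTML text nodes.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Renders one slide; notes may be null when the slide has none.
    static String renderSlide(String text, String notes) {
        StringBuilder sb = new StringBuilder();
        sb.append("<div class=\"slide\">");
        sb.append("<p>").append(escape(text)).append("</p>");
        if (notes != null) {
            // "slide-notes" is an assumed class name for this sketch.
            sb.append("<div class=\"slide-notes\"><p>")
              .append(escape(notes))
              .append("</p></div>");
        }
        sb.append("</div>");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(renderSlide("Q1 results", "Mention the budget"));
        System.out.println(renderSlide("Thanks", null));
    }
}
```

Keeping the notes inside the slide block would let consumers that want the old "notes at the end" layout collect them by class name, though whether that is acceptable for existing users is exactly the question for the dev list.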

@zetisam
Contributor Author

zetisam commented Jan 26, 2016

Hi Nick,

You're absolutely right. I didn't think of that scenario. I have already opened a separate issue on JIRA for the XML structure differences: TIKA-1841.
