TIKA-4728 - fix xhtml in widgets#2817

Draft
tballison wants to merge 4 commits into main from TIKA-4728-js-in-pdf

@tballison
Contributor

Thanks for your contribution to Apache Tika! Your help is appreciated!

Before opening the pull request, please verify that

  • there is an open issue on the Tika issue tracker which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
  • the issue ID (TIKA-XXXX)
    • is referenced in the title of the pull request
    • and placed in front of your commit messages surrounded by square brackets ([TIKA-XXXX] Issue or pull request title)
  • commits are squashed into a single one (or few commits for larger changes)
  • Tika is successfully built and unit tests pass by running ./mvnw clean test
  • there should be no conflicts when merging the pull request branch into the recent main branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled main branch
  • if you add a new module that downstream users will depend on, add it to the relevant group in tika-bom/pom.xml.

We will be able to integrate your pull request faster if these conditions are met. If you have any questions about how to fix your problem or about using Tika in general, please sign up for the Tika mailing list. Thanks!

@tballison tballison marked this pull request as draft May 14, 2026 16:38
Contributor

Copilot AI left a comment

Pull request overview

This PR (TIKA-4728) focuses on preventing malformed XHTML output from multiple parsers/handlers by balancing SAX events (start/end document, element stack) and avoiding invalid constructs like duplicate attributes. It also introduces an optional strict XHTML validator to surface well-formedness problems at the exact offending SAX event (particularly in tests).

Changes:

  • Add StrictXHTMLValidator and a BasicContentHandlerFactory option to validate XHTML well-formedness at SAX-event time; enable validation in TikaTest helpers.
  • Fix/guard XHTML emission across several parsers and extractors (ensure startDocument/endDocument, close open tags on partial parses/write-limit/SAX failures).
  • Address specific well-formedness issues such as duplicate attributes in PDF action output and wrapper HTML leakage from source-code highlighting.
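
To make the idea of validating at SAX-event time concrete, here is a minimal sketch of a handler that throws at the first offending event. `WellFormednessChecker` is a hypothetical name built only on JDK `org.xml.sax` classes; it is not the `StrictXHTMLValidator` added in this PR, whose actual API and checks may differ.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the validator class from this PR.
class WellFormednessChecker extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();
    private boolean started;
    private boolean ended;

    @Override
    public void startDocument() throws SAXException {
        if (started) {
            throw new SAXException("startDocument called more than once");
        }
        started = true;
    }

    @Override
    public void endDocument() throws SAXException {
        requireOpenDocument("endDocument");
        if (!open.isEmpty()) {
            throw new SAXException("endDocument with unclosed <" + open.peek() + ">");
        }
        ended = true;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        requireOpenDocument("startElement <" + qName + ">");
        // Reject duplicate attribute names on a single element.
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < atts.getLength(); i++) {
            if (!seen.add(atts.getQName(i))) {
                throw new SAXException(
                        "duplicate attribute '" + atts.getQName(i) + "' on <" + qName + ">");
            }
        }
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        requireOpenDocument("endElement </" + qName + ">");
        if (open.isEmpty()) {
            throw new SAXException("endElement </" + qName + "> with no open element");
        }
        String expected = open.pop();
        if (!expected.equals(qName)) {
            throw new SAXException("expected </" + expected + "> but got </" + qName + ">");
        }
    }

    // Events are only legal between startDocument and endDocument.
    private void requireOpenDocument(String event) throws SAXException {
        if (!started || ended) {
            throw new SAXException(event + " outside startDocument/endDocument");
        }
    }
}
```

Because the exception is raised inside the event callback, the stack trace points at the parser code that emitted the bad event rather than at a later serialization or re-parse step.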

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tika-parsers/.../xliff/XLIFF12Parser.java | Explicitly starts/ends XHTML document around SAX parse. |
| tika-parsers/.../tmx/TMXParser.java | Explicitly starts/ends XHTML document around SAX parse. |
| tika-parsers/.../txt/TXTParser.java | Ensures `<p>` and document are closed even on upstream exceptions. |
| tika-parsers/.../pdf/PDFParserTest.java | Adds regression coverage for XHTML well-formedness when extracting actions. |
| tika-parsers/.../pdf/AbstractPDF2XHTML.java | Avoids duplicate attributes by replacing existing ones when needed. |
| tika-parsers/.../odf/OpenDocumentBodyHandler.java | Prevents cross-nested style tags around anchors / block-level structures. |
| tika-parsers/.../epub/EpubParser.java | Drains open elements after swallowed SAXExceptions; de-dupes rewritten attributes. |
| tika-parsers/.../ooxml/XSSFExcelExtractorDecorator.java | Tracks and closes pending `<tr>`/`<td>` on partial SAX failures. |
| tika-parsers/.../ooxml/SXWPFWordExtractorDecorator.java | Closes any pending XHTML opened by the body handler after exceptions. |
| tika-parsers/.../ooxml/SXSLFPowerPointExtractorDecorator.java | Closes pending XHTML after slide SAX failures to keep output balanced. |
| tika-parsers/.../ooxml/OOXMLWordAndPowerPointTextHandler.java | Resets per-paragraph state to prevent unbalanced `<p>` events on nested paragraphs. |
| tika-parsers/.../ooxml/OOXMLTikaBodyPartHandler.java | Adds cleanup API to close any pending table/paragraph/formatting tags. |
| tika-parsers/.../ooxml/AbstractOOXMLExtractor.java | Swallows per-part SAXExceptions to avoid breaking the enclosing `<div>` scope. |
| tika-parsers/.../code/SourceCodeParserTest.java | Updates expectations around wrapper attributes in highlighted output. |
| tika-parsers/.../code/SourceCodeParser.java | Drops wrapper `<html>`/`<body>` elements and skips `<head>` subtree to prevent XHTML pollution. |
| tika-parsers/.../prt/PRTParser.java | Adds XHTML startDocument/endDocument around extracted content. |
| tika-parsers/.../iwork/PagesContentHandler.java | Closes/reopens `<p>` across page boundary events to avoid unbalanced tags. |
| tika-core/src/test/java/.../TikaTest.java | Enables strict XHTML validation in core test helpers and recursive parsing helpers. |
| tika-core/src/main/java/.../StrictXHTMLValidator.java | New strict SAX decorator to validate XHTML well-formedness (stack, attrs, doc lifecycle). |
| tika-core/src/main/java/.../BasicContentHandlerFactory.java | Adds validateXHTML option to wrap produced handlers with StrictXHTMLValidator. |
| tika-core/src/main/java/.../ParsingEmbeddedDocumentExtractor.java | Ensures enclosing `<div>` is closed even when embedded parse throws. |
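
Several of the entries above describe the same pattern: close whatever was opened even when reading the input fails. A minimal sketch of that pattern with plain `try`/`finally` follows; `BalancedEmitter` and `emitParagraph` are hypothetical names, and the code uses only JDK SAX classes rather than Tika's `XHTMLContentHandler`.

```java
import java.io.IOException;
import java.io.Reader;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration; not code from the parsers touched by this PR.
class BalancedEmitter {
    static void emitParagraph(Reader text, ContentHandler out)
            throws IOException, SAXException {
        out.startDocument();
        try {
            out.startElement("", "p", "p", new AttributesImpl());
            try {
                char[] buf = new char[4096];
                int n;
                while ((n = text.read(buf)) != -1) {
                    out.characters(buf, 0, n);
                }
            } finally {
                // Runs even if the reader throws, so <p> is always closed.
                out.endElement("", "p", "p");
            }
        } finally {
            // Runs even if anything above throws, so the document is always ended.
            out.endDocument();
        }
    }
}
```

One caveat of this sketch: if the downstream handler itself is the source of the failure (for example, a write limit was reached), the closing calls in `finally` may throw again and mask the original exception, so real code may need to guard them.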

Comment on lines +1540 to +1544
// TIKA-XXXX: handleDestinationOrAction pre-populated class/type on the action div,
// then processJavaScriptAction appended a second class/type for PDActionJavaScript
// actions, producing a div with duplicate attributes that SAX parsers reject.
// TikaTest.getXML wraps with StrictXHTMLValidator, so a regression makes
// this test throw at the offending SAX event.
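
The duplicate-attribute problem described in that comment can be avoided by replacing an existing attribute instead of appending a second one. A minimal sketch, assuming the attributes are held in a JDK `AttributesImpl`; `XhtmlAttributes.setOrReplace` is a hypothetical helper, not the method used in `AbstractPDF2XHTML`.

```java
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper for illustration only.
class XhtmlAttributes {
    // Sets the attribute, overwriting any existing value with the same qualified name.
    static void setOrReplace(AttributesImpl atts, String name, String value) {
        int idx = atts.getIndex(name); // -1 when the attribute is absent
        if (idx >= 0) {
            atts.setValue(idx, value);
        } else {
            atts.addAttribute("", name, name, "CDATA", value);
        }
    }
}
```

`AttributesImpl.addAttribute` does not check for existing names, which is why appending blindly can produce an element that a conforming XML parser would reject.
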
Comment on lines 79 to 81
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata, context);
xhtml.startDocument();
Last5 l5 = new Last5();
*/
public void closeAnyPending() throws SAXException {
    formattingTags.closeAll();
    if (tableCellDepth > 0) {
Comment on lines +664 to +668
void drainOpen() throws SAXException {
    while (!openElements.isEmpty()) {
        String el = openElements.pop();
        super.endElement("", el, el);
    }
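
For context, a drain method like the one excerpted above only works if something maintains the stack of open elements. Here is a minimal self-contained sketch of such a tracking decorator. `OpenElementTracker` is a hypothetical name, and it extends the JDK's `XMLFilterImpl` as a stand-in for whatever decorator base class the PR actually uses.

```java
import java.util.ArrayDeque;
import java.util.Deque;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical illustration; the class in the PR may track state differently.
class OpenElementTracker extends XMLFilterImpl {
    private final Deque<String> openElements = new ArrayDeque<>();

    OpenElementTracker(ContentHandler downstream) {
        setContentHandler(downstream); // XMLFilterImpl forwards events to this handler
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, localName, qName, atts);
        openElements.push(qName); // recorded only after the downstream accepted it
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        super.endElement(uri, localName, qName);
        if (!openElements.isEmpty()) {
            openElements.pop();
        }
    }

    // Closes every element still open, innermost first.
    void drainOpen() throws SAXException {
        while (!openElements.isEmpty()) {
            String el = openElements.pop();
            super.endElement("", el, el);
        }
    }
}
```

A caller that swallows a `SAXException` mid-parse can then invoke `drainOpen()` before continuing, so the downstream handler sees a balanced sequence of start and end events.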
2 participants