TIKA-4728 - fix xhtml in widgets #2817
Draft
tballison wants to merge 4 commits into
Pull request overview
This PR (TIKA-4728) focuses on preventing malformed XHTML output from multiple parsers/handlers by balancing SAX events (start/end document, element stack) and avoiding invalid constructs like duplicate attributes. It also introduces an optional strict XHTML validator to surface well-formedness problems at the exact offending SAX event (particularly in tests).
Changes:
- Add `StrictXHTMLValidator` and a `BasicContentHandlerFactory` option to validate XHTML well-formedness at SAX-event time; enable validation in `TikaTest` helpers.
- Fix/guard XHTML emission across several parsers and extractors (ensure `startDocument`/`endDocument`, close open tags on partial parses/write-limit/SAX failures).
- Address specific well-formedness issues such as duplicate attributes in PDF action output and wrapper HTML leakage from source-code highlighting.
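To make the first bullet concrete, here is a minimal sketch of the kind of checks a strict SAX-time validator can perform: document lifecycle, element stack balance, and duplicate attributes. It uses only the JDK's `org.xml.sax` classes. The class name `WellFormednessSketch` and its `demoCrossNesting` helper are hypothetical, and unlike the decorator described in this PR it does not forward events to a wrapped handler. It is not the `StrictXHTMLValidator` implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical illustration only; not Tika's StrictXHTMLValidator. */
class WellFormednessSketch extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();
    private boolean started;
    private boolean ended;

    @Override
    public void startDocument() throws SAXException {
        if (started) {
            throw new SAXException("startDocument called twice");
        }
        started = true;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        requireInDocument("startElement <" + qName + ">");
        // Reject a second attribute with the same qualified name.
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < atts.getLength(); i++) {
            if (!seen.add(atts.getQName(i))) {
                throw new SAXException(
                        "duplicate attribute '" + atts.getQName(i) + "' on <" + qName + ">");
            }
        }
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        requireInDocument("endElement </" + qName + ">");
        String expected = open.poll(); // null when nothing is open
        if (!qName.equals(expected)) {
            throw new SAXException("</" + qName + "> does not match open <" + expected + ">");
        }
    }

    @Override
    public void endDocument() throws SAXException {
        requireInDocument("endDocument");
        if (!open.isEmpty()) {
            throw new SAXException("endDocument with open elements: " + open);
        }
        ended = true;
    }

    private void requireInDocument(String event) throws SAXException {
        if (!started || ended) {
            throw new SAXException(event + " outside startDocument/endDocument");
        }
    }

    /** Feeds a cross-nested sequence; returns the failure message, or null if none. */
    static String demoCrossNesting() {
        WellFormednessSketch v = new WellFormednessSketch();
        AttributesImpl none = new AttributesImpl();
        try {
            v.startDocument();
            v.startElement("", "p", "p", none);
            v.startElement("", "b", "b", none);
            v.endElement("", "p", "p"); // closes <p> while <b> is still open
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        }
    }
}
```

Because the check throws from inside the event callback, the stack trace points at the parser code that emitted the bad event, which is the benefit the overview attributes to validating at SAX-event time.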
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tika-parsers/.../xliff/XLIFF12Parser.java | Explicitly starts/ends XHTML document around SAX parse. |
| tika-parsers/.../tmx/TMXParser.java | Explicitly starts/ends XHTML document around SAX parse. |
| tika-parsers/.../txt/TXTParser.java | Ensures <p> and document are closed even on upstream exceptions. |
| tika-parsers/.../pdf/PDFParserTest.java | Adds regression coverage for XHTML well-formedness when extracting actions. |
| tika-parsers/.../pdf/AbstractPDF2XHTML.java | Avoids duplicate attributes by replacing existing ones when needed. |
| tika-parsers/.../odf/OpenDocumentBodyHandler.java | Prevents cross-nested style tags around anchors / block-level structures. |
| tika-parsers/.../epub/EpubParser.java | Drains open elements after swallowed SAXExceptions; de-dupes rewritten attributes. |
| tika-parsers/.../ooxml/XSSFExcelExtractorDecorator.java | Tracks and closes pending <tr>/<td> on partial SAX failures. |
| tika-parsers/.../ooxml/SXWPFWordExtractorDecorator.java | Closes any pending XHTML opened by the body handler after exceptions. |
| tika-parsers/.../ooxml/SXSLFPowerPointExtractorDecorator.java | Closes pending XHTML after slide SAX failures to keep output balanced. |
| tika-parsers/.../ooxml/OOXMLWordAndPowerPointTextHandler.java | Resets per-paragraph state to prevent unbalanced <p> events on nested paragraphs. |
| tika-parsers/.../ooxml/OOXMLTikaBodyPartHandler.java | Adds cleanup API to close any pending table/paragraph/formatting tags. |
| tika-parsers/.../ooxml/AbstractOOXMLExtractor.java | Swallows per-part SAXExceptions to avoid breaking the enclosing <div> scope. |
| tika-parsers/.../code/SourceCodeParserTest.java | Updates expectations around wrapper attributes in highlighted output. |
| tika-parsers/.../code/SourceCodeParser.java | Drops wrapper <html>/<body> elements and skips <head> subtree to prevent XHTML pollution. |
| tika-parsers/.../prt/PRTParser.java | Adds XHTML startDocument/endDocument around extracted content. |
| tika-parsers/.../iwork/PagesContentHandler.java | Closes/reopens <p> across page boundary events to avoid unbalanced tags. |
| tika-core/src/test/java/.../TikaTest.java | Enables strict XHTML validation in core test helpers and recursive parsing helpers. |
| tika-core/src/main/java/.../StrictXHTMLValidator.java | New strict SAX decorator to validate XHTML well-formedness (stack, attrs, doc lifecycle). |
| tika-core/src/main/java/.../BasicContentHandlerFactory.java | Adds validateXHTML option to wrap produced handlers with StrictXHTMLValidator. |
| tika-core/src/main/java/.../ParsingEmbeddedDocumentExtractor.java | Ensures enclosing <div> is closed even when embedded parse throws. |
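Several rows above describe the same pattern: an enclosing element must be closed even when the work inside it throws. A minimal sketch of that pattern with a `try`/`finally` follows, using only JDK SAX types. `EnclosingDivSketch`, `EmbeddedParse`, and `parseWithinDiv` are hypothetical names for illustration and do not come from the Tika code base.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical illustration of closing an enclosing element on failure. */
class EnclosingDivSketch {

    /** Stand-in for whatever emits the inner content (assumed, not a Tika type). */
    interface EmbeddedParse {
        void run(ContentHandler handler) throws SAXException;
    }

    static void parseWithinDiv(ContentHandler handler, EmbeddedParse parse)
            throws SAXException {
        handler.startElement("", "div", "div", new AttributesImpl());
        try {
            parse.run(handler);
        } finally {
            // Runs on success and on failure, so the <div> is always balanced.
            handler.endElement("", "div", "div");
        }
    }

    /** Records the emitted tags for a parse that fails immediately. */
    static String demo() {
        StringBuilder out = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(q).append('>');
            }

            @Override
            public void endElement(String u, String l, String q) {
                out.append("</").append(q).append('>');
            }
        };
        try {
            parseWithinDiv(recorder, h -> {
                throw new SAXException("embedded parse failed");
            });
        } catch (SAXException e) {
            out.append(" [").append(e.getMessage()).append(']');
        }
        return out.toString();
    }
}
```

This only balances the outer element. If the failing inner parse left its own elements open, they would still need to be closed first, and an exception thrown from the `finally` block would replace the original one.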
Comment on lines +1540 to +1544

    // TIKA-XXXX: handleDestinationOrAction pre-populated class/type on the action div,
    // then processJavaScriptAction appended a second class/type for PDActionJavaScript
    // actions, producing a div with duplicate attributes that SAX parsers reject.
    // TikaTest.getXML wraps with StrictXHTMLValidator, so a regression makes
    // this test throw at the offending SAX event.
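The duplicate-attribute problem described in that comment can be avoided by overwriting an attribute that is already present instead of appending a second one. The sketch below shows one way to do that with the JDK's `AttributesImpl`. `AttrSketch.setOrReplace` is a hypothetical helper for illustration and is not the change made in `AbstractPDF2XHTML`.

```java
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical helper; illustrates replace-instead-of-append for SAX attributes. */
class AttrSketch {

    /** Sets name=value, overwriting an existing attribute with the same qualified name. */
    static void setOrReplace(AttributesImpl attrs, String name, String value) {
        int index = attrs.getIndex(name); // -1 when the attribute is absent
        if (index >= 0) {
            attrs.setValue(index, value);
        } else {
            attrs.addAttribute("", name, name, "CDATA", value);
        }
    }
}
```

Calling `setOrReplace(attrs, "class", ...)` twice leaves a single `class` attribute holding the second value, whereas two `addAttribute` calls would leave two entries with the same name.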
Comment on lines 79 to 81

    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata, context);
    xhtml.startDocument();
    Last5 l5 = new Last5();
     */
    public void closeAnyPending() throws SAXException {
        formattingTags.closeAll();
        if (tableCellDepth > 0) {
Comment on lines +664 to +668

    void drainOpen() throws SAXException {
        while (!openElements.isEmpty()) {
            String el = openElements.pop();
            super.endElement("", el, el);
        }
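For context, a `drainOpen` like the excerpt above needs the handler to track which elements it has opened. A self-contained sketch of such a handler follows, built on the JDK's `XMLFilterImpl` so that it runs without Tika on the classpath. `DrainingHandlerSketch` and its `demo` method are hypothetical. The sketch stores only the qualified name, as the excerpt does, so the namespace URI passed to `endElement` is always empty; a namespace-aware variant would need to remember the URI and local name as well.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical illustration of tracking and draining open elements. */
class DrainingHandlerSketch extends XMLFilterImpl {
    private final Deque<String> openElements = new ArrayDeque<>();

    DrainingHandlerSketch(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, local, qName, atts);
        openElements.push(qName); // recorded only after the downstream accepted it
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        super.endElement(uri, local, qName);
        openElements.poll();
    }

    /** Closes everything still open, innermost first. */
    void drainOpen() throws SAXException {
        while (!openElements.isEmpty()) {
            String el = openElements.pop();
            super.endElement("", el, el);
        }
    }

    /** Opens <div><p>, simulates a failure before any close, then drains. */
    static String demo() {
        StringBuilder out = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(q).append('>');
            }

            @Override
            public void endElement(String u, String l, String q) {
                out.append("</").append(q).append('>');
            }
        };
        DrainingHandlerSketch handler = new DrainingHandlerSketch(recorder);
        try {
            handler.startElement("", "div", "div", new AttributesImpl());
            handler.startElement("", "p", "p", new AttributesImpl());
            // No endElement calls arrive, as after a swallowed SAXException.
            handler.drainOpen();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Draining calls `super.endElement` rather than `this.endElement`, so the stack is modified only by the loop itself and each element is closed exactly once.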