XML-based parsers should not load external DTDs by default #3347

lewismc · 2021-09-23T18:09:43Z

Problem description

We recently received a improvement request for Any23 to optionally disable remote HTTP connections when resolving XML entities. Any23 utilizes rdf4j 3.1.2. The stack trace provided by the reporter indicates that org.eclipse.rdf4j.rio.trix.TriXParser parsing can lead to a hung thread (for about two minutes) with an open HTTP connection.
I am writing here to see if this is something we can configure in RDF4J or whether we need to go deeper into Xerces or even the SUN HttpClient. I am looking for some guidance.

Preferred solution

The test file is available for anyone interested in trying to reproduce this issue. I am looking for some guidance on where this configuration would actually be implemented. Thanks for any suggestions.

Are you interested in contributing a solution yourself?

Yes

Alternatives you've considered

Nothing yet. Apart from studying the org.eclipse.rdf4j.rio.trix.TriXParser source code this is the first port of call. Thanks for anyone who is interested in this issue.

Anything else?

No response

The text was updated successfully, but these errors were encountered:

abrokenjester · 2021-09-25T05:48:45Z

I am not entirely sure what is going on here - the RDF4J XML parsers (including the TriX parser) are configured to disallow processing external entities by default (since they're a security risk). See GH-1056. So this may be an edge case of some sort that we haven't covered, or perhaps Any23 has tweaked the parser configuration to "re-allow" external entities somehow?

The RDF4J parser tweak this, by the way, by setting the appropriate feature flags on the JAXP SAXAdapter. Here's a good overview of the various settings: https://www.owasp.org/index.php/XML_External_Entity_(XXE)_Prevention_Cheat_Sheet#JAXP_DocumentBuilderFactory.2C_SAXParserFactory_and_DOM4J

The attached test file is an html file by the way, which makes it a bit tricky for me to trace what is going on - presumably Any23 extracts some RDF content from this file which it then passes on to the TriXParser to process? Is it possible for you to show what the content that the TriXParser tries to parse looks like?

lewismc · 2021-09-26T19:21:28Z

Thanks for taking a look @jeenbroekstra

or perhaps Any23 has tweaked the parser configuration to "re-allow" external entities somehow?

I studied the Any23 source code and CANNOT find any evidence of that being the case.

presumably Any23 extracts some RDF content from this file which it then passes on to the TriXParser to process?

Yes

Is it possible for you to show what the content that the TriXParser tries to parse looks like?

Yes, I can debug the code and provide that.

lewismc · 2021-10-04T21:03:53Z

Hi @jeenbroekstra I made slight progress. I wanted to update... I am working on this one :)

lewismc · 2021-10-08T19:53:37Z

@jeenbroekstra should org.eclipse.rdf4j.rio.trix.TriXParser even be activated on HTML documents? I need to admit I've never used Trix before so have literally no understanding of whether the parser implementation should be activated for the HTML media type. Thanks for any insight.

lewismc · 2021-10-08T19:55:14Z

For example, when I attempt to parse the test file I get the following partial stack trace

FATAL: 	'org.eclipse.rdf4j.rio.RDFParseException: The attribute name must be specified in the attribute-list declaration for element "charset". [line 181, column 45]
	at org.eclipse.rdf4j.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:333)
	at org.eclipse.rdf4j.rio.helpers.AbstractRDFParser.reportFatalError(AbstractRDFParser.java:724)
	at org.eclipse.rdf4j.rio.trix.TriXParser.reportFatalError(TriXParser.java:253)
	at org.eclipse.rdf4j.rio.trix.TriXParser.fatalError(TriXParser.java:419)
	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
	at org.apache.xerces.impl.XMLDTDScannerImpl.scanAttlistDecl(Unknown Source)
	at org.apache.xerces.impl.XMLDTDScannerImpl.scanDecls(Unknown Source)
	at org.apa...' 	(-1,-1)

Thanks for any help in allowing me to better understand this.

abrokenjester · 2021-10-09T01:53:52Z

@jeenbroekstra should org.eclipse.rdf4j.rio.trix.TriXParser even be activated on HTML documents? I need to admit I've never used Trix before so have literally no understanding of whether the parser implementation should be activated for the HTML media type. Thanks for any insight.

The TriXParser itself is strictly limited to the TriX format, which is a structured XML format for RDF. It certainly won't be able to deal with HTML documents, and should not be used to process those directly.

I am not entirely sure how Any23 picks the parser implementation to use, but if it's using any of the utility methods in org.eclipse.rdf4j.rio.Rio for selecting a parser given a media type, I would not expect it to return the TriXParser if you input a html media type (in fact, I would expect it to return an empty result, as none of RDF4J's built-in parsers can process HTML documents)..

abrokenjester · 2021-10-09T02:01:25Z

For example, when I attempt to parse the test file I get the following partial stack trace

FATAL: 	'org.eclipse.rdf4j.rio.RDFParseException: The attribute name must be specified in the attribute-list declaration for element "charset". [line 181, column 45]
	at org.eclipse.rdf4j.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:333)
	at org.eclipse.rdf4j.rio.helpers.AbstractRDFParser.reportFatalError(AbstractRDFParser.java:724)
	at org.eclipse.rdf4j.rio.trix.TriXParser.reportFatalError(TriXParser.java:253)
	at org.eclipse.rdf4j.rio.trix.TriXParser.fatalError(TriXParser.java:419)
	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
	at org.apache.xerces.impl.XMLDTDScannerImpl.scanAttlistDecl(Unknown Source)
	at org.apache.xerces.impl.XMLDTDScannerImpl.scanDecls(Unknown Source)
	at org.apa...' 	(-1,-1)

Thanks for any help in allowing me to better understand this.

That's roughly what I would expect to happen. The test file is an HTML file, not a TriX data file. The TriXParser makes an attempt to parse it as TriX format (which it can do to some extent, given that both HTML and TriX are XML-based formats) but quickly fails because it doesn't encounter any of the expected elements and attributes.

lewismc · 2021-10-13T17:55:41Z

Hi @jeenbroekstra based on your input above, I created a more suitable unit test which DOESN'T fix a bug neither does it verify the presence of a bug. Instead it demonstrates that the TriXParser should NOT be called when processing the test file BBC_News_Scotland.html. I have yet to figure out the root of why the TriXParser was activated whilst processing the sample document... I'll keep this issue open and report back when I/we figure that out.

lewismc · 2021-10-13T17:55:58Z

Thanks for providing the clarity :)

lewismc · 2021-10-19T15:56:06Z

There is a mechanism whereby the Any23 constructor can be overloaded with custom configuration and one or more user-defined extractors i.e.,

Any23 any23 = new Any23(new Configuration(...), "rdf-xml");

I was able to establish (from the original reporter) that Any23 was constructed that way based on client input via a 3rd party CLI tool.

The question therefore comes down to the following.

Is it the expected behavior that the (TriXParser) parser reads a remote DTD? - even if it is called because of a misconfiguration.

I will continue working on the Any23 layer to try and prevent this from happening...

@jeenbroekstra, depending on your response, I kindly ask you to resolve this issue. Thank you for the input.

abrokenjester · 2021-10-20T07:48:03Z

The TriXParser's underlying SAX2 parser (usually Xerces) should be configured, by default, to not read remote DTDs. This behavior can be overridden from the RDF4J side by tweaking the XMLParserSettings.LOAD_EXTERNAL_DTD option, or by setting the system property http://apache.org/xml/features/nonvalidating/load-external-dtd to true.

However, I've just done a quick unit test at my end and it appears there is a regression in the default settings.

Long story short: you've discovered a bug in the TriXParser, thanks! And sorry it took so long for me to cotton on.

The short-term workaround in the Any23 code is to explicitly disable loading of external DTDs on the TriXParser:

parser.getParserConfig().set(XMLParserSettings.LOAD_EXTERNAL_DTD, false);

We'll use this issue as a bug report to track a fix.

lewismc · 2021-10-20T16:26:11Z

Thank you @jeenbroekstra

lewismc · 2021-11-04T05:05:13Z

@jeenbroekstra excellent!
Which version of rdf4j can wwe expect to see this ship in?
Thank you

abrokenjester · 2021-11-04T11:28:30Z

It's scheduled for 3.7.4, which we will likely release in a week or so.

lewismc added the 📶 enhancement issue is a new feature or improvement label Sep 23, 2021

github-actions bot added this to 📥 Inbox in Project Progress Sep 23, 2021

lewismc mentioned this issue Oct 8, 2021

ANY23-504 XML-based parsers should not load external DTDs by default apache/any23#205

Merged

abrokenjester self-assigned this Oct 20, 2021

abrokenjester added 🐞 bug issue is a bug 📦 rio affects the Rio RDF Parser/Writer toolkit and removed 📶 enhancement issue is a new feature or improvement labels Oct 20, 2021

abrokenjester moved this from 📥 Inbox to 📋 Backlog in Project Progress Oct 20, 2021

abrokenjester added this to the 3.7.4 milestone Oct 20, 2021

abrokenjester added the security label Oct 20, 2021

abrokenjester changed the title ~~Optionally disable remote HTTP connections when resolving XML entities~~ XML-based parsers should not load external DTDs by default Oct 20, 2021

abrokenjester moved this from 📋 Backlog to 🚧 In progress in Project Progress Oct 20, 2021

abrokenjester added a commit that referenced this issue Oct 20, 2021

GH-3347 make LOAD_EXTERNAL_DTD compulsory so default is set to false

309d263

abrokenjester mentioned this issue Oct 20, 2021

GH-3347 make LOAD_EXTERNAL_DTD compulsory so default is set to false #3368

Merged

5 tasks

abrokenjester added a commit that referenced this issue Oct 28, 2021

GH-3347 make LOAD_EXTERNAL_DTD compulsory so default is set to false

8f88804

abrokenjester closed this as completed in #3368 Oct 30, 2021

Project Progress automation moved this from 🚧 In progress to 🥳 Done Oct 30, 2021

abrokenjester added a commit that referenced this issue Oct 30, 2021

GH-3347 make LOAD_EXTERNAL_DTD compulsory so default is set to false

35df3bc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

XML-based parsers should not load external DTDs by default #3347

XML-based parsers should not load external DTDs by default #3347

lewismc commented Sep 23, 2021

abrokenjester commented Sep 25, 2021

lewismc commented Sep 26, 2021

lewismc commented Oct 4, 2021

lewismc commented Oct 8, 2021

lewismc commented Oct 8, 2021

abrokenjester commented Oct 9, 2021

abrokenjester commented Oct 9, 2021

lewismc commented Oct 13, 2021

lewismc commented Oct 13, 2021

lewismc commented Oct 19, 2021

abrokenjester commented Oct 20, 2021

lewismc commented Oct 20, 2021

lewismc commented Nov 4, 2021

abrokenjester commented Nov 4, 2021

XML-based parsers should not load external DTDs by default #3347

XML-based parsers should not load external DTDs by default #3347

Comments

lewismc commented Sep 23, 2021

Problem description

Preferred solution

Are you interested in contributing a solution yourself?

Alternatives you've considered

Anything else?

abrokenjester commented Sep 25, 2021

lewismc commented Sep 26, 2021

lewismc commented Oct 4, 2021

lewismc commented Oct 8, 2021

lewismc commented Oct 8, 2021

abrokenjester commented Oct 9, 2021

abrokenjester commented Oct 9, 2021

lewismc commented Oct 13, 2021

lewismc commented Oct 13, 2021

lewismc commented Oct 19, 2021

abrokenjester commented Oct 20, 2021

lewismc commented Oct 20, 2021

lewismc commented Nov 4, 2021

abrokenjester commented Nov 4, 2021