Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML-based parsers should not load external DTDs by default #3347

Closed
lewismc opened this issue Sep 23, 2021 · 14 comments · Fixed by #3368
Closed

XML-based parsers should not load external DTDs by default #3347

lewismc opened this issue Sep 23, 2021 · 14 comments · Fixed by #3368
Assignees
Labels
🐞 bug issue is a bug 📦 rio affects the Rio RDF Parser/Writer toolkit security
Milestone

Comments

@lewismc
Copy link

lewismc commented Sep 23, 2021

Problem description

We recently received a improvement request for Any23 to optionally disable remote HTTP connections when resolving XML entities. Any23 utilizes rdf4j 3.1.2. The stack trace provided by the reporter indicates that org.eclipse.rdf4j.rio.trix.TriXParser parsing can lead to a hung thread (for about two minutes) with an open HTTP connection.
I am writing here to see if this is something we can configure in RDF4J or whether we need to go deeper into Xerces or even the SUN HttpClient. I am looking for some guidance.

Preferred solution

The test file is available for anyone interested in trying to reproduce this issue. I am looking for some guidance on where this configuration would actually be implemented. Thanks for any suggestions.

Are you interested in contributing a solution yourself?

Yes

Alternatives you've considered

Nothing yet. Apart from studying the org.eclipse.rdf4j.rio.trix.TriXParser source code this is the first port of call. Thanks for anyone who is interested in this issue.

Anything else?

No response

@lewismc lewismc added the 📶 enhancement issue is a new feature or improvement label Sep 23, 2021
@github-actions github-actions bot added this to 📥 Inbox in Project Progress Sep 23, 2021
@abrokenjester
Copy link
Contributor

I am not entirely sure what is going on here - the RDF4J XML parsers (including the TriX parser) are configured to disallow processing external entities by default (since they're a security risk). See GH-1056. So this may be an edge case of some sort that we haven't covered, or perhaps Any23 has tweaked the parser configuration to "re-allow" external entities somehow?

The RDF4J parser tweak this, by the way, by setting the appropriate feature flags on the JAXP SAXAdapter. Here's a good overview of the various settings: https://www.owasp.org/index.php/XML_External_Entity_(XXE)_Prevention_Cheat_Sheet#JAXP_DocumentBuilderFactory.2C_SAXParserFactory_and_DOM4J

The attached test file is an html file by the way, which makes it a bit tricky for me to trace what is going on - presumably Any23 extracts some RDF content from this file which it then passes on to the TriXParser to process? Is it possible for you to show what the content that the TriXParser tries to parse looks like?

@lewismc
Copy link
Author

lewismc commented Sep 26, 2021

Thanks for taking a look @jeenbroekstra

or perhaps Any23 has tweaked the parser configuration to "re-allow" external entities somehow?

I studied the Any23 source code and CANNOT find any evidence of that being the case.

presumably Any23 extracts some RDF content from this file which it then passes on to the TriXParser to process?

Yes

Is it possible for you to show what the content that the TriXParser tries to parse looks like?

Yes, I can debug the code and provide that.

@lewismc
Copy link
Author

lewismc commented Oct 4, 2021

Hi @jeenbroekstra I made slight progress. I wanted to update... I am working on this one :)

@lewismc
Copy link
Author

lewismc commented Oct 8, 2021

@jeenbroekstra should org.eclipse.rdf4j.rio.trix.TriXParser even be activated on HTML documents? I need to admit I've never used Trix before so have literally no understanding of whether the parser implementation should be activated for the HTML media type. Thanks for any insight.

@lewismc
Copy link
Author

lewismc commented Oct 8, 2021

For example, when I attempt to parse the test file I get the following partial stack trace

FATAL: 	'org.eclipse.rdf4j.rio.RDFParseException: The attribute name must be specified in the attribute-list declaration for element "charset". [line 181, column 45]
	at org.eclipse.rdf4j.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:333)
	at org.eclipse.rdf4j.rio.helpers.AbstractRDFParser.reportFatalError(AbstractRDFParser.java:724)
	at org.eclipse.rdf4j.rio.trix.TriXParser.reportFatalError(TriXParser.java:253)
	at org.eclipse.rdf4j.rio.trix.TriXParser.fatalError(TriXParser.java:419)
	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
	at org.apache.xerces.impl.XMLDTDScannerImpl.scanAttlistDecl(Unknown Source)
	at org.apache.xerces.impl.XMLDTDScannerImpl.scanDecls(Unknown Source)
	at org.apa...' 	(-1,-1)

Thanks for any help in allowing me to better understand this.

@abrokenjester
Copy link
Contributor

@jeenbroekstra should org.eclipse.rdf4j.rio.trix.TriXParser even be activated on HTML documents? I need to admit I've never used Trix before so have literally no understanding of whether the parser implementation should be activated for the HTML media type. Thanks for any insight.

The TriXParser itself is strictly limited to the TriX format, which is a structured XML format for RDF. It certainly won't be able to deal with HTML documents, and should not be used to process those directly.

I am not entirely sure how Any23 picks the parser implementation to use, but if it's using any of the utility methods in org.eclipse.rdf4j.rio.Rio for selecting a parser given a media type, I would not expect it to return the TriXParser if you input a html media type (in fact, I would expect it to return an empty result, as none of RDF4J's built-in parsers can process HTML documents)..

@abrokenjester
Copy link
Contributor

For example, when I attempt to parse the test file I get the following partial stack trace

FATAL: 	'org.eclipse.rdf4j.rio.RDFParseException: The attribute name must be specified in the attribute-list declaration for element "charset". [line 181, column 45]
	at org.eclipse.rdf4j.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:333)
	at org.eclipse.rdf4j.rio.helpers.AbstractRDFParser.reportFatalError(AbstractRDFParser.java:724)
	at org.eclipse.rdf4j.rio.trix.TriXParser.reportFatalError(TriXParser.java:253)
	at org.eclipse.rdf4j.rio.trix.TriXParser.fatalError(TriXParser.java:419)
	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
	at org.apache.xerces.impl.XMLDTDScannerImpl.scanAttlistDecl(Unknown Source)
	at org.apache.xerces.impl.XMLDTDScannerImpl.scanDecls(Unknown Source)
	at org.apa...' 	(-1,-1)

Thanks for any help in allowing me to better understand this.

That's roughly what I would expect to happen. The test file is an HTML file, not a TriX data file. The TriXParser makes an attempt to parse it as TriX format (which it can do to some extent, given that both HTML and TriX are XML-based formats) but quickly fails because it doesn't encounter any of the expected elements and attributes.

@lewismc
Copy link
Author

lewismc commented Oct 13, 2021

Hi @jeenbroekstra based on your input above, I created a more suitable unit test which DOESN'T fix a bug neither does it verify the presence of a bug. Instead it demonstrates that the TriXParser should NOT be called when processing the test file BBC_News_Scotland.html. I have yet to figure out the root of why the TriXParser was activated whilst processing the sample document... I'll keep this issue open and report back when I/we figure that out.

@lewismc
Copy link
Author

lewismc commented Oct 13, 2021

Thanks for providing the clarity :)

@lewismc
Copy link
Author

lewismc commented Oct 19, 2021

There is a mechanism whereby the Any23 constructor can be overloaded with custom configuration and one or more user-defined extractors i.e.,

Any23 any23 = new Any23(new Configuration(...), "rdf-xml");

I was able to establish (from the original reporter) that Any23 was constructed that way based on client input via a 3rd party CLI tool.

The question therefore comes down to the following.

Is it the expected behavior that the (TriXParser) parser reads a remote DTD? - even if it is called because of a misconfiguration.

I will continue working on the Any23 layer to try and prevent this from happening...

@jeenbroekstra, depending on your response, I kindly ask you to resolve this issue. Thank you for the input.

@abrokenjester
Copy link
Contributor

The TriXParser's underlying SAX2 parser (usually Xerces) should be configured, by default, to not read remote DTDs. This behavior can be overridden from the RDF4J side by tweaking the XMLParserSettings.LOAD_EXTERNAL_DTD option, or by setting the system property http://apache.org/xml/features/nonvalidating/load-external-dtd to true.

However, I've just done a quick unit test at my end and it appears there is a regression in the default settings.

Long story short: you've discovered a bug in the TriXParser, thanks! And sorry it took so long for me to cotton on.

The short-term workaround in the Any23 code is to explicitly disable loading of external DTDs on the TriXParser:

parser.getParserConfig().set(XMLParserSettings.LOAD_EXTERNAL_DTD, false);

We'll use this issue as a bug report to track a fix.

@abrokenjester abrokenjester self-assigned this Oct 20, 2021
@abrokenjester abrokenjester added 🐞 bug issue is a bug 📦 rio affects the Rio RDF Parser/Writer toolkit and removed 📶 enhancement issue is a new feature or improvement labels Oct 20, 2021
@abrokenjester abrokenjester moved this from 📥 Inbox to 📋 Backlog in Project Progress Oct 20, 2021
@abrokenjester abrokenjester added this to the 3.7.4 milestone Oct 20, 2021
@abrokenjester abrokenjester changed the title Optionally disable remote HTTP connections when resolving XML entities XML-based parsers should not load external DTDs by default Oct 20, 2021
@abrokenjester abrokenjester moved this from 📋 Backlog to 🚧 In progress in Project Progress Oct 20, 2021
@lewismc
Copy link
Author

lewismc commented Oct 20, 2021

Thank you @jeenbroekstra

@lewismc
Copy link
Author

lewismc commented Nov 4, 2021

@jeenbroekstra excellent!
Which version of rdf4j can wwe expect to see this ship in?
Thank you

@abrokenjester
Copy link
Contributor

It's scheduled for 3.7.4, which we will likely release in a week or so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐞 bug issue is a bug 📦 rio affects the Rio RDF Parser/Writer toolkit security
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

2 participants