XML-based parsers should not load external DTDs by default #3347
Comments
I am not entirely sure what is going on here. The RDF4J XML parsers (including the TriX parser) are configured to disallow processing of external entities by default, since they're a security risk. See GH-1056. So this may be an edge case of some sort that we haven't covered, or perhaps Any23 has tweaked the parser configuration to "re-allow" external entities somehow?
The RDF4J parsers do this, by the way, by setting the appropriate feature flags on the JAXP SAXAdapter. Here's a good overview of the various settings: https://www.owasp.org/index.php/XML_External_Entity_(XXE)_Prevention_Cheat_Sheet#JAXP_DocumentBuilderFactory.2C_SAXParserFactory_and_DOM4J
The attached test file is an HTML file, by the way, which makes it a bit tricky for me to trace what is going on. Presumably Any23 extracts some RDF content from this file, which it then passes on to the TriXParser to process? Could you show what the content that the TriXParser tries to parse looks like? |
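For readers unfamiliar with the feature flags mentioned above, here is a minimal sketch using plain JAXP. It is an illustration only, not RDF4J's actual code: the class name `HardenedSax` is hypothetical, the feature URIs are the standard SAX and Xerces ones covered by the OWASP cheat sheet, and it assumes a Xerces-based parser such as the one bundled with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class HardenedSax {
    // Standard SAX / Xerces feature URIs (assumes a Xerces-based JAXP parser).
    static final String EXT_GENERAL = "http://xml.org/sax/features/external-general-entities";
    static final String EXT_PARAM = "http://xml.org/sax/features/external-parameter-entities";
    static final String LOAD_DTD = "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    /** Builds a factory whose parsers skip external entities and external DTDs. */
    static SAXParserFactory newFactory() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(EXT_GENERAL, false);
            factory.setFeature(EXT_PARAM, false);
            factory.setFeature(LOAD_DTD, false);
            return factory;
        } catch (Exception e) {
            throw new IllegalStateException("Parser does not support the hardening features", e);
        }
    }

    /** Parses the document and returns its concatenated character data. */
    static String textOf(String xml) {
        StringBuilder text = new StringBuilder();
        try {
            newFactory().newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The entity points at a file that does not exist; a hardened parser never tries to open it.
        String xml = "<!DOCTYPE r [<!ENTITY x SYSTEM \"file:///nonexistent/secret.txt\">]><r>a&x;b</r>";
        System.out.println(textOf(xml));
    }
}
```

With these flags applied, the reference to the external entity is skipped rather than resolved, so only the literal text around it reaches the handler.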
Thanks for taking a look @jeenbroekstra
I studied the Any23 source code and CANNOT find any evidence of that being the case.
Yes
Yes, I can debug the code and provide that. |
Hi @jeenbroekstra I made slight progress. I wanted to update... I am working on this one :) |
@jeenbroekstra should |
For example, when I attempt to parse the test file I get the following partial stack trace
Thanks for any help in allowing me to better understand this. |
The TriXParser itself is strictly limited to the TriX format, which is a structured XML format for RDF. It certainly won't be able to deal with HTML documents, and should not be used to process those directly. I am not entirely sure how Any23 picks the parser implementation to use, but if it's using any of the utility methods in |
That's roughly what I would expect to happen. The test file is an HTML file, not a TriX data file. The TriXParser makes an attempt to parse it as TriX format (which it can do to some extent, given that both HTML and TriX are XML-based formats) but quickly fails because it doesn't encounter any of the expected elements and attributes. |
Hi @jeenbroekstra, based on your input above, I created a more suitable unit test. It neither fixes a bug nor verifies the presence of one. Instead, it demonstrates that the TriXParser should NOT be called when processing the test file BBC_News_Scotland.html. I have yet to figure out why the TriXParser was activated whilst processing the sample document... I'll keep this issue open and report back when I/we figure that out. |
Thanks for providing the clarity :) |
There is a mechanism whereby the Any23 constructor can be overloaded with custom configuration and one or more user-defined extractors i.e.,
I was able to establish (from the original reporter) that Any23 was constructed that way based on client input via a 3rd party CLI tool. The question therefore comes down to the following.
I will continue working on the Any23 layer to try and prevent this from happening... @jeenbroekstra, depending on your response, I kindly ask you to resolve this issue. Thank you for the input. |
The TriXParser's underlying SAX2 parser (usually Xerces) should be configured, by default, to not read remote DTDs. This behavior can be overridden from the RDF4J side by tweaking the XMLParserSettings.LOAD_EXTERNAL_DTD option, or by setting the corresponding system property.
However, I've just done a quick unit test at my end and it appears there is a regression in the default settings. Long story short: you've discovered a bug in the TriXParser, thanks! And sorry it took so long for me to cotton on.
The short-term workaround in the Any23 code is to explicitly disable loading of external DTDs on the TriXParser:
parser.getParserConfig().set(XMLParserSettings.LOAD_EXTERNAL_DTD, false);
We'll use this issue as a bug report to track a fix. |
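To illustrate what disabling external DTD loading means at the SAX level, here is a small sketch using only the JDK. It assumes, and this is an assumption rather than something confirmed in this thread, that the RDF4J option ends up as the Xerces feature `http://apache.org/xml/features/nonvalidating/load-external-dtd` on the underlying reader. The class name `ExternalDtdProbe` and the `example.invalid` URL are hypothetical. The entity resolver records each lookup and answers with an empty document, so nothing touches the network.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical probe class, for illustration only.
class ExternalDtdProbe {
    // Xerces feature that controls loading of the external DTD subset in non-validating mode.
    static final String LOAD_EXTERNAL_DTD = "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    /** Returns the system IDs the parser asked to resolve while reading the document. */
    static List<String> requestedDtds(String xml, boolean loadExternalDtd) {
        List<String> requested = new ArrayList<>();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(LOAD_EXTERNAL_DTD, loadExternalDtd);
            // Record every lookup and return an empty source instead of opening a connection.
            reader.setEntityResolver((publicId, systemId) -> {
                requested.add(systemId);
                return new InputSource(new StringReader(""));
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return requested;
    }

    public static void main(String[] args) {
        // Placeholder DOCTYPE pointing at a remote DTD, similar in shape to what an HTML page declares.
        String xml = "<!DOCTYPE html SYSTEM \"http://example.invalid/remote.dtd\"><html/>";
        System.out.println("feature enabled:  " + requestedDtds(xml, true));
        System.out.println("feature disabled: " + requestedDtds(xml, false));
    }
}
```

With the feature enabled, the parser asks for the remote DTD, which is where an unresponsive server could hold a thread on an open HTTP connection. With it disabled, no lookup is requested at all.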
Thank you @jeenbroekstra |
@jeenbroekstra excellent! |
It's scheduled for 3.7.4, which we will likely release in a week or so. |
Problem description
We recently received an improvement request for Any23 to optionally disable remote HTTP connections when resolving XML entities. Any23 utilizes RDF4J 3.1.2. The stack trace provided by the reporter indicates that org.eclipse.rdf4j.rio.trix.TriXParser parsing can lead to a hung thread (for about two minutes) with an open HTTP connection. I am writing here to see if this is something we can configure in RDF4J or whether we need to go deeper into Xerces or even the SUN HttpClient. I am looking for some guidance.
Preferred solution
The test file is available for anyone interested in trying to reproduce this issue. I am looking for some guidance on where this configuration would actually be implemented. Thanks for any suggestions.
Are you interested in contributing a solution yourself?
Yes
Alternatives you've considered
Nothing yet. Apart from studying the org.eclipse.rdf4j.rio.trix.TriXParser source code, this is the first port of call. Thanks to anyone who is interested in this issue.
Anything else?
No response