-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Utilize htmlentities gem for decoding/encoding #65
Conversation
I looked into this in the past, in fact it was one of my original ideas to use this Gem (see #49 (comment)). At that time I decided not to use it due to the sheer complexity of the library and the massive performance overhead it introduces. As an example, take the benchmark
Applying your patch and re-running the benchmark in turn results in the following:
In other words, the change makes Oga's lexer about 1,6 times slower than before. At least memory usage is still consistent (measured using Looking at all the things the htmlentities Gem is doing this isn't entirely surprising it slows things down. For example:
For the last example I sure as hell hope those Considering I wanted the next release to be faster and not slower I'm not a huge fan of this PR. As an aside, there are two other things worth mentioning. First, the usage of Second, To summarize, if there was no extra overhead attached to this library I wouldn't really have any problems with it. I don't want to slow Oga down for everybody by default, especially if the feature in question is one I myself don't really rely on. Performance wise my goal is to get Oga running at about ~15 MB of XML per second (both lexing and parsing) with the theoretical maximum being 20MB at the moment (due to Ruby's baseline overhead). Until I've achieved that (or something close to it) I can't afford to add anything that slows Oga down. I'll keep this PR open for the time being as I might merge it given I can achieve the above. |
Worth mentioning: the impact of this is huge due to HTML/XML entity decoding happening for every |
I added the two notes mentioned above to the contributing guide in 81c49b5. |
9d742a4
to
833df97
Compare
2011aee
to
b0ec657
Compare
b0ec657
to
59dbe4e
Compare
Hey, thank you for your time you're spending on this issue. I fully understand your concerns and appreciate that. The reason why I'm working on this is that I want people to be able to use oga as a sax-machine handler. Actually I've already released a new version of it with oga support. There is another library, called feedjira which uses sax-machine and I believe it has users who could benefit from using oga as underlying handler. So, while sax-machine specs are passing w/ oga, there are failing specs on feedjira repo and all of them are related to this issue. Please take a look on this Travis build for more details. Frankly speaking, current solution with html entities is not standard-compliant, so I'd consider it as a bug. Currently people can't rely on oga for correct HTML entities decoding/encoding, and this is quite bad. Sure, it works if they don't care about HTML inside the XML they parse. But are you sure that 100% of your users do that? Ok, back to the issue. Here are
After optimization with
You were right, we can benefit in speed greatly by using this simple trick! After removing global instance:
This one slows things a bit comparing to the global instance option. However, it appears only about 4-5% slower than master. Let me know what you think! |
Also, running |
Checking if As a start we should add a benchmark for lexing data with XML and HTML entities, but I'm not sure how to get a fixture of a decent size (10MB) for this. The current XML fixture was generated using xmlgen but that tool has no option to randomly include entities as far as I know. An alternative to using the htmlentities Gem would be to simply rip the entity mapping from some website and use that similar to the existing entities (but only for HTML documents). This will still slow things down, but hopefully it's less than 1,5 times. |
I wonder if this is something that makes sense to do with native code, in C and Java. HTML entity encoding/decoding is a very self-contained task, and a common enough one that there are probably high performance C and Java solutions. I don't know how/if that would fit into oga's codebase though. |
@jrochkind It's certainly possible, but it would have to be duplicated for both C and Java. An alternative would be to define Ragel rules for every entity and stitch things back together in Ruby, I however doubt this is going to be that much faster. |
@yorickpeterse, what do you mean by that? Why is it not used only for HTML documents now? I agree with other suggestions. I'll try to do something in that regard this week. |
What I meant is that HTML entity encoding/decoding should only occur in HTML documents, not XML documents. |
I must disagree with that. The case I'm talking about is HTML data embedded (in encoded format) in XML document. If we change decoding occur only in HTML documents we'll loose even already existing basic support. Am I missing something? |
Right, C and Java. "translating HTML entities" is a fairly well-defined problem, so I'm suspecting there is a high performance solution for both C and Java already existing, which both do the same thing. But I could be wrong. |
@krasnoukhov For HTML data embedded in an XML document, you should not be expecting the parser to decode HTML entities. The parser has no way of knowing that that data is HTML. The parser's job is giving back to you the exact string literal that was in the XML text node (with XML escaping undone, the literal text not the XML serialization), no more no less. If you as the consumer know that that data is actually HTML, and may have HTML entities that need to be decoded -- you can use the htmlentities gem to decode it yourself, pretty easily. If the parser were decoding HTML entities found in text nodes of an XML document, I would consider that a bug. |
@krasnoukhov XML only has 5 pre-defined entities listed here http://www.w3.org/TR/REC-xml/#sec-predefined-ent, of which apparently I've forgotten to add 2 to Oga (see #66). Oga is not going to decode/encode HTML entities in XML documents. However, encoding/decoding HTML entities in HTML documents is a different story. @jrochkind There being a library for it is one thing, it being usable in Oga's case is an entirely different story. Even then I prefer to not do this in C/Java as it further complicates the codebase. |
I see, thanks for the explanation. However, I'm not sure why both nokogiri and ox actually do decoding. Ideas? Here is ox test that assures decoding: https://github.com/ohler55/ox/blob/3764d26ab44f3202508a0b5ef2f9944b68bfef75/test/tests.rb#L314-L319 But I wasn't able to find similar for nokogiri, however it still does that: 2.1.5 :001 > Nokogiri::XML('<foo>’</foo>').at('foo').text
=> "’" |
@krasnoukhov This is described in http://www.w3.org/TR/REC-xml/#sec-references. Oga currently doesn't support this, though probably should in the future. |
Ok, now everything is clear. This PR indeed does not look like a good solution to the problem documented in #68, so I'm closing in favor of future effort. |
HTML entity decoding and encoding is hard. There are so many different symbols. Also HTML entities can be presented as named strings (
"
) and numerical ("
).It was nice trying to handcraft decoding/encoding of these, but it appears that it would be much better to use standalone library, which handles that complexity for us.