Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Utilize htmlentities gem for decoding/encoding #65

Closed
wants to merge 4 commits into from

Conversation

krasnoukhov
Copy link
Contributor

HTML entity decoding and encoding is hard. There are so many different symbols. Also HTML entities can be presented as named strings (") and numerical (").

It was nice trying to handcraft decoding/encoding of these, but it appears that it would be much better to use standalone library, which handles that complexity for us.

@yorickpeterse
Copy link
Owner

I looked into this in the past, in fact it was one of my original ideas to use this Gem (see #49 (comment)). At that time I decided not to use it due to the sheer complexity of the library and the massive performance overhead it introduces.

As an example, take the benchmark benchmark/xml/lexer/string_average_bench.rb. Running this on the current master branch produces the following timings:

Iteration: 1: 0.677
Iteration: 2: 0.671
Iteration: 3: 0.671
Iteration: 4: 0.671
Iteration: 5: 0.674
Iteration: 6: 0.671
Iteration: 7: 0.67
Iteration: 8: 0.672
Iteration: 9: 0.675
Iteration: 10: 0.671

Iterations: 10
Average:    0.672 sec

Applying your patch and re-running the benchmark in turn results in the following:

Iteration: 1: 1.056
Iteration: 2: 1.051
Iteration: 3: 1.051
Iteration: 4: 1.054
Iteration: 5: 1.054
Iteration: 6: 1.08
Iteration: 7: 1.082
Iteration: 8: 1.079
Iteration: 9: 1.059
Iteration: 10: 1.085

Iterations: 10
Average:    1.065 sec

In other words, the change makes Oga's lexer about 1,6 times slower than before. At least memory usage is still consistent (measured using profile/xml/lexer/big_xml.rb).

Looking at all the things the htmlentities Gem is doing this isn't entirely surprising it slows things down. For example:

For the last example I sure as hell hope those instance_eval calls are only executed once as its going to make performance suck.

Considering I wanted the next release to be faster and not slower I'm not a huge fan of this PR.

As an aside, there are two other things worth mentioning.

First, the usage of CODER = HTMLEntities.new is not safe at all. While it looks like the HTMLEntities class is thread-safe (https://github.com/threedaymonk/htmlentities/blob/a68cc5ce9ad6ffaa06623a97d319ead4238bd1a0/lib/htmlentities.rb) there's no guarantee it will stay that way. Basically a requirement that I have with Oga, which I should add to the contributing guide, is no usage of global state/objects/whatever unless they are designed to be thread-safe (and can prove this).

Second, require calls should always go in lib/oga.rb and not in other random files. This too I should add to the contributing guide.

To summarize, if there was no extra overhead attached to this library I wouldn't really have any problems with it. I don't want to slow Oga down for everybody by default, especially if the feature in question is one I myself don't really rely on.

Performance wise my goal is to get Oga running at about ~15 MB of XML per second (both lexing and parsing) with the theoretical maximum being 20MB at the moment (due to Ruby's baseline overhead). Until I've achieved that (or something close to it) I can't afford to add anything that slows Oga down. I'll keep this PR open for the time being as I might merge it given I can achieve the above.

@yorickpeterse
Copy link
Owner

Worth mentioning: the impact of this is huge due to HTML/XML entity decoding happening for every T_TEXT and T_STRING_BODY node the lexer produces.

@yorickpeterse
Copy link
Owner

I added the two notes mentioned above to the contributing guide in 81c49b5.

@krasnoukhov
Copy link
Contributor Author

Hey, thank you for your time you're spending on this issue. I fully understand your concerns and appreciate that.

The reason why I'm working on this is that I want people to be able to use oga as a sax-machine handler. Actually I've already released a new version of it with oga support. There is another library, called feedjira which uses sax-machine and I believe it has users who could benefit from using oga as underlying handler.

So, while sax-machine specs are passing w/ oga, there are failing specs on feedjira repo and all of them are related to this issue. Please take a look on this Travis build for more details. Frankly speaking, current solution with html entities is not standard-compliant, so I'd consider it as a bug. Currently people can't rely on oga for correct HTML entities decoding/encoding, and this is quite bad. Sure, it works if they don't care about HTML inside the XML they parse. But are you sure that 100% of your users do that?

Ok, back to the issue. Here are string_average_bench.rb results on master on my machine:

Iteration: 1: 0.928
Iteration: 2: 0.938
Iteration: 3: 0.934
Iteration: 4: 0.924
Iteration: 5: 1.312
Iteration: 6: 0.918
Iteration: 7: 0.925
Iteration: 8: 0.995
Iteration: 9: 1.045
Iteration: 10: 0.937

Iterations: 10
Average:    0.986 sec

After optimization with & is applied:

Iteration: 1: 0.936
Iteration: 2: 0.934
Iteration: 3: 0.909
Iteration: 4: 0.939
Iteration: 5: 0.928
Iteration: 6: 0.971
Iteration: 7: 0.943
Iteration: 8: 0.926
Iteration: 9: 0.92
Iteration: 10: 1.043

Iterations: 10
Average:    0.945 sec

You were right, we can benefit in speed greatly by using this simple trick!

After removing global instance:

Iteration: 1: 0.943
Iteration: 2: 0.972
Iteration: 3: 1.564
Iteration: 4: 1.001
Iteration: 5: 1.035
Iteration: 6: 0.946
Iteration: 7: 0.921
Iteration: 8: 0.999
Iteration: 9: 0.958
Iteration: 10: 0.929

Iterations: 10
Average:    1.027 sec

This one slows things a bit comparing to the global instance option. However, it appears only about 4-5% slower than master. Let me know what you think!

@krasnoukhov
Copy link
Contributor Author

Also, running profile/xml/lexer/big_xml.rb on master shows me constant 20.0 MB and 23.0 MB on this branch.

@yorickpeterse
Copy link
Owner

Checking if & is indeed a trick that can speed things up when no entities are present. However, for input with entities I still expect a 1,5x decrease in performance.

As a start we should add a benchmark for lexing data with XML and HTML entities, but I'm not sure how to get a fixture of a decent size (10MB) for this. The current XML fixture was generated using xmlgen but that tool has no option to randomly include entities as far as I know.

An alternative to using the htmlentities Gem would be to simply rip the entity mapping from some website and use that similar to the existing entities (but only for HTML documents). This will still slow things down, but hopefully it's less than 1,5 times.

@jrochkind
Copy link

I wonder if this is something that makes sense to do with native code, in C and Java. HTML entity encoding/decoding is a very self-contained task, and a common enough one that there are probably high performance C and Java solutions. I don't know how/if that would fit into oga's codebase though.

@yorickpeterse
Copy link
Owner

@jrochkind It's certainly possible, but it would have to be duplicated for both C and Java. An alternative would be to define Ragel rules for every entity and stitch things back together in Ruby, I however doubt this is going to be that much faster.

@krasnoukhov
Copy link
Contributor Author

and use that similar to the existing entities (but only for HTML documents)

@yorickpeterse, what do you mean by that? Why is it not used only for HTML documents now?

I agree with other suggestions. I'll try to do something in that regard this week.

@yorickpeterse
Copy link
Owner

what do you mean by that? Why is it not used only for HTML documents now?

What I meant is that HTML entity encoding/decoding should only occur in HTML documents, not XML documents.

@krasnoukhov
Copy link
Contributor Author

I must disagree with that. The case I'm talking about is HTML data embedded (in encoded format) in XML document. If we change decoding occur only in HTML documents we'll loose even already existing basic support. Am I missing something?

@jrochkind
Copy link

Right, C and Java. "translating HTML entities" is a fairly well-defined problem, so I'm suspecting there is a high performance solution for both C and Java already existing, which both do the same thing. But I could be wrong.

@jrochkind
Copy link

@krasnoukhov For HTML data embedded in an XML document, you should not be expecting the parser to decode HTML entities. The parser has no way of knowing that that data is HTML. The parser's job is giving back to you the exact string literal that was in the XML text node (with XML escaping undone, the literal text not the XML serialization), no more no less.

If you as the consumer know that that data is actually HTML, and may have HTML entities that need to be decoded -- you can use the htmlentities gem to decode it yourself, pretty easily.

If the parser were decoding HTML entities found in text nodes of an XML document, I would consider that a bug.

@yorickpeterse
Copy link
Owner

@krasnoukhov XML only has 5 pre-defined entities listed here http://www.w3.org/TR/REC-xml/#sec-predefined-ent, of which apparently I've forgotten to add 2 to Oga (see #66). Oga is not going to decode/encode HTML entities in XML documents. However, encoding/decoding HTML entities in HTML documents is a different story.

@jrochkind There being a library for it is one thing, it being usable in Oga's case is an entirely different story. Even then I prefer to not do this in C/Java as it further complicates the codebase.

@krasnoukhov
Copy link
Contributor Author

I see, thanks for the explanation. However, I'm not sure why both nokogiri and ox actually do decoding. Ideas?

Here is ox test that assures decoding: https://github.com/ohler55/ox/blob/3764d26ab44f3202508a0b5ef2f9944b68bfef75/test/tests.rb#L314-L319

But I wasn't able to find similar for nokogiri, however it still does that:

2.1.5 :001 > Nokogiri::XML('<foo>&#8217;</foo>').at('foo').text
 => "’" 

@yorickpeterse
Copy link
Owner

@krasnoukhov This is described in http://www.w3.org/TR/REC-xml/#sec-references. Oga currently doesn't support this, though probably should in the future.

@krasnoukhov
Copy link
Contributor Author

Ok, now everything is clear. This PR indeed does not look like a good solution to the problem documented in #68, so I'm closing in favor of future effort.

@krasnoukhov krasnoukhov deleted the htmlentities branch November 21, 2014 00:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants