Utilize htmlentities gem for decoding/encoding #65

krasnoukhov · 2014-11-20T17:39:38Z

HTML entity decoding and encoding is hard. There are so many different symbols. Also HTML entities can be presented as named strings (") and numerical (").

It was nice trying to handcraft decoding/encoding of these, but it appears that it would be much better to use standalone library, which handles that complexity for us.

yorickpeterse · 2014-11-20T19:01:35Z

I looked into this in the past, in fact it was one of my original ideas to use this Gem (see #49 (comment)). At that time I decided not to use it due to the sheer complexity of the library and the massive performance overhead it introduces.

As an example, take the benchmark benchmark/xml/lexer/string_average_bench.rb. Running this on the current master branch produces the following timings:

Iteration: 1: 0.677
Iteration: 2: 0.671
Iteration: 3: 0.671
Iteration: 4: 0.671
Iteration: 5: 0.674
Iteration: 6: 0.671
Iteration: 7: 0.67
Iteration: 8: 0.672
Iteration: 9: 0.675
Iteration: 10: 0.671

Iterations: 10
Average:    0.672 sec

Applying your patch and re-running the benchmark in turn results in the following:

Iteration: 1: 1.056
Iteration: 2: 1.051
Iteration: 3: 1.051
Iteration: 4: 1.054
Iteration: 5: 1.054
Iteration: 6: 1.08
Iteration: 7: 1.082
Iteration: 8: 1.079
Iteration: 9: 1.059
Iteration: 10: 1.085

Iterations: 10
Average:    1.065 sec

In other words, the change makes Oga's lexer about 1,6 times slower than before. At least memory usage is still consistent (measured using profile/xml/lexer/big_xml.rb).

Looking at all the things the htmlentities Gem is doing this isn't entirely surprising it slows things down. For example:

For the last example I sure as hell hope those instance_eval calls are only executed once as its going to make performance suck.

Considering I wanted the next release to be faster and not slower I'm not a huge fan of this PR.

As an aside, there are two other things worth mentioning.

First, the usage of CODER = HTMLEntities.new is not safe at all. While it looks like the HTMLEntities class is thread-safe (https://github.com/threedaymonk/htmlentities/blob/a68cc5ce9ad6ffaa06623a97d319ead4238bd1a0/lib/htmlentities.rb) there's no guarantee it will stay that way. Basically a requirement that I have with Oga, which I should add to the contributing guide, is no usage of global state/objects/whatever unless they are designed to be thread-safe (and can prove this).

Second, require calls should always go in lib/oga.rb and not in other random files. This too I should add to the contributing guide.

To summarize, if there was no extra overhead attached to this library I wouldn't really have any problems with it. I don't want to slow Oga down for everybody by default, especially if the feature in question is one I myself don't really rely on.

Performance wise my goal is to get Oga running at about ~15 MB of XML per second (both lexing and parsing) with the theoretical maximum being 20MB at the moment (due to Ruby's baseline overhead). Until I've achieved that (or something close to it) I can't afford to add anything that slows Oga down. I'll keep this PR open for the time being as I might merge it given I can achieve the above.

yorickpeterse · 2014-11-20T19:03:00Z

Worth mentioning: the impact of this is huge due to HTML/XML entity decoding happening for every T_TEXT and T_STRING_BODY node the lexer produces.

yorickpeterse · 2014-11-20T19:12:09Z

I added the two notes mentioned above to the contributing guide in 81c49b5.

krasnoukhov · 2014-11-20T19:50:24Z

Hey, thank you for your time you're spending on this issue. I fully understand your concerns and appreciate that.

The reason why I'm working on this is that I want people to be able to use oga as a sax-machine handler. Actually I've already released a new version of it with oga support. There is another library, called feedjira which uses sax-machine and I believe it has users who could benefit from using oga as underlying handler.

So, while sax-machine specs are passing w/ oga, there are failing specs on feedjira repo and all of them are related to this issue. Please take a look on this Travis build for more details. Frankly speaking, current solution with html entities is not standard-compliant, so I'd consider it as a bug. Currently people can't rely on oga for correct HTML entities decoding/encoding, and this is quite bad. Sure, it works if they don't care about HTML inside the XML they parse. But are you sure that 100% of your users do that?

Ok, back to the issue. Here are string_average_bench.rb results on master on my machine:

Iteration: 1: 0.928
Iteration: 2: 0.938
Iteration: 3: 0.934
Iteration: 4: 0.924
Iteration: 5: 1.312
Iteration: 6: 0.918
Iteration: 7: 0.925
Iteration: 8: 0.995
Iteration: 9: 1.045
Iteration: 10: 0.937

Iterations: 10
Average:    0.986 sec

After optimization with & is applied:

Iteration: 1: 0.936
Iteration: 2: 0.934
Iteration: 3: 0.909
Iteration: 4: 0.939
Iteration: 5: 0.928
Iteration: 6: 0.971
Iteration: 7: 0.943
Iteration: 8: 0.926
Iteration: 9: 0.92
Iteration: 10: 1.043

Iterations: 10
Average:    0.945 sec

You were right, we can benefit in speed greatly by using this simple trick!

After removing global instance:

Iteration: 1: 0.943
Iteration: 2: 0.972
Iteration: 3: 1.564
Iteration: 4: 1.001
Iteration: 5: 1.035
Iteration: 6: 0.946
Iteration: 7: 0.921
Iteration: 8: 0.999
Iteration: 9: 0.958
Iteration: 10: 0.929

Iterations: 10
Average:    1.027 sec

This one slows things a bit comparing to the global instance option. However, it appears only about 4-5% slower than master. Let me know what you think!

krasnoukhov · 2014-11-20T19:57:05Z

Also, running profile/xml/lexer/big_xml.rb on master shows me constant 20.0 MB and 23.0 MB on this branch.

yorickpeterse · 2014-11-20T20:19:34Z

Checking if & is indeed a trick that can speed things up when no entities are present. However, for input with entities I still expect a 1,5x decrease in performance.

As a start we should add a benchmark for lexing data with XML and HTML entities, but I'm not sure how to get a fixture of a decent size (10MB) for this. The current XML fixture was generated using xmlgen but that tool has no option to randomly include entities as far as I know.

An alternative to using the htmlentities Gem would be to simply rip the entity mapping from some website and use that similar to the existing entities (but only for HTML documents). This will still slow things down, but hopefully it's less than 1,5 times.

jrochkind · 2014-11-20T20:26:38Z

I wonder if this is something that makes sense to do with native code, in C and Java. HTML entity encoding/decoding is a very self-contained task, and a common enough one that there are probably high performance C and Java solutions. I don't know how/if that would fit into oga's codebase though.

yorickpeterse · 2014-11-20T20:36:51Z

@jrochkind It's certainly possible, but it would have to be duplicated for both C and Java. An alternative would be to define Ragel rules for every entity and stitch things back together in Ruby, I however doubt this is going to be that much faster.

krasnoukhov · 2014-11-20T21:36:15Z

and use that similar to the existing entities (but only for HTML documents)

@yorickpeterse, what do you mean by that? Why is it not used only for HTML documents now?

I agree with other suggestions. I'll try to do something in that regard this week.

yorickpeterse · 2014-11-20T21:52:07Z

what do you mean by that? Why is it not used only for HTML documents now?

What I meant is that HTML entity encoding/decoding should only occur in HTML documents, not XML documents.

krasnoukhov · 2014-11-20T21:55:18Z

I must disagree with that. The case I'm talking about is HTML data embedded (in encoded format) in XML document. If we change decoding occur only in HTML documents we'll loose even already existing basic support. Am I missing something?

jrochkind · 2014-11-20T22:21:56Z

Right, C and Java. "translating HTML entities" is a fairly well-defined problem, so I'm suspecting there is a high performance solution for both C and Java already existing, which both do the same thing. But I could be wrong.

jrochkind · 2014-11-20T22:23:44Z

@krasnoukhov For HTML data embedded in an XML document, you should not be expecting the parser to decode HTML entities. The parser has no way of knowing that that data is HTML. The parser's job is giving back to you the exact string literal that was in the XML text node (with XML escaping undone, the literal text not the XML serialization), no more no less.

If you as the consumer know that that data is actually HTML, and may have HTML entities that need to be decoded -- you can use the htmlentities gem to decode it yourself, pretty easily.

If the parser were decoding HTML entities found in text nodes of an XML document, I would consider that a bug.

yorickpeterse · 2014-11-20T22:27:33Z

@krasnoukhov XML only has 5 pre-defined entities listed here http://www.w3.org/TR/REC-xml/#sec-predefined-ent, of which apparently I've forgotten to add 2 to Oga (see #66). Oga is not going to decode/encode HTML entities in XML documents. However, encoding/decoding HTML entities in HTML documents is a different story.

@jrochkind There being a library for it is one thing, it being usable in Oga's case is an entirely different story. Even then I prefer to not do this in C/Java as it further complicates the codebase.

krasnoukhov · 2014-11-20T23:34:30Z

I see, thanks for the explanation. However, I'm not sure why both nokogiri and ox actually do decoding. Ideas?

Here is ox test that assures decoding: https://github.com/ohler55/ox/blob/3764d26ab44f3202508a0b5ef2f9944b68bfef75/test/tests.rb#L314-L319

But I wasn't able to find similar for nokogiri, however it still does that:

2.1.5 :001 > Nokogiri::XML('<foo>&#8217;</foo>').at('foo').text
 => "’"

yorickpeterse · 2014-11-20T23:47:51Z

@krasnoukhov This is described in http://www.w3.org/TR/REC-xml/#sec-references. Oga currently doesn't support this, though probably should in the future.

krasnoukhov · 2014-11-21T00:01:32Z

Ok, now everything is clear. This PR indeed does not look like a good solution to the problem documented in #68, so I'm closing in favor of future effort.

Utilize htmlentities gem for decoding/encoding

c644b11

yorickpeterse added Feature XML Lexer labels Nov 20, 2014

Entities: apply simple &-check optimization

833df97

krasnoukhov force-pushed the htmlentities branch from 9d742a4 to 833df97 Compare November 20, 2014 19:30

Entities: move require to the lib/oga.rb

52f53b2

krasnoukhov force-pushed the htmlentities branch from 2011aee to b0ec657 Compare November 20, 2014 19:44

Entities: get rid from global instance

59dbe4e

krasnoukhov force-pushed the htmlentities branch from b0ec657 to 59dbe4e Compare November 20, 2014 19:44

krasnoukhov mentioned this pull request Nov 20, 2014

Oga SAX handler feedjira/feedjira#254

Closed

krasnoukhov closed this Nov 21, 2014

krasnoukhov deleted the htmlentities branch November 21, 2014 00:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Utilize htmlentities gem for decoding/encoding #65

Utilize htmlentities gem for decoding/encoding #65

krasnoukhov commented Nov 20, 2014

yorickpeterse commented Nov 20, 2014

yorickpeterse commented Nov 20, 2014

yorickpeterse commented Nov 20, 2014

krasnoukhov commented Nov 20, 2014

krasnoukhov commented Nov 20, 2014

yorickpeterse commented Nov 20, 2014

jrochkind commented Nov 20, 2014

yorickpeterse commented Nov 20, 2014

krasnoukhov commented Nov 20, 2014

yorickpeterse commented Nov 20, 2014

krasnoukhov commented Nov 20, 2014

jrochkind commented Nov 20, 2014

jrochkind commented Nov 20, 2014

yorickpeterse commented Nov 20, 2014

krasnoukhov commented Nov 20, 2014

yorickpeterse commented Nov 20, 2014

krasnoukhov commented Nov 21, 2014

Utilize htmlentities gem for decoding/encoding #65

Utilize htmlentities gem for decoding/encoding #65

Conversation

krasnoukhov commented Nov 20, 2014

yorickpeterse commented Nov 20, 2014

yorickpeterse commented Nov 20, 2014

yorickpeterse commented Nov 20, 2014

krasnoukhov commented Nov 20, 2014

krasnoukhov commented Nov 20, 2014

yorickpeterse commented Nov 20, 2014

jrochkind commented Nov 20, 2014

yorickpeterse commented Nov 20, 2014

krasnoukhov commented Nov 20, 2014

yorickpeterse commented Nov 20, 2014

krasnoukhov commented Nov 20, 2014

jrochkind commented Nov 20, 2014

jrochkind commented Nov 20, 2014

yorickpeterse commented Nov 20, 2014

krasnoukhov commented Nov 20, 2014

yorickpeterse commented Nov 20, 2014

krasnoukhov commented Nov 21, 2014