Documentation should explain when treetop parser is quadratic #31

cabo · 2018-01-18T13:23:34Z

When parsing a UTF-8 string, treetop's run time is quadratic with respect to the input size. When parsing BINARY ("ASCII-8BIT"), the run time is mostly linear. That behavior doesn't come up in small test cases and may surprise users.

(The reason is that treetop heavily uses indexing into a String instance, and that is generally linear in UTF-8 and constant-time in BINARY. Maybe there is a way to get the index caching in the Ruby String implementation to work for treetop, but it doesn't work for me.)

cjheath · 2018-01-19T01:22:19Z

Thank you for letting me know about the problem. I wasn't aware that Ruby strings had index caching. Can you please explore this possible solution further.

I find this frankly rather hilarious, given this discussion I had with Matz in 2002, over 15 years ago now!
https://groups.google.com/d/embed/msg/comp.lang.ruby/XcKQSG0MfRk/-WdVw3ZMmLMJ

longhotsummer · 2019-03-27T07:51:36Z

I'm interested in looking into this, too. Profiling with ruby-prof shows that 50% of parsing time is spent indexing into strings, mostly to compare the substring to a regexp.

$ ruby --version
ruby 2.6.2p47 (2019-03-13 revision 67232) [x86_64-darwin17]
$ bundle list treetop
/Users/greg/.rvm/gems/ruby-2.6.2/gems/treetop-1.6.10

I hadn't heard about the quadratic issue with UTF-8 strings, which would definitely help to explain this. Other approaches I was considering to help with this:

memoizing the substring lookup (if there are many repeats)
memoizing the substring lookup and the regexp comparison (again, if there are many repeats)
investigating a way to test a regexep for a match on a substring without having to extract the substring from the original string. (eg. something like regexp.match?(s, offset, length))

cjheath · 2019-03-27T08:15:53Z

I repeat, this is a Ruby problem, not a Treetop one. I told Matz how to fix it SEVENTEEN YEARS AGO, and he still hasn't bothered, so I'm certainly not going to.

longhotsummer · 2019-03-27T10:01:01Z

@cjheath I'm interested in working together to explore a way to work around this, if possible. I don't know enough about Ruby's strings to know how to tackle the underlying issue. I'm willing to put some effort into this.

While I understand the root cause is Ruby, it does have a significant impact on Treetop users. Would it be possible to keep this issue open to discuss ideas and approaches? I would appreciate your insight into the underlying issue.

cjheath · 2019-03-27T10:42:28Z

Ok, happy to re-open it, though I'm uncertain how we can fix it.

The issue is that when processing an encoding by counting variable-sized units (UTF-8 chars) there is often a need to count bytes. The "remembered point" technique that I used and requested that Matz implement remembers the total string length (after the first time it has been counted!) but also remembers the offset and character number of the last-accessed position. Because you can scan forwards or backwards in UTF-8 strings, that gives the runtime three different possible starting places (start, last-used, end) and based on those counts, it can decide which is the best place to index from. Very often that will be a small distance from the last-used point, e.g. when you have linear processing in either forward or backward direction. It's not rocket science, but it makes a big difference in total time spent searching.

Bear in mind that the input is not necessarily always a simple String. Because of a design error made very early on, some users (like my activefacts-cql parser) have to use a Delegate or duck-type for the String. Relying on it being a real String might produce breaking change for some users. We can't, for example break the string into N fragments (each fragment being a copy, not a sub-string) - and we shouldn't have to do that anyhow.

Any other suggestions?

longhotsummer · 2019-03-28T06:27:50Z

Thanks for the details, very interesting.

I think documenting this for users is crucial. I've been frustrated by this performance problem for a few months now, and understanding the issue let me solve it for my use case in a few hours. It has reduced parse times for some documents from 24 seconds to 3 seconds. I have another document that took an hour to parse which I have yet to test.

My use case is parsing plain text UTF-8 documents into XML. In this case, my workaround is to escape non-ASCII chars using URL-style %-encoding, force the string down to ASCII-8bit, parse it, and then un-escape the resulting XML back to UTF-8.

I can do this because my parser only uses ASCII literals and I trust the output that I get back from it. Other use cases may be different by something similar may be possible.

Here's my code:

require 'uri'
# use %-encoding to escape everything outside of the US_ASCII range,
# including encoding % itself.
unsafe = (0..126).to_a - ['%'.ord]
unsafe = unsafe.map { |i| '\u%04x' % i }
unsafe = Regexp.new('[^' + unsafe.join('') + ']')
# escape unsafe chars, this returns a US_ASCII encoded string
text = URI::DEFAULT_PARSER.escape(text, unsafe)

xml = xml_from_tree(@parser.parse(text))

# unescape, returns a UTF-8 encoded string
URI.unescape(xml)

longhotsummer · 2019-03-28T06:48:28Z

Another option is to memoize the substring lookups. On an example parse I've done, of 417875 substring lookups, 196296 (47%) are unique and the remainder occur more than once and would benefit from memoizing. This could be useful if a grammar has significant backtracking.

A third option is to document that users can provide their own implementation of has_terminal? by including a module into their grammar.

module MyModule
  def has_terminal?(terminal, mode, index)
    # custom implementation
  end
end

grammar Foo
  include MyModule

  rule start
    ...
  end
end

Finally, documenting that users can override their input string's [] method to provide their own substring implementation if necessary.

cjheath · 2019-03-28T06:56:06Z

You might be able to make it work. But remember, a memo is a String, not the Delegate or whatever is ducking for the input. So care would be needed. Also, has_terminal takes index, which it uses to... index the input. Whatever else it does, it has to look at the text.

cjheath · 2019-03-28T07:13:58Z

BTW, the "design error" I referred to was in passing around "input" instead of "parser" as a parameter down the descent. input is available from parser, but not vice versa. To add a fourth parameter would have slowed things down (and still been a breaking change), so we left it as-is and I added a "parser" method to my input Delegate.

This provides about an 8x speed improvement. See cjheath/treetop#31

See cjheath/treetop#31

longhotsummer · 2019-03-29T14:02:01Z

Using this workaround, a document that previously took an hour to parse, now takes 1 minute 55 seconds.

cjheath · 2019-03-29T21:48:17Z

Yes, the square law is like that. How big is the document? We should document that Ruby has an issue indexing large UTF-8 strings, but doing much more than that is pointless. I'll happily accept a documentation PR.

longhotsummer · 2019-03-30T08:00:35Z

The document is 1.2MB. I'll submit a PR.

longhotsummer · 2019-03-30T16:27:09Z

I did some testing with memoizing the substring lookup and it seems it's not really worth it. For a 100KB file it reduces parse times for my grammar from 16 seconds to 14 seconds. For the same input forced to ascii-encoding, it increased parse times from 3 seconds to 5 seconds.

def text.[](*args)
  @_memo ||= {}
  x = @_memo[args]
  x = @_memo[args] = super if x.nil?
  x
end

cjheath · 2019-04-01T00:11:05Z

Memoizing might not help, but the same technique could be used to record caller locations and frequency to see if there are any redundant calls being made. Also, we could evaluate whether taking a substring returns a reference to the original string data or a copy, because if it's a reference, we could use that to reduce the length of each index search for the byte offset of a given character. I don't expect to make these experiments anytime soon, but you might get inspired.

longhotsummer · 2019-05-08T09:14:23Z

#38 updates the documentation to talk about this issue. This issue can probably be closed since it deals specifically with the documentation issue.

cjheath · 2019-05-08T11:56:20Z

I wish I felt motivated enough to make a PR for Ruby's core String support...

kadarmarton · 2020-09-21T17:53:30Z

I am totally new to treetop and grammars, and all I understand here is some problem with variable length string encoding. So have you thought of making the problematic operation work on UTF-16 strings internally? Maybe I am totally missing the point though. Just free association based on text fragments I caught in your discussion.

cjheath · 2020-09-21T22:17:26Z

@kadarmarton No that wouldn't help. UTF-16 is still a variable-length encoding, so you can't just index to get characters by position, you need to count. What would help is if Matz had taken the advice I gave him in 2002 about how to implement UTF-8 strings with a "remembered point", which avoids most cases of having to count from the start of the string every time you index. Sigh. Plus ça change, plus c'est la même chose...

cabo · 2024-02-10T14:28:26Z

So what is missing in Ruby that treetop would need to be efficient?

Index a string using a byte offset
Increment the byte offset by one character (which would be +1 for character indexes)
Possibly slice using byte offsets, apply a regexp at a byte offset, find the byte size of the match?

This doesn't sound like much to add to Ruby.

cjheath · 2024-03-01T06:24:30Z

@cabo If you think so, I suggest you do it.

Personally I think there's quite a lot more to it, and convincing the community that it's in everyone's interests is even harder.

cjheath closed this as completed Mar 27, 2019

cjheath reopened this Mar 27, 2019

longhotsummer added a commit to laws-africa/slaw that referenced this issue Mar 29, 2019

HACK to parse ascii-8bit encoding

970bb70

This provides about an 8x speed improvement. See cjheath/treetop#31

longhotsummer added a commit to laws-africa/slaw that referenced this issue Mar 29, 2019

Add --ascii option for grammars that don't use unicode

8386b84

See cjheath/treetop#31

longhotsummer mentioned this issue Mar 29, 2019

Use us-ascii encoding for speed if possible laws-africa/slaw#31

Merged

longhotsummer mentioned this issue Mar 30, 2019

Document exponential parse times for unicode strings #38

Merged

cjheath closed this as completed May 8, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Documentation should explain when treetop parser is quadratic #31

Documentation should explain when treetop parser is quadratic #31

cabo commented Jan 18, 2018

cjheath commented Jan 19, 2018

longhotsummer commented Mar 27, 2019 •

edited

Loading

cjheath commented Mar 27, 2019

longhotsummer commented Mar 27, 2019

cjheath commented Mar 27, 2019 •

edited

Loading

longhotsummer commented Mar 28, 2019

longhotsummer commented Mar 28, 2019

cjheath commented Mar 28, 2019

cjheath commented Mar 28, 2019

longhotsummer commented Mar 29, 2019

cjheath commented Mar 29, 2019

longhotsummer commented Mar 30, 2019

longhotsummer commented Mar 30, 2019

cjheath commented Apr 1, 2019

longhotsummer commented May 8, 2019

cjheath commented May 8, 2019

kadarmarton commented Sep 21, 2020

cjheath commented Sep 21, 2020 •

edited

Loading

cabo commented Feb 10, 2024

cjheath commented Mar 1, 2024

Documentation should explain when treetop parser is quadratic #31

Documentation should explain when treetop parser is quadratic #31

Comments

cabo commented Jan 18, 2018

cjheath commented Jan 19, 2018

longhotsummer commented Mar 27, 2019 • edited Loading

cjheath commented Mar 27, 2019

longhotsummer commented Mar 27, 2019

cjheath commented Mar 27, 2019 • edited Loading

longhotsummer commented Mar 28, 2019

longhotsummer commented Mar 28, 2019

cjheath commented Mar 28, 2019

cjheath commented Mar 28, 2019

longhotsummer commented Mar 29, 2019

cjheath commented Mar 29, 2019

longhotsummer commented Mar 30, 2019

longhotsummer commented Mar 30, 2019

cjheath commented Apr 1, 2019

longhotsummer commented May 8, 2019

cjheath commented May 8, 2019

kadarmarton commented Sep 21, 2020

cjheath commented Sep 21, 2020 • edited Loading

cabo commented Feb 10, 2024

cjheath commented Mar 1, 2024

longhotsummer commented Mar 27, 2019 •

edited

Loading

cjheath commented Mar 27, 2019 •

edited

Loading

cjheath commented Sep 21, 2020 •

edited

Loading