Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation should explain when treetop parser is quadratic #31

Closed
cabo opened this issue Jan 18, 2018 · 20 comments
Closed

Documentation should explain when treetop parser is quadratic #31

cabo opened this issue Jan 18, 2018 · 20 comments

Comments

@cabo
Copy link

cabo commented Jan 18, 2018

When parsing a UTF-8 string, treetop's run time is quadratic with respect to the input size. When parsing BINARY ("ASCII-8BIT"), the run time is mostly linear. That behavior doesn't come up in small test cases and may surprise users.

(The reason is that treetop heavily uses indexing into a String instance, and that is generally linear in UTF-8 and constant-time in BINARY. Maybe there is a way to get the index caching in the Ruby String implementation to work for treetop, but it doesn't work for me.)

@cjheath
Copy link
Owner

cjheath commented Jan 19, 2018

Thank you for letting me know about the problem. I wasn't aware that Ruby strings had index caching. Can you please explore this possible solution further.

I find this frankly rather hilarious, given this discussion I had with Matz in 2002, over 15 years ago now!
https://groups.google.com/d/embed/msg/comp.lang.ruby/XcKQSG0MfRk/-WdVw3ZMmLMJ

@longhotsummer
Copy link

longhotsummer commented Mar 27, 2019

I'm interested in looking into this, too. Profiling with ruby-prof shows that 50% of parsing time is spent indexing into strings, mostly to compare the substring to a regexp.

$ ruby --version
ruby 2.6.2p47 (2019-03-13 revision 67232) [x86_64-darwin17]
$ bundle list treetop
/Users/greg/.rvm/gems/ruby-2.6.2/gems/treetop-1.6.10

profile html#Array_ _70207986507560 2019-03-27 09-46-06

I hadn't heard about the quadratic issue with UTF-8 strings, which would definitely help to explain this. Other approaches I was considering to help with this:

  • memoizing the substring lookup (if there are many repeats)
  • memoizing the substring lookup and the regexp comparison (again, if there are many repeats)
  • investigating a way to test a regexep for a match on a substring without having to extract the substring from the original string. (eg. something like regexp.match?(s, offset, length))

@cjheath
Copy link
Owner

cjheath commented Mar 27, 2019

I repeat, this is a Ruby problem, not a Treetop one. I told Matz how to fix it SEVENTEEN YEARS AGO, and he still hasn't bothered, so I'm certainly not going to.

@cjheath cjheath closed this as completed Mar 27, 2019
@longhotsummer
Copy link

@cjheath I'm interested in working together to explore a way to work around this, if possible. I don't know enough about Ruby's strings to know how to tackle the underlying issue. I'm willing to put some effort into this.

While I understand the root cause is Ruby, it does have a significant impact on Treetop users. Would it be possible to keep this issue open to discuss ideas and approaches? I would appreciate your insight into the underlying issue.

@cjheath
Copy link
Owner

cjheath commented Mar 27, 2019

Ok, happy to re-open it, though I'm uncertain how we can fix it.

The issue is that when processing an encoding by counting variable-sized units (UTF-8 chars) there is often a need to count bytes. The "remembered point" technique that I used and requested that Matz implement remembers the total string length (after the first time it has been counted!) but also remembers the offset and character number of the last-accessed position. Because you can scan forwards or backwards in UTF-8 strings, that gives the runtime three different possible starting places (start, last-used, end) and based on those counts, it can decide which is the best place to index from. Very often that will be a small distance from the last-used point, e.g. when you have linear processing in either forward or backward direction. It's not rocket science, but it makes a big difference in total time spent searching.

Bear in mind that the input is not necessarily always a simple String. Because of a design error made very early on, some users (like my activefacts-cql parser) have to use a Delegate or duck-type for the String. Relying on it being a real String might produce breaking change for some users. We can't, for example break the string into N fragments (each fragment being a copy, not a sub-string) - and we shouldn't have to do that anyhow.

Any other suggestions?

@cjheath cjheath reopened this Mar 27, 2019
@longhotsummer
Copy link

Thanks for the details, very interesting.

I think documenting this for users is crucial. I've been frustrated by this performance problem for a few months now, and understanding the issue let me solve it for my use case in a few hours. It has reduced parse times for some documents from 24 seconds to 3 seconds. I have another document that took an hour to parse which I have yet to test.

My use case is parsing plain text UTF-8 documents into XML. In this case, my workaround is to escape non-ASCII chars using URL-style %-encoding, force the string down to ASCII-8bit, parse it, and then un-escape the resulting XML back to UTF-8.

I can do this because my parser only uses ASCII literals and I trust the output that I get back from it. Other use cases may be different by something similar may be possible.

Here's my code:

require 'uri'
# use %-encoding to escape everything outside of the US_ASCII range,
# including encoding % itself.
unsafe = (0..126).to_a - ['%'.ord]
unsafe = unsafe.map { |i| '\u%04x' % i }
unsafe = Regexp.new('[^' + unsafe.join('') + ']')
# escape unsafe chars, this returns a US_ASCII encoded string
text = URI::DEFAULT_PARSER.escape(text, unsafe)

xml = xml_from_tree(@parser.parse(text))

# unescape, returns a UTF-8 encoded string
URI.unescape(xml)

@longhotsummer
Copy link

Another option is to memoize the substring lookups. On an example parse I've done, of 417875 substring lookups, 196296 (47%) are unique and the remainder occur more than once and would benefit from memoizing. This could be useful if a grammar has significant backtracking.

A third option is to document that users can provide their own implementation of has_terminal? by including a module into their grammar.

module MyModule
  def has_terminal?(terminal, mode, index)
    # custom implementation
  end
end

grammar Foo
  include MyModule

  rule start
    ...
  end
end

Finally, documenting that users can override their input string's [] method to provide their own substring implementation if necessary.

@cjheath
Copy link
Owner

cjheath commented Mar 28, 2019

You might be able to make it work. But remember, a memo is a String, not the Delegate or whatever is ducking for the input. So care would be needed. Also, has_terminal takes index, which it uses to... index the input. Whatever else it does, it has to look at the text.

@cjheath
Copy link
Owner

cjheath commented Mar 28, 2019

BTW, the "design error" I referred to was in passing around "input" instead of "parser" as a parameter down the descent. input is available from parser, but not vice versa. To add a fourth parameter would have slowed things down (and still been a breaking change), so we left it as-is and I added a "parser" method to my input Delegate.

longhotsummer added a commit to laws-africa/slaw that referenced this issue Mar 29, 2019
This provides about an 8x speed improvement.

See cjheath/treetop#31
longhotsummer added a commit to laws-africa/slaw that referenced this issue Mar 29, 2019
@longhotsummer
Copy link

Using this workaround, a document that previously took an hour to parse, now takes 1 minute 55 seconds.

@cjheath
Copy link
Owner

cjheath commented Mar 29, 2019

Yes, the square law is like that. How big is the document? We should document that Ruby has an issue indexing large UTF-8 strings, but doing much more than that is pointless. I'll happily accept a documentation PR.

@longhotsummer
Copy link

The document is 1.2MB. I'll submit a PR.

@longhotsummer
Copy link

I did some testing with memoizing the substring lookup and it seems it's not really worth it. For a 100KB file it reduces parse times for my grammar from 16 seconds to 14 seconds. For the same input forced to ascii-encoding, it increased parse times from 3 seconds to 5 seconds.

def text.[](*args)
  @_memo ||= {}
  x = @_memo[args]
  x = @_memo[args] = super if x.nil?
  x
end

@cjheath
Copy link
Owner

cjheath commented Apr 1, 2019

Memoizing might not help, but the same technique could be used to record caller locations and frequency to see if there are any redundant calls being made. Also, we could evaluate whether taking a substring returns a reference to the original string data or a copy, because if it's a reference, we could use that to reduce the length of each index search for the byte offset of a given character. I don't expect to make these experiments anytime soon, but you might get inspired.

@longhotsummer
Copy link

#38 updates the documentation to talk about this issue. This issue can probably be closed since it deals specifically with the documentation issue.

@cjheath
Copy link
Owner

cjheath commented May 8, 2019

I wish I felt motivated enough to make a PR for Ruby's core String support...

@cjheath cjheath closed this as completed May 8, 2019
@kadarmarton
Copy link

I am totally new to treetop and grammars, and all I understand here is some problem with variable length string encoding. So have you thought of making the problematic operation work on UTF-16 strings internally? Maybe I am totally missing the point though. Just free association based on text fragments I caught in your discussion.

@cjheath
Copy link
Owner

cjheath commented Sep 21, 2020

@kadarmarton No that wouldn't help. UTF-16 is still a variable-length encoding, so you can't just index to get characters by position, you need to count. What would help is if Matz had taken the advice I gave him in 2002 about how to implement UTF-8 strings with a "remembered point", which avoids most cases of having to count from the start of the string every time you index. Sigh. Plus ça change, plus c'est la même chose...

@cabo
Copy link
Author

cabo commented Feb 10, 2024

So what is missing in Ruby that treetop would need to be efficient?

  • Index a string using a byte offset
  • Increment the byte offset by one character (which would be +1 for character indexes)
  • Possibly slice using byte offsets, apply a regexp at a byte offset, find the byte size of the match?

This doesn't sound like much to add to Ruby.

@cjheath
Copy link
Owner

cjheath commented Mar 1, 2024

@cabo If you think so, I suggest you do it.

Personally I think there's quite a lot more to it, and convincing the community that it's in everyone's interests is even harder.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants