
Support for parser combinators #27

Open
axman6 opened this issue May 24, 2017 · 7 comments


axman6 commented May 24, 2017

Many of the regex examples have reached the point where a parser combinator library would be a much better option. A prime example is the URL matcher, which could be defined precisely with a parser combinator but is currently fairly ad hoc and loses a lot of information (the path doesn't work for URLs that contain usernames and passwords, something users might want to match on to forbid or warn people who post URLs they shouldn't):

ruleURL :: Rule
ruleURL = Rule
  { name = "url"
  , pattern =
    [ regex "((([a-zA-Z]+)://)?(w{2,3}[0-9]*\\.)?(([\\w_-]+\\.)+[a-z]{2,4})(:(\\d+))?(/[^?\\s#]*)?(\\?[^\\s#]+)?)"
    ]
  , prod = \tokens -> case tokens of
      (Token RegexMatch (GroupMatch (m:_:_protocol:_:domain:_:_:_port:_path:_query:_)):
       _) -> Just . Token Url $ url m domain
      _ -> Nothing
  }

(For this specific example, the Network.URI package already provides parseURI :: String -> Maybe URI)
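As a sketch of that idea (using only functions network-uri actually exports; urlFields is a hypothetical helper, not Duckling code), a rule's production could hand the coarse regex match to parseURI and read structured fields back, including the userinfo the current regex drops:

```haskell
import Network.URI (URIAuth (..), parseURI, uriAuthority, uriScheme)

-- Hypothetical helper: validate a candidate match with parseURI and
-- pull out structured pieces instead of positional capture groups.
-- Note: uriScheme keeps its trailing ':' and uriUserInfo its trailing '@'.
urlFields :: String -> Maybe (String, String, String)
urlFields s = do
  uri  <- parseURI s
  auth <- uriAuthority uri
  return (uriScheme uri, uriUserInfo auth, uriRegName auth)
```

Anything parseURI rejects would simply produce no token, which is exactly the false-positive filtering a coarse regex needs.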

I don't have an implementation for this yet (nor a preference for a combinator library) because I don't fully understand how Duckling fits together, and I wanted to open this to start a discussion about it.


niteria commented May 25, 2017

Hi @axman6,

You bring up an interesting question. I've been thinking about this and I have some ideas, but I will need a bit of time to express them in more detail.

The TL;DR is:

  1. Using Regexps in Duckling sets the barrier to contribution quite low. Duckling has benefited a lot from this, first in the Clojure incarnation and now as a Haskell project: most of the rules were contributed.

  2. There are some optimizations we can do with Regexps that we cannot with parser combinators. I have a change in progress that compiles a giant NFA from all the heads of rules, which can filter out rules that have no chance of applying. On the test cases I've tried, it halves the cost of Duckling. This change essentially keeps the cost of Duckling from growing with each added rule.

  3. There's room for combinator-like constructs in Duckling, but probably not via a stock combinator library. Using a Free Applicative, we could make the rules expressible as:

{-# LANGUAGE ApplicativeDo #-}
ruleURL :: Rule
ruleURL = mkRule "url" $ do
  m:_:_protocol:_:domain:_:_:_port:_path:_query:_ <- regex 
    "((([a-zA-Z]+)://)?(w{2,3}[0-9]*\\.)?(([\\w_-]+\\.)+[a-z]{2,4})(:(\\d+))?(/[^?\\s#]*)?(\\?[^\\s#]+)?)"
  return $ Token Url $ url m domain
  4. The shortcomings of parsing URLs today are orthogonal to the choice of underlying parsing library. We could make it work with Regexps. Or we can have a Regexp that overapproximates and use parseURI :: String -> Maybe URI to filter out false positives.
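To make point 3 concrete, here's a minimal self-contained sketch of the free-applicative encoding (the real thing would use Control.Applicative.Free from the free package; RulePrim, regex, and heads here are hypothetical illustrations):

```haskell
{-# LANGUAGE GADTs #-}

-- A minimal free applicative (stand-in for Control.Applicative.Free).
data Ap f a where
  Pure :: a -> Ap f a
  Ap   :: f a -> Ap f (a -> b) -> Ap f b

instance Functor (Ap f) where
  fmap g (Pure a) = Pure (g a)
  fmap g (Ap x k) = Ap x (fmap (g .) k)

instance Applicative (Ap f) where
  pure = Pure
  Pure g <*> y = fmap g y
  Ap x k <*> y = Ap x (flip <$> k <*> y)

-- One hypothetical primitive: match a regex, yielding its capture groups.
data RulePrim a where
  Regex :: String -> RulePrim [String]

regex :: String -> Ap RulePrim [String]
regex p = Ap (Regex p) (Pure id)

-- Because the structure is static, an interpreter can inspect a rule's
-- regex heads without running it -- exactly what an NFA prefilter needs.
heads :: Ap RulePrim a -> [String]
heads (Pure _)         = []
heads (Ap (Regex p) k) = p : heads k
```

This is why a free applicative (rather than an arbitrary monadic combinator library) would preserve the optimization in point 2: the regex heads remain statically inspectable.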

I hope to elaborate on 2) and 3) in the future; this is just a sketch.

Anyways, thanks for thinking about this; if you have ideas, we're happy to hear them.


s4s0l commented May 29, 2017

what is NFA?


niteria commented May 29, 2017

NFA - https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton
Compiling regexps down to NFAs gives predictable polynomial performance.
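As a toy illustration of why that is (hypothetical types, not Duckling's actual in-progress change), a set-of-states NFA simulation processes each input character once against the current state set, so its cost is bounded by input length times state count, unlike backtracking regex engines, which can be exponential:

```haskell
import Data.List (nub)

-- A toy NFA over Int states and Char input (epsilon moves omitted).
data NFA = NFA
  { start  :: [Int]
  , accept :: [Int]
  , step   :: Int -> Char -> [Int]
  }

-- Set-of-states simulation: advance every live state on each character.
-- Cost is O(input length * number of states), never exponential.
accepts :: NFA -> String -> Bool
accepts nfa = any (`elem` accept nfa) . foldl move (start nfa)
  where move states c = nub [s' | s <- states, s' <- step nfa s c]

-- Example NFA for the toy language a*b.
astarb :: NFA
astarb = NFA { start = [0], accept = [1], step = delta }
  where
    delta 0 'a' = [0]
    delta 0 'b' = [1]
    delta _ _   = []
```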


dmvianna commented Dec 14, 2017

Would you consider supporting both regexes and parser combinators? I take your point for the simple cases of URLs and emails. However, this machinery and contribution model could easily encompass, say, street addresses. I see many parallels between Duckling's and libpostal's architectures, but I find it unlikely we could achieve a sufficiently good street address parser using regexes, even with community support.

This might be a non-goal, but it is a worthy one.

@patapizza
Contributor

@dmvianna I wouldn't discard the idea, though the added value should outweigh the complexity cost.
You're right about using regexes for street addresses; it doesn't work properly. At this point postal addresses are out of scope, as they require resolution against an external data source.

It seems like libpostal is a good candidate for that, so I wouldn't jump into reinventing the wheel. What are its limitations that Duckling could overcome?


dmvianna commented Dec 21, 2017

Recursion. As far as I understand, libpostal is not a good candidate for parsing free text. For example, say I want to parse street addresses out of patent abstracts (I have some toy code and data here), or other cases I routinely find at work, such as 1-25, 30-40 and 150 Collins St or even 2 Fictional Rd and 3B Epic Av. libpostal simply won't parse multiple addresses in one string properly. And guess what, even regexes can parse common cases of these examples; parser combinators can improve on that.
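For what it's worth, even base's Text.ParserCombinators.ReadP can sketch the number-range case. Everything below (Range, the grammar, parseAddress) is a hypothetical illustration of the idea, not a proposal for Duckling's API:

```haskell
import Data.Char (isAlpha, isDigit)
import Text.ParserCombinators.ReadP

-- House numbers come as singles ("150") or spans ("1-25").
data Range = Single Int | Span Int Int deriving (Eq, Show)

number :: ReadP Int
number = read <$> munch1 isDigit

range :: ReadP Range
range = (Span <$> number <*> (char '-' *> number)) <++ (Single <$> number)

-- One or more ranges separated by ", " or " and ".
ranges :: ReadP [Range]
ranges = do
  r  <- range
  rs <- many ((string ", " +++ string " and ") *> range)
  return (r : rs)

street :: ReadP String
street = unwords <$> sepBy1 (munch1 isAlpha) (char ' ')

address :: ReadP ([Range], String)
address = (,) <$> ranges <*> (char ' ' *> street)

-- Keep only parses that consume the whole input.
parseAddress :: String -> Maybe ([Range], String)
parseAddress s = case [r | (r, "") <- readP_to_S address s] of
  (r : _) -> Just r
  []      -> Nothing
```

On "1-25, 30-40 and 150 Collins St" this recovers all three ranges plus the street name, which a single regex handles far less gracefully.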

What naïve code won't do is resolve St to Street or Saint according to local context. libpostal sorta does that by returning all alternatives. Duckling, as I understand it, implements a probabilistic parser, so it could potentially solve both problems.

@patapizza
Contributor

@dmvianna I'm open to look at a prototype for US postal addresses to see if we can get something useful even without the resolution part.
