
Support for parser combinators #27

Open
axman6 opened this issue May 24, 2017 · 7 comments


axman6 commented May 24, 2017

Many of the regex examples have reached the point where a parser combinator library would be a much better option. A prime example is the URL matcher, which could be defined precisely with a parser combinator but is currently fairly ad hoc and loses a lot of information (the path doesn't work for URLs that contain usernames and passwords, something users might want to match on to forbid or warn people who post URLs they shouldn't):

ruleURL :: Rule
ruleURL = Rule
  { name = "url"
  , pattern =
    [ regex "((([a-zA-Z]+)://)?(w{2,3}[0-9]*\\.)?(([\\w_-]+\\.)+[a-z]{2,4})(:(\\d+))?(/[^?\\s#]*)?(\\?[^\\s#]+)?)"
    ]
  , prod = \tokens -> case tokens of
      (Token RegexMatch (GroupMatch (m:_:_protocol:_:domain:_:_:_port:_path:_query:_)):
       _) -> Just . Token Url $ url m domain
      _ -> Nothing
  }

(For this specific example, the Network.URI package already provides parseURI :: String -> Maybe URI)
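As a sketch of that idea (using only functions network-uri actually exports; urlFields is a hypothetical helper, not Duckling code), a rule's production could hand the coarse regex match to parseURI and read structured fields back, including the userinfo the current regex drops:

```haskell
import Network.URI (URIAuth (..), parseURI, uriAuthority, uriScheme)

-- Hypothetical helper: validate a candidate match with parseURI and
-- pull out structured pieces instead of positional capture groups.
-- Note: uriScheme keeps its trailing ':' and uriUserInfo its trailing '@'.
urlFields :: String -> Maybe (String, String, String)
urlFields s = do
  uri  <- parseURI s
  auth <- uriAuthority uri
  return (uriScheme uri, uriUserInfo auth, uriRegName auth)
```

Anything parseURI rejects would simply produce no token, which is exactly the false-positive filtering a coarse regex needs.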

I don't have an implementation for this yet (nor a preference for a combinator library) because I don't fully understand how Duckling fits together, and I wanted to open this to start a discussion about it.


niteria commented May 25, 2017

Hi @axman6,

You bring up an interesting question. I've been thinking about this and I have some ideas, but I will need a bit of time to express them in more detail.

The TL;DR is:

  1. Using Regexps in Duckling sets the barrier to contribution quite low. Duckling has benefited a lot from this, first in the Clojure incarnation and now as a Haskell project: most of the rules were contributed.

  2. There are some optimizations we can do with Regexps that we cannot with parser combinators. I have a change in progress that compiles a giant NFA from all the heads of rules, which can filter out rules that have no chance of applying. On the test cases I've tried, it halves the cost of Duckling. This change essentially keeps the cost of Duckling from growing with each added rule.

  3. There's room for combinator-like constructs in Duckling, but probably not via a stock combinator library. Using a Free Applicative, we could make the rules expressible as:

{-# LANGUAGE ApplicativeDo #-}
ruleURL :: Rule
ruleURL = mkRule "url" $ do
  m:_:_protocol:_:domain:_:_:_port:_path:_query:_ <- regex 
    "((([a-zA-Z]+)://)?(w{2,3}[0-9]*\\.)?(([\\w_-]+\\.)+[a-z]{2,4})(:(\\d+))?(/[^?\\s#]*)?(\\?[^\\s#]+)?)"
  return $ Token Url $ url m domain
  4. The shortcomings of parsing URLs today are orthogonal to the choice of underlying parsing library. We could make it work with Regexps. Or we can have a Regexp that overapproximates and use parseURI :: String -> Maybe URI to filter out false positives.
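To make point 3 concrete, here's a minimal self-contained sketch of the free-applicative encoding (the real thing would use Control.Applicative.Free from the free package; RulePrim, regex, and heads here are hypothetical illustrations):

```haskell
{-# LANGUAGE GADTs #-}

-- A minimal free applicative (stand-in for Control.Applicative.Free).
data Ap f a where
  Pure :: a -> Ap f a
  Ap   :: f a -> Ap f (a -> b) -> Ap f b

instance Functor (Ap f) where
  fmap g (Pure a) = Pure (g a)
  fmap g (Ap x k) = Ap x (fmap (g .) k)

instance Applicative (Ap f) where
  pure = Pure
  Pure g <*> y = fmap g y
  Ap x k <*> y = Ap x (flip <$> k <*> y)

-- One hypothetical primitive: match a regex, yielding its capture groups.
data RulePrim a where
  Regex :: String -> RulePrim [String]

regex :: String -> Ap RulePrim [String]
regex p = Ap (Regex p) (Pure id)

-- Because the structure is static, an interpreter can inspect a rule's
-- regex heads without running it -- exactly what an NFA prefilter needs.
heads :: Ap RulePrim a -> [String]
heads (Pure _)         = []
heads (Ap (Regex p) k) = p : heads k
```

This is why a free applicative (rather than an arbitrary monadic combinator library) would preserve the optimization in point 2: the regex heads remain statically inspectable.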

I hope to elaborate on 2) and 3) in the future; this is just a sketch.

Anyways, thanks for thinking about this; if you have ideas, we're happy to hear them.


s4s0l commented May 29, 2017

what is NFA?


niteria commented May 29, 2017

NFA - https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton
Compiling regexps down to NFAs gives predictable polynomial performance.
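As a toy illustration of why that is (hypothetical types, not Duckling's actual in-progress change), a set-of-states NFA simulation processes each input character once against the current state set, so its cost is bounded by input length times state count, unlike backtracking regex engines, which can be exponential:

```haskell
import Data.List (nub)

-- A toy NFA over Int states and Char input (epsilon moves omitted).
data NFA = NFA
  { start  :: [Int]
  , accept :: [Int]
  , step   :: Int -> Char -> [Int]
  }

-- Set-of-states simulation: advance every live state on each character.
-- Cost is O(input length * number of states), never exponential.
accepts :: NFA -> String -> Bool
accepts nfa = any (`elem` accept nfa) . foldl move (start nfa)
  where move states c = nub [s' | s <- states, s' <- step nfa s c]

-- Example NFA for the toy language a*b.
astarb :: NFA
astarb = NFA { start = [0], accept = [1], step = delta }
  where
    delta 0 'a' = [0]
    delta 0 'b' = [1]
    delta _ _   = []
```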


dmvianna commented Dec 14, 2017

Would you consider supporting both regexes and parser combinators? I take your point for the simple cases of URLs and emails. However, this machinery and contribution model could easily encompass, say, street addresses. I see many parallels between Duckling's and libpostal's architectures, but I find it unlikely we could achieve a sufficiently good street address parser using regexes, even with community support.

This might be a non-goal, but it is a worthy one.

@patapizza
Contributor

@dmvianna I wouldn't discard the idea, though the added value should outweigh the complexity cost.
You're right about using regexes for street addresses; it doesn't work properly. At this point postal addresses are out of scope, as they require resolution against an external data source.

It seems like libpostal is a good candidate for that, so I wouldn't jump into reinventing the wheel. What are its limitations that Duckling could overcome?


dmvianna commented Dec 21, 2017

Recursion. As far as I understand, libpostal is not a good candidate for parsing free text. For example, say I want to parse street addresses out of patent abstracts (I have some toy code and data here), or other cases I routinely find at work, such as 1-25, 30-40 and 150 Collins St or even 2 Fictional Rd and 3B Epic Av. libpostal simply won't parse multiple addresses in one string properly. And guess what, even regexes can parse common cases of these examples; parser combinators can improve on that.
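For what it's worth, even base's Text.ParserCombinators.ReadP can sketch the number-range case. Everything below (Range, the grammar, parseAddress) is a hypothetical illustration of the idea, not a proposal for Duckling's API:

```haskell
import Data.Char (isAlpha, isDigit)
import Text.ParserCombinators.ReadP

-- House numbers come as singles ("150") or spans ("1-25").
data Range = Single Int | Span Int Int deriving (Eq, Show)

number :: ReadP Int
number = read <$> munch1 isDigit

range :: ReadP Range
range = (Span <$> number <*> (char '-' *> number)) <++ (Single <$> number)

-- One or more ranges separated by ", " or " and ".
ranges :: ReadP [Range]
ranges = do
  r  <- range
  rs <- many ((string ", " +++ string " and ") *> range)
  return (r : rs)

street :: ReadP String
street = unwords <$> sepBy1 (munch1 isAlpha) (char ' ')

address :: ReadP ([Range], String)
address = (,) <$> ranges <*> (char ' ' *> street)

-- Keep only parses that consume the whole input.
parseAddress :: String -> Maybe ([Range], String)
parseAddress s = case [r | (r, "") <- readP_to_S address s] of
  (r : _) -> Just r
  []      -> Nothing
```

On "1-25, 30-40 and 150 Collins St" this recovers all three ranges plus the street name, which a single regex handles far less gracefully.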

What naïve code won't do is resolve St to Street or Saint according to local context. libpostal sorta does that by returning all alternatives. Duckling, as I understand it, implements a probabilistic parser, so it could potentially solve both problems.

@patapizza
Contributor

@dmvianna I'm open to look at a prototype for US postal addresses to see if we can get something useful even without the resolution part.
