Skip to content

Camlp5-compatible PPX rewriters for Perl-ish idiomatic regexps

License

Notifications You must be signed in to change notification settings

camlp5/pa_ppx_regexp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pa_ppx_regexp: version 0.01: a PPX Rewriter for Perl-ish regexp operations

OCaml has support for regular expressions using a number of different packages: str, pcre, and re. These all present a functional interface with a number of different verbs and data-types, and any sort of nontrivial pattern-matching or match-and-replace involves significant code that can fairly be described as boilerplate. This code is a pain to read/understand and to maintain as requirements change.

By contrast, in Perl regular expressions are integrated into the language, in such a way that matching, extracting substrings, and match-and-replace are compactly expressed with little superfluous boilerplate. The result is that, once one has a grasp of the syntax of regular expression integration into the language, the use of regexps in Perl is transparent in a way that is difficult-to-match in other languages.

This package is a first attempt to bring that level of transparency to OCaml. This package supports:

  • Perl-syntax regexps via both ocaml-re and pcre. The default is to use ocaml-re.

  • regexp matching ([%match ....])

  • splitting with regexps ([%split ....])

  • pattern-substitution with the result of regexp matching or OCaml string-typed variables ([%pattern ...])

  • match-and-replace either once or globally ([%subst ...])

All of these support transparent syntax for "modifiers" in the style of Perl’s modifiers, and a few extra ones for OCaml specifically.

NOTE WELL that sometimes ocaml-re 's Perl mode differs from pcre; that is why both are supported. In the examples, I’ll show usage of both, and will point out when results differ.

To introduce the capabilities of this package, I’ll first provide examples, and then explain the syntax and capabilities of the system. As always, the tests are an excellent place to look usage hints.

Installation

To install, it should suffice to opam install pa_ppx_regexp; from source, make all install in the top-level directory should do the trick.

Invocation

From the OCaml compiler

To invoke the PPX rewriter for a source file, the package pa_ppx_regexp must be specified, viz.

ocamlfind ocamlc -package pa_ppx_regexp -c -syntax camlp5o ppx_regexp.ml

and similarly when linking (there is a small runtime support package required

ocamlfind ocamlc -package pa_ppx_regexp -linkpkg -linkall -o ppx_regexp.exe ppx_regexp.cmo

From the OCaml toplevel

# #use "topfind.camlp5";;
- : unit = ()
Findlib has been successfully loaded. Additional directives:
  #require "package";;      to load a package
  #list;;                   to list the available packages
  #camlp4o;;                to load camlp4 (standard syntax)
  #camlp4r;;                to load camlp4 (revised syntax)
  #predicates "p,q,...";;   to set these predicates
  Topfind.reset();;         to force that packages will be reloaded
  #thread;;                 to enable threads

- : unit = ()
Additional Camlp5 directives:
  #camlp5o;;                to load camlp5 (standard syntax)
  #camlp5r;;                to load camlp5 (revised syntax)

- : unit = ()
# #camlp5o ;;
: added to search path
: added to search path
: loaded
: added to search path
: loaded
# #require "pa_ppx_regexp" ;;
# let exc_converter = function
    Exit as exc ->
    let s = Printexc.to_string exc in
    Some (Location.error s)
  | _ -> None
val exc_converter : exn -> Location.report option = <fun>
# Location.register_error_of_exn exc_converter ;;
- : unit = ()

Examples

A match regexp with capture groups (like Perl’s m/.../)

In Perl we can break the "shebang" line in $x ("#!/foo/bar/buzz abc def") into a directory and the filename in that directory, which yields $& (the entire matched substring), $1 and $2, the directory and filename.

$ perl -MData::Dumper -e '$x = "#!/foo/bar/buzz  abc  def"; \
> $x =~ m,^#!(\S+)/([^/\s]+),; \
> print Dumper([$&, $1, $2]);'
$VAR1 = [
          '#!/foo/bar/buzz',
          '/foo/bar',
          'buzz'
        ];

In OCaml we can write:

# [%match {|^#!(\S+)/([^/\s]+)|}] ;;
- : string -> (string * string option * string option) option = <fun>
# [%match {|^#!(\S+)/([^/\s]+)|}] "#!/foo/bar/buzz  abc  def";;
- : (string * string option * string option) option =
Some ("#!/foo/bar/buzz", Some "/foo/bar", Some "buzz")
# [%match {|^#!(\S+)/([^/\s]+)|}/re_perl] "#!/foo/bar/buzz  abc  def";;
- : (string * string option * string option) option =
Some ("#!/foo/bar/buzz", Some "/foo/bar", Some "buzz")
# [%match {|^#!(\S+)/([^/\s]+)|}/pcre] "#!/foo/bar/buzz  abc  def";;
- : (string * string option * string option) option =
Some ("#!/foo/bar/buzz", Some "/foo/bar", Some "buzz")

NOTE that the default is re_perl. Hereinafter examples will all be with both re_perl and pcre.

Splitting a string with a regexp (like Perl’s split(/.../,...))

In Perl we can split a string with a regexp:

$ perl -MData::Dumper -e '$x = "abcdaceabc" ; \
> @l = split(/ab?c/, $x); \
> print Dumper([@l]);'
$VAR1 = [
          '',
          'd',
          'e'
        ];

In OCaml we can write:

# [%split {|ab?c|}/re_perl] "abcdaceabc" ;;
- : string list = ["d"; "e"]
# [%split {|ab?c|}/pcre] "abcdaceabc" ;;
- : string list = [""; "d"; "e"]

NOTE WELL: there’s a discrepancy here between pcre and ocaml-re.

In Perl we can also use capture-groups with a split:

$ perl -MData::Dumper -e '$x = "abcdaceabc" ; \
> @l = split(/a(b)?c/, $x); \
> print Dumper([@l]);'
$VAR1 = [
          '',
          'b',
          'd',
          undef,
          'e',
          'b'
        ];

In OCaml, we can write

# [%split {|a(b)?c|} / strings re_perl] "abcdaceabc" ;;
- : [> `Delim of string * string option | `Text of string ] list =
[`Delim ("abc", Some "b"); `Text "d"; `Delim ("ac", None); `Text "e";
 `Delim ("abc", Some "b")]
# [%split {|a(b)?c|} / strings pcre] "abcdaceabc" ;;
- : [> `Delim of string * string option | `Text of string ] list =
[`Delim ("abc", Some "b"); `Text "d"; `Delim ("ac", None); `Text "e";
 `Delim ("abc", Some "b")]

This is much more complicated, so let’s walk thru it:

  • first, the delimiter, "abc" (the matched string), with the (matched) capture-group "b".

  • then the text "d"

  • then the delimiter "ac" with an unmatched capture-group.

  • then the text "e"

  • then the delimiter "abc" again, with the matched capture group "b".

This is a lot of work, when we might not want it all, so there’s a way of limiting the amount of extracted substrings, that we’ll come to later.

NOTE the "strings" above. We’ll come to this later on.

match-and-replace with a regexp/pattern (like Perl’s s/.../.../)

In Perl we can match-and-replace a regexp with a string substitution pattern (expression patterns are right after):

$ perl -MData::Dumper -e '$x = "abc\nabc"; \
> $x =~ s,a(bc),<<$1>>,; \
> print Dumper($x);'
$VAR1 = '<<bc>>
abc';

or (to refer to local Perl variables)

$ perl -MData::Dumper -e '$lhs = "<<" ; $rhs = ">>" ; $x = "abc\nabc"; \
> $x =~ s,a(bc),${lhs}$1${rhs},; \
> print Dumper($x);'
$VAR1 = '<<bc>>
abc';

In OCaml we can do the same:

# [%subst {|a(bc)|} / {|<<$1>>|}/re_perl] "abc\nabc" ;;
- : string = "<<bc>>\nabc"
# [%subst {|a(bc)|} / {|<<$1>>|}/pcre] "abc\nabc" ;;
- : string = "<<bc>>\nabc"

or (to refer to local OCaml variables)

# let lhs = "<<" and rhs = ">>" in [%subst {|a(bc)|} / {|${lhs}$1${rhs}|}/re_perl] "abc\nabc" ;;
- : string = "<<bc>>\nabc"
# let lhs = "<<" and rhs = ">>" in [%subst {|a(bc)|} / {|${lhs}$1${rhs}|}/pcre] "abc\nabc" ;;
- : string = "<<bc>>\nabc"

In Perl instead of a string pattern for the right-hand-side of the substition, we can use a Perl expression to compute the substitution, in which special variables are be used to access the capture-groups (NOTE: look for e in the modifiers):

$lhs = "<<" ; $rhs = ">>" ;
$x = "abc\nabc"; $x =~ s,a(bc),$lhs . $1 . $rhs,e;

and likewise in OCaml:

let lhs = "<<" ;;
let rhs = ">>" ;;
[%subst {|a(bc)|} / {|lhs ^ $1$ ^ rhs|} / e re_perl] "abc\nabc"
[%subst {|a(bc)|} / {|lhs ^ $1$ ^ rhs|} / e pcre] "abc\nabc"

NOTE the difference in the way that capture-groups are named in the pattern vs. in the expression. This is due to the need to conform to Camlp5 antiquotation syntax. AND NOTE again the presence of e in the modifiers for "expression patterns".

patterns

Implicit in Perl’s s/re/pat/ match-and-replace operation is the idea of a pattern. Such a pattern can be either a string with antiquotations for variables and capture-groups, or a Perl expression with antiquotations for capture-groups (since expressions already include variables). So in OCaml we have a type of "pattern" for this, and we’ve already seen both kinds just above.

First there are strings with antiquotations for variables and capture-groups:

# [%pattern {|<<$1>>|}/re_perl] ;;
- : Re.substrings -> string = <fun>
# [%pattern {|<<$1>>|}/pcre] ;;
- : Pcre.substrings -> string = <fun>

or

# fun lhs rhs -> [%pattern {|${lhs}$1${rhs}|}/re_perl] ;;
- : string -> string -> Re.substrings -> string = <fun>
# fun lhs rhs -> [%pattern {|${lhs}$1${rhs}|}/pcre] ;;
- : string -> string -> Pcre.substrings -> string = <fun>

and also an expression with antiquotations for capture-groups:

# fun lhs rhs -> [%pattern {|lhs ^ $1$ ^ rhs|} / e re_perl] ;;
- : string -> string -> Re.substrings -> string = <fun>
# fun lhs rhs -> [%pattern {|lhs ^ $1$ ^ rhs|} / e pcre] ;;
- : string -> string -> Pcre.substrings -> string = <fun>

NOTE that just as in Perl s///, to indicate that the pattern is an expression, we use the modifier e. Also note that the type for substrings is different when using re_perl from when using pcre.

In a string pattern, antiquotations are either ${varname} or (for capture groups) $N (or ${N}) (where N is an integer constant). In an expression variables are already expressible, and capture groups are expressed as $N$ (where N is an integer constant).

A pattern that doesn’t have any capture-groups has type string; a pattern that does have capture-groups has type Re.substrings -> string (or Pcre.substrings -> string) (since those capture-groups will have to be taken from some already-matched regexp, and a matched regexp produces a Re.substrings (or Pcre.substrings)).

The high-level syntax of these extensions

The extensions all have common syntax aspects. Extensions look like:

  • [%match *regexp*] or [%match *regexp* / *modifiers*]

  • [%split *regexp*] or [%split *regexp* / *modifiers*]

  • [%pattern *pattern*] or [%pattern *pattern* / *modifiers*]

  • [%subst *regexp* / *pattern*] or [%subst *regexp* / *pattern* / *modifiers*]

There are five kinds of modifiers, and different kinds are allowed for different extensions:

  • choice of which regexp backend: allowed for all extensions, and default re_perl

    • re_perl: the ocaml-re backend, using Re.Perl

    • pcre: the pcre backend

  • regexp compile-time modifiers: allowed for match, split, subst

    • i: case-insensitive regexp

    • s: treat string being matched as a single line (like Perl /s)

    • m: treat string being matched as multiple lines (like Perl /m)

m and s are mutually-exclusive

  • regexp output modifiers: allowed for match, split

    • exc: raise Not_found if the regexp does not match or mandatory capture-groups did not match.

    • raw: return a Re.substrings

    • strings: return a tuple of string option for each capture-group

    • pred: return a boolean for whether or not the regexp successfully matched

raw, strings, and pred are all mutually-exclusive. `pred and exc are mutually-exclusive. Also, strings can take parameters, which are explained below.

  • pattern modifiers: allowed for pattern and subst

    • e: the pattern is an OCaml expression, not a string

  • substitution modifiers: allowed for subst

    • g: apply the substitution to every occurrence of the regexp, not just the first one

The raw and strings modifiers

A regexp, when applied to some input string, can match, or fail to match. The most primitive result it can produce is a Re.substrings, which holds the substrings of the input that matched the capture groups of the regexp. So the result type of a regexp match should be Re.substrings option

With the exc modifier (which causes Not_found to be raised on match failure), this becomes Re.substrings.

To get these result types, we use the modifier raw. But a Re.substrings is a complex object and we might want something more transparent. A natural thing to want, is a tuple of all the capture-groups. So let’s consider a regexp: (a)?(b)(c)?. This regexp has four capture groups:

  • 0: the entire matched substring

  • 1: the substring that matches (a)

  • 2: the substring that matches (b)

  • 3: the substring that matches (c)

If the regexp matches the string input, capture group 0 will be non-null. But capture groups 1,3 can be null even if the regexp matches the string input. Capture group 2 must match if the string matches, but let’s ignore that for now. The type of the regexp is

# [%match {|(a)?(b)(c)?|}/re_perl] ;;
- : string -> (string * string option * string option * string option) option
= <fun>
# [%match {|(a)?(b)(c)?|}/pcre] ;;
- : string -> (string * string option * string option * string option) option
= <fun>

since

  • it could fail to match (outermost option)

  • each of the capture groups 1, 2, 3 could fail (other option types)

If we’d prefer to have an exception (Not_found) on unsuccessful match, the exc modifier will do that for us:

# [%match {|(a)?(b)(c)?|} / exc re_perl] ;;
- : string -> string * string option * string option * string option = <fun>
# [%match {|(a)?(b)(c)?|} / exc pcre] ;;
- : string -> string * string option * string option * string option = <fun>

Perhaps we’d like only the second capture group:

# [%match {|(a)?(b)(c)?|} / exc strings 2 re_perl] ;;
- : string -> string option = <fun>
# [%match {|(a)?(b)(c)?|} / exc strings 2 pcre] ;;
- : string -> string option = <fun>

And since in the regexp that capture group must match for the entire regexp to match, we might want to dispense with the option:

# [%match {|(a)?(b)(c)?|} / exc strings !2 re_perl] ;;
- : string -> string = <fun>
# [%match {|(a)?(b)(c)?|} / exc strings !2 pcre] ;;
- : string -> string = <fun>

Examples of Errors

A match regexp with an unrecognized modifier

# [%match {|(a)?|} / foo];;
File "_none_", line 1, characters 19-22:
Failure: extract_options: malformed option: <<(MLast.ExLid (<loc>,
                                        (Pp_MLast.Ploc.VaVal "foo")))>>
Line 1, characters 0-0:
Error: Stdlib.Exit

A match regexp specifying more than one regexp syntax

# [%match {|(a)?|} / re_perl pcre];;
File "_none_", line 1, characters 0-32:
Failure: match extension: can specify at most one of <<re>>, <<pcre>>: re_perl pcre
Line 1, characters 0-0:
Error: Stdlib.Exit

About

Camlp5-compatible PPX rewriters for Perl-ish idiomatic regexps

Resources

License

Stars

Watchers

Forks

Packages

No packages published