OCaml has support for regular expressions using a number of different
packages: str
, pcre
, and re
. These all present a functional
interface with a number of different verbs and data-types, and any
sort of nontrivial pattern-matching or match-and-replace involves
significant code that can fairly be described as boilerplate. This
code is a pain to read/understand and to maintain as requirements
change.
By contrast, in Perl regular expressions are integrated into the language, in such a way that matching, extracting substrings, and match-and-replace are compactly expressed with little superfluous boilerplate. The result is that, once one has a grasp of the syntax of regular expression integration into the language, the use of regexps in Perl is transparent in a way that is difficult-to-match in other languages.
This package is a first attempt to bring that level of transparency to OCaml. This package supports:
-
Perl-syntax regexps via both
ocaml-re
andpcre
. The default is to useocaml-re
. -
regexp matching (
[%match ....]
) -
splitting with regexps (
[%split ....]
) -
pattern-substitution with the result of regexp matching or OCaml string-typed variables (
[%pattern ...]
) -
match-and-replace either once or globally (
[%subst ...]
)
All of these support transparent syntax for "modifiers" in the style of Perl’s modifiers, and a few extra ones for OCaml specifically.
NOTE WELL that sometimes ocaml-re
's Perl mode differs from pcre
;
that is why both are supported. In the examples, I’ll show usage of
both, and will point out when results differ.
To introduce the capabilities of this package, I’ll first provide examples, and then explain the syntax and capabilities of the system. As always, the tests are an excellent place to look usage hints.
To install, it should suffice to opam install pa_ppx_regexp
; from
source, make all install
in the top-level directory should do the
trick.
To invoke the PPX rewriter for a source file, the package
pa_ppx_regexp
must be specified, viz.
ocamlfind ocamlc -package pa_ppx_regexp -c -syntax camlp5o ppx_regexp.ml
and similarly when linking (there is a small runtime support package required
ocamlfind ocamlc -package pa_ppx_regexp -linkpkg -linkall -o ppx_regexp.exe ppx_regexp.cmo
# #use "topfind.camlp5";;
- : unit = ()
Findlib has been successfully loaded. Additional directives:
#require "package";; to load a package
#list;; to list the available packages
#camlp4o;; to load camlp4 (standard syntax)
#camlp4r;; to load camlp4 (revised syntax)
#predicates "p,q,...";; to set these predicates
Topfind.reset();; to force that packages will be reloaded
#thread;; to enable threads
- : unit = ()
Additional Camlp5 directives:
#camlp5o;; to load camlp5 (standard syntax)
#camlp5r;; to load camlp5 (revised syntax)
- : unit = ()
# #camlp5o ;;
: added to search path
: added to search path
: loaded
: added to search path
: loaded
# #require "pa_ppx_regexp" ;;
# let exc_converter = function
Exit as exc ->
let s = Printexc.to_string exc in
Some (Location.error s)
| _ -> None
val exc_converter : exn -> Location.report option = <fun>
# Location.register_error_of_exn exc_converter ;;
- : unit = ()
In Perl we can break the "shebang" line in $x
("#!/foo/bar/buzz abc def"
)
into a directory and the filename in that directory, which yields $&
(the
entire matched substring), $1
and $2
, the directory and filename.
$ perl -MData::Dumper -e '$x = "#!/foo/bar/buzz abc def"; \
> $x =~ m,^#!(\S+)/([^/\s]+),; \
> print Dumper([$&, $1, $2]);'
$VAR1 = [
'#!/foo/bar/buzz',
'/foo/bar',
'buzz'
];
In OCaml we can write:
# [%match {|^#!(\S+)/([^/\s]+)|}] ;;
- : string -> (string * string option * string option) option = <fun>
# [%match {|^#!(\S+)/([^/\s]+)|}] "#!/foo/bar/buzz abc def";;
- : (string * string option * string option) option =
Some ("#!/foo/bar/buzz", Some "/foo/bar", Some "buzz")
# [%match {|^#!(\S+)/([^/\s]+)|}/re_perl] "#!/foo/bar/buzz abc def";;
- : (string * string option * string option) option =
Some ("#!/foo/bar/buzz", Some "/foo/bar", Some "buzz")
# [%match {|^#!(\S+)/([^/\s]+)|}/pcre] "#!/foo/bar/buzz abc def";;
- : (string * string option * string option) option =
Some ("#!/foo/bar/buzz", Some "/foo/bar", Some "buzz")
NOTE that the default is re_perl
. Hereinafter examples will all be
with both re_perl
and pcre
.
In Perl we can split a string with a regexp:
$ perl -MData::Dumper -e '$x = "abcdaceabc" ; \
> @l = split(/ab?c/, $x); \
> print Dumper([@l]);'
$VAR1 = [
'',
'd',
'e'
];
In OCaml we can write:
# [%split {|ab?c|}/re_perl] "abcdaceabc" ;;
- : string list = ["d"; "e"]
# [%split {|ab?c|}/pcre] "abcdaceabc" ;;
- : string list = [""; "d"; "e"]
NOTE WELL: there’s a discrepancy here between pcre
and ocaml-re
.
In Perl we can also use capture-groups with a split:
$ perl -MData::Dumper -e '$x = "abcdaceabc" ; \
> @l = split(/a(b)?c/, $x); \
> print Dumper([@l]);'
$VAR1 = [
'',
'b',
'd',
undef,
'e',
'b'
];
In OCaml, we can write
# [%split {|a(b)?c|} / strings re_perl] "abcdaceabc" ;;
- : [> `Delim of string * string option | `Text of string ] list =
[`Delim ("abc", Some "b"); `Text "d"; `Delim ("ac", None); `Text "e";
`Delim ("abc", Some "b")]
# [%split {|a(b)?c|} / strings pcre] "abcdaceabc" ;;
- : [> `Delim of string * string option | `Text of string ] list =
[`Delim ("abc", Some "b"); `Text "d"; `Delim ("ac", None); `Text "e";
`Delim ("abc", Some "b")]
This is much more complicated, so let’s walk thru it:
-
first, the delimiter, "abc" (the matched string), with the (matched) capture-group "b".
-
then the text "d"
-
then the delimiter "ac" with an unmatched capture-group.
-
then the text "e"
-
then the delimiter "abc" again, with the matched capture group "b".
This is a lot of work, when we might not want it all, so there’s a way of limiting the amount of extracted substrings, that we’ll come to later.
NOTE the "strings" above. We’ll come to this later on.
In Perl we can match-and-replace a regexp with a string substitution pattern (expression patterns are right after):
$ perl -MData::Dumper -e '$x = "abc\nabc"; \
> $x =~ s,a(bc),<<$1>>,; \
> print Dumper($x);'
$VAR1 = '<<bc>>
abc';
or (to refer to local Perl variables)
$ perl -MData::Dumper -e '$lhs = "<<" ; $rhs = ">>" ; $x = "abc\nabc"; \
> $x =~ s,a(bc),${lhs}$1${rhs},; \
> print Dumper($x);'
$VAR1 = '<<bc>>
abc';
In OCaml we can do the same:
# [%subst {|a(bc)|} / {|<<$1>>|}/re_perl] "abc\nabc" ;;
- : string = "<<bc>>\nabc"
# [%subst {|a(bc)|} / {|<<$1>>|}/pcre] "abc\nabc" ;;
- : string = "<<bc>>\nabc"
or (to refer to local OCaml variables)
# let lhs = "<<" and rhs = ">>" in [%subst {|a(bc)|} / {|${lhs}$1${rhs}|}/re_perl] "abc\nabc" ;;
- : string = "<<bc>>\nabc"
# let lhs = "<<" and rhs = ">>" in [%subst {|a(bc)|} / {|${lhs}$1${rhs}|}/pcre] "abc\nabc" ;;
- : string = "<<bc>>\nabc"
In Perl instead of a string pattern for the right-hand-side of the
substition, we can use a Perl expression to compute the
substitution, in which special variables are be used to access the
capture-groups (NOTE: look for e
in the modifiers):
$lhs = "<<" ; $rhs = ">>" ;
$x = "abc\nabc"; $x =~ s,a(bc),$lhs . $1 . $rhs,e;
and likewise in OCaml:
let lhs = "<<" ;;
let rhs = ">>" ;;
[%subst {|a(bc)|} / {|lhs ^ $1$ ^ rhs|} / e re_perl] "abc\nabc"
[%subst {|a(bc)|} / {|lhs ^ $1$ ^ rhs|} / e pcre] "abc\nabc"
NOTE the difference in the way that capture-groups are named in the
pattern vs. in the expression. This is due to the need to conform to
Camlp5 antiquotation syntax. AND NOTE again the presence of e
in
the modifiers for "expression patterns".
Implicit in Perl’s s/re/pat/
match-and-replace operation is the idea
of a pattern. Such a pattern can be either a string with
antiquotations for variables and capture-groups, or a Perl expression
with antiquotations for capture-groups (since expressions already
include variables). So in OCaml we have a type of "pattern" for this,
and we’ve already seen both kinds just above.
First there are strings with antiquotations for variables and capture-groups:
# [%pattern {|<<$1>>|}/re_perl] ;;
- : Re.substrings -> string = <fun>
# [%pattern {|<<$1>>|}/pcre] ;;
- : Pcre.substrings -> string = <fun>
or
# fun lhs rhs -> [%pattern {|${lhs}$1${rhs}|}/re_perl] ;;
- : string -> string -> Re.substrings -> string = <fun>
# fun lhs rhs -> [%pattern {|${lhs}$1${rhs}|}/pcre] ;;
- : string -> string -> Pcre.substrings -> string = <fun>
and also an expression with antiquotations for capture-groups:
# fun lhs rhs -> [%pattern {|lhs ^ $1$ ^ rhs|} / e re_perl] ;;
- : string -> string -> Re.substrings -> string = <fun>
# fun lhs rhs -> [%pattern {|lhs ^ $1$ ^ rhs|} / e pcre] ;;
- : string -> string -> Pcre.substrings -> string = <fun>
NOTE that just as in Perl s///
, to indicate that the pattern is an
expression, we use the modifier e
. Also note that the type for
substrings
is different when using re_perl
from when using pcre
.
In a string pattern, antiquotations are either ${varname}
or (for
capture groups) $N
(or ${N}
) (where N
is an integer constant).
In an expression variables are already expressible, and capture groups
are expressed as $N$
(where N
is an integer constant).
A pattern that doesn’t have any capture-groups has type string
; a
pattern that does have capture-groups has type Re.substrings -> string
(or Pcre.substrings -> string
) (since those capture-groups
will have to be taken from some already-matched regexp, and a matched
regexp produces a Re.substrings
(or Pcre.substrings
)).
The extensions all have common syntax aspects. Extensions look like:
-
[%match *regexp*]
or[%match *regexp* / *modifiers*]
-
[%split *regexp*]
or[%split *regexp* / *modifiers*]
-
[%pattern *pattern*]
or[%pattern *pattern* / *modifiers*]
-
[%subst *regexp* / *pattern*]
or[%subst *regexp* / *pattern* / *modifiers*]
There are five kinds of modifiers, and different kinds are allowed for different extensions:
-
choice of which regexp backend: allowed for all extensions, and default
re_perl
-
re_perl
: theocaml-re
backend, usingRe.Perl
-
pcre
: thepcre
backend
-
-
regexp compile-time modifiers: allowed for
match
,split
,subst
-
i
: case-insensitive regexp -
s
: treat string being matched as a single line (like Perl/s
) -
m
: treat string being matched as multiple lines (like Perl/m
)
-
m
and s
are mutually-exclusive
-
regexp output modifiers: allowed for
match
,split
-
exc
: raiseNot_found
if the regexp does not match or mandatory capture-groups did not match. -
raw
: return aRe.substrings
-
strings
: return a tuple ofstring option
for each capture-group -
pred
: return a boolean for whether or not the regexp successfully matched
-
raw
, strings
, and pred are all mutually-exclusive. `pred
and
exc
are mutually-exclusive. Also, strings
can take parameters,
which are explained below.
-
pattern modifiers: allowed for
pattern
andsubst
-
e
: the pattern is an OCaml expression, not a string
-
-
substitution modifiers: allowed for
subst
-
g
: apply the substitution to every occurrence of the regexp, not just the first one
-
A regexp, when applied to some input string, can match, or fail to
match. The most primitive result it can produce is a Re.substrings
,
which holds the substrings of the input that matched the capture
groups of the regexp. So the result type of a regexp match should be
Re.substrings option
With the exc
modifier (which causes Not_found
to be raised on
match failure), this becomes Re.substrings
.
To get these result types, we use the modifier raw
. But a
Re.substrings
is a complex object and we might want something more
transparent. A natural thing to want, is a tuple of all the
capture-groups. So let’s consider a regexp: (a)?(b)(c)?
. This
regexp has four capture groups:
-
0
: the entire matched substring -
1
: the substring that matches(a)
-
2
: the substring that matches(b)
-
3
: the substring that matches(c)
If the regexp matches the string input, capture group 0
will be
non-null. But capture groups 1
,3
can be null even if the regexp
matches the string input. Capture group 2
must match if the string
matches, but let’s ignore that for now. The type of the regexp is
# [%match {|(a)?(b)(c)?|}/re_perl] ;;
- : string -> (string * string option * string option * string option) option
= <fun>
# [%match {|(a)?(b)(c)?|}/pcre] ;;
- : string -> (string * string option * string option * string option) option
= <fun>
since
-
it could fail to match (outermost
option
) -
each of the capture groups
1
,2
,3
could fail (otheroption
types)
If we’d prefer to have an exception (Not_found
) on unsuccessful
match, the exc
modifier will do that for us:
# [%match {|(a)?(b)(c)?|} / exc re_perl] ;;
- : string -> string * string option * string option * string option = <fun>
# [%match {|(a)?(b)(c)?|} / exc pcre] ;;
- : string -> string * string option * string option * string option = <fun>
Perhaps we’d like only the second capture group:
# [%match {|(a)?(b)(c)?|} / exc strings 2 re_perl] ;;
- : string -> string option = <fun>
# [%match {|(a)?(b)(c)?|} / exc strings 2 pcre] ;;
- : string -> string option = <fun>
And since in the regexp that capture group must match for the entire
regexp to match, we might want to dispense with the option
:
# [%match {|(a)?(b)(c)?|} / exc strings !2 re_perl] ;;
- : string -> string = <fun>
# [%match {|(a)?(b)(c)?|} / exc strings !2 pcre] ;;
- : string -> string = <fun>
# [%match {|(a)?|} / foo];;
File "_none_", line 1, characters 19-22:
Failure: extract_options: malformed option: <<(MLast.ExLid (<loc>,
(Pp_MLast.Ploc.VaVal "foo")))>>
Line 1, characters 0-0:
Error: Stdlib.Exit