diff --git a/regular-expression.dd b/regular-expression.dd index f1e3bf0c02..bf82fa6f64 100644 --- a/regular-expression.dd +++ b/regular-expression.dd @@ -5,128 +5,133 @@ $(D_S Regular Expressions, $(SMALL by Dmitry Olshansky, the author of std.regex ) ) $(H3 Introduction) - $(P String processing is a kind of daily routine that most applications do in a one way or another. - It should come as no wonder that many programming languages have standard libraries equipped + $(P String processing is a daily routine that most applications have to deal with in a one way or another. + It should come as no surprize that many programming languages have standard libraries equipped with a variety of specialized functions for common string manipulation needs. The D programming language standard library among others offers a nice assortment in $(STD string), as well as generic functions from $(STD algorithm) that work with strings. Still no amount of fixed functionality could cover all needs, as naturally flexible text data needs flexible solutions. ) - $(P Here is where $(LUCKY regular expressions) – often just called $(I regexes) – come in handy. - Regular expressions are a simple yet comprehensive language for defining string patterns. - Combined with a mechanism for string substitution, regular expressions are a powerful tool for text processing. - In fact, they are considered so useful that a number of languages provide built-in support - for them. However, built-in support does $(I not) necessarily imply $(I faster) processing - or more features, it is rather a way of providing a - convenient and user-friendly syntax for typical operations and usage patterns. + $(P This is where $(LUCKY regular expressions) come in handy, often succinctly called as regexes. + Regexes are simple yet powerful language for defining patterns for sets of strings. + Combined with pattern matching, data extraction and substituition, they form a Swiss Army knife of text processing. + They are considered so important that a number of programming languages provides built-in support + for regular expressions. Being built-in however does $(B not) necessary + implie $(B faster) processing or having more features. It's just a matter of providing + $(B convenient and friendly syntax) for typical operations, and integrating it well ) $(P The D programming language provides a standard library module $(STD regex). Being a highly expressive systems language, D allows regexes to be $(I efficiently) implemented - within the language itself, yet have the same level of readability and usability as built-in version would provide. - We'll see below how close to built-in regexes look and feel we can achieve. + within the language itself, yet have good level of readability and usability. + And there a few things a pure D implementation adds to the table that are completly unbelivable + in a traditional compiled langauge, more on that at the end of article. ) $(P By the end of article you'll have a good understanding of regular expression capabilities in this library, - and how to utilize its API in a most straightforward way. Examples in this article assume that the reader - has fairly good understanding of regex elements, yet it's not required to understand the API. + and how to utilize its API in a most straightforward and efficient way. Examples in this article assume + that the reader has fair understanding of regex elements, but it's not required. ) $(H3 A warm up) $(P How do you check if something is a phone number by looking at it? ) $(P Yes, it's something with numbers, and there may be a country code in front of that... - Ok, going for an international format should make it more strict. As this is the first time, let's + Sticking to an international format should make it more strict. As this is the first time, let's put together a full program: --- import std.stdio, std.regex; void main(string argv[]) { - string phone = argv[1]; // assuming phone is passed as first argument on the command line - if (match(phone, r"^\+[1-9][0-9]* [0-9 ]*$")) + string phone = argv[1]; // assuming phone is passed as the first argument on the command line + if(matchFirst(phone, r"^\+[1-9][0-9]* [0-9 ]*$")) writeln("It looks like a phone number."); else writeln("Nope, it's not a phone number."); } --- - And that's it! Just keep in mind the boundaries of regular expressions power - to truly establish a + And that's it! Let us however keep in mind the boundaries of regular expressions power - to truly establish a validness of a phone number, one has to try dialing it or contact the authority. ) - $(P Now, come to think of it, this tiny sample showed a lot of useful things already: + $(P Let's drill down into this tiny example, actually it showcases a lot of interesting things: $(UL - $(LI a raw string literal of form r"...", that allows writing a regex pattern in its natural notation ) - $(LI $(D match) function, is an universal tool that we'll see in other work scenarios as well. - To check if there was a match on a string, just test the return value explicitly as boolean. ) - $(LI when trying to match special regex characters like +, *, (, ), [, ] and $ don't forget to use escaping with \ ) - $(LI Unless there is a lot of text processing going on, it's OK to pass a plain string as a pattern. + $(LI A raw string literal of form r"...", that allows writing a regex pattern in its natural notation. ) + $(LI $(D matchFirst) function to find first match in a string if any. To check if there was a match just + test the return value explicitly in a boolean context, such as an $(D if) statment. ) + $(LI When matching special regex characters like +, *, (, ), [, ] and $ don't forget to use escaping with backslash(\). ) + $(LI Unless there is a lot of text processing going on, it's prefectly fine to pass a plain string as a pattern. The internal representation used to do the actual work is cached, to speed up subsequent calls. ) ) ) - $(P Continuing with the phone number example, it could be useful to get the exact value of - country code, as well as the whole number. For the sake of experiment let's also explicitly store + $(P Continuing with the phone number example, it would be useful to get the exact value of + the country code, as well as the whole number. For the sake of experiment let's also explicitly obtain compiled regex pattern via $(D regex) to see how it works. ) --- string phone = "+31 650 903 7158"; // fictional, any coincidence is just that auto phoneReg = regex(r"^\+([1-9][0-9]*) [0-9 ]*$"); - auto m = match(phone, phoneReg); - assert(m); - assert(m.captures[0] == "+31 650 903 7158"); - assert(m.captures[1] == "31"); - // you shouldn't really care about the type of regex object, - // but here it is + auto m = matchFirst(phone, phoneReg); + assert(m); // also boolean context - test for non-empty + assert(!m.empty); // same as the line above + assert(m[0] == "+31 650 903 7158"); + assert(m[1] == "31"); + // you shouldn't need the regex object type all too often + // but here it is for the curious static assert(is(typeof(phoneReg) : Regex!char)); --- $(H3 To search and replace) - $(P There is a frequent need to find all matches in piece of text. - Picking an easy task, let's see how to filter out all white space-only lines. - There is no special routine for looping over input like search() or similar as found in some libraries. - Instead $(D std.regex) provides a natural syntax for looping over all matches via plain foreach. + $(P While getting first match is a common theme in sting validation, another frequent need is + to extract all matches found in piece of text. Picking an easy task, let's see how to + filter out all white space-only lines. There is no special routine for looping over input + like search() or similar as found in some libraries. + Instead $(D std.regex) provides a natural syntax for looping via plain foreach. ) --- auto buffer = std.file.readText("regex.d"); - foreach (m; match(buffer, regex(r"^.*[^\p{WhiteSpace}]+.*$","gm"))) + foreach (m; matchAll(buffer, regex(r"^.*[^\p{WhiteSpace}]+.*$","m"))) { - writeln(m.hit); // hit is a shorthand for m.captures[0] + writeln(m.hit); // hit is an alias for m[0] } --- - $(P It just works as the result of match is an input range. - An element type is Captures for the string type being used, it is a random access range of submatches. - I won't go into full detail of the range concept, suffice to say, that it's a primary way of representing sequences of data in D. - Random access here implies one can peek at any element with direct indexing. To have a feeling of how this affects the API: + $(P It may look and feel like a builtin but it just follows the common conventions to do that. + In this case matchAll returns and object that follows the right "protocol" of an input range + simply by having the right set of methods. An input range is a lot like iterator found + in other langauges. Likewise the result of matchFirst and each element of matchAll + is a random access range, a thing that behaves like a "view" of an array. ) --- - auto m = match("Ranges are hot!", r"(\w)\w*(\w)"); // at least 3 "word" symbols - assert(m.front[0] == "Ranges"); + auto m = matchAll("Ranges are hot!", r"(\w)\w+(\w)"); // at least 3 "word" symbols + assert(m.front[0] == "Ranges"); // front - first of input range // m.captures is a historical alias for the first element of match range (.front). assert(m.captures[1] == m.front[1]); - auto captures = m.front; - assert(captures[2] == "s"); - foreach (c; captures) - writeln(c); // will show lines: Ranges, R, s + auto sub = m.front; + assert(sub[2] == "s"); + foreach (item; sub) + writeln(item); // will show lines: Ranges, R, s --- $(P By playing by the rules $(STD regex) gets some nice benefits in interaction with other modules e.g. - this is how one could count non-empty lines in text buffer - (just don't forget to import $(STD algorithm) when trying that at home): + this is how one could count non-empty lines in text buffer: ) --- + import std.algorithm, std.file; auto buffer = std.file.readText("regex.d"); int count = count(match(text, regex(r"^.*\P{WhiteSpace}+.*$","gm"))); --- - $(P This by the way tells me that $(STD regex) has 7128 non-blank lines as of this writing. But let's get back to the regular expression itself. - A seasoned regex user catches instantly that Unicode properties are supported with perl-style \p{xxx}, to spice that all of Scripts and Blocks are supported. - Let us dully note that \P{xxx} means not having an xxx property, i.e here not a white space character. - Refer to the Unicode standard $(LINK2 http://Unicode.org/reports/tr18/, UTS 18) for full list and specific details. - ) - - $(P Another thing of importance is the option string - "gm", where g stands for global and m for multi-line mode. - While global is obvious in this context, what multi-line mode does deserves some explanation. + $(P This by the way tells me that $(STD regex) has 7128 non-blank lines as of this writing. + But let's get back to the regular expression itself. + A seasoned regex user catches instantly that Unicode properties are supported with perl-style \p{xxx}, + to spice that all of Scripts and Blocks are supported as well. Let us dully note that \P{xxx} means not + having an xxx property, i.e here not a white space character. A Unicode is vital subject to know, and it won't suffice + to try to cover it here. For details see level 1 of conformance as per Unicode standard + $(LINK2 http://Unicode.org/reports/tr18/, UTS 18). ) - $(P Historically utilities that supported regex patterns (unix grep, sed, etc.) processed text line by line. - At that time anchors like ^, $ meant the start of the whole input buffer that has been same as that of line. + $(P Another thing of importance is the option string - "m", where m for multi-line mode. + Historically utilities that supported regex patterns (unix grep, sed, etc.) processed text line by line. + At that time anchors like ^, $ meant the start of the whole input buffer that has been same as that of the line. As regular expressions became more ubiquitous the need to recognize multiple lines in a chunk of text, - with anchors ^ & $ matching before and after line break literally. For the curious, line break is defined as - (\n | \v | \r | \f | \u0085 | \u2028 | \u2029 | \r\n). Needless to say, one need not to turn on multi-line mode if not using any of ^, $. + with anchors ^ & $ matching before and after line break literally. For the curious, modern (Unicode) line + break is defined as (\n | \v | \r | \f | \u0085 | \u2028 | \u2029 | \r\n). + Needless to say, one need not to turn on multi-line mode if not using any of ^, $. ) $(P Now that search was covered, the topic suggest that it's about time to do some replacements. @@ -134,20 +139,19 @@ $(D_S Regular Expressions, ) --- auto text = readText(...); - auto replaced = replace(text, - regex(r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})","g"), "$(DOLLAR)3-$(DOLLAR)1-$(DOLLAR)2"); + auto replaced = replaceAll(text, r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})".regex, "$3-$1-$2"); --- - $(P The "all dates" part comes from the fact that regex has global option set, omit it to replace only the first occurrence. - As seen the replacement is controlled by format string not unlike one in C's famous printf. + $(P The $(D "pattern".regex) is just another notation of writing $(D regex("pattern")) called UFCS that some + may find more slick. As can be seen the replacement is controlled by format string not unlike one in C's famous printf. The $1, $2, $3 substituted with content of sub-expresions. Aside from referencing sub-matches, one can include the whole part of input preceding the match via $$(BACKTICK) and $' for the content following right after the match. ) $(P Ok, now let's aim for something bigger, this time to show that $(STD regex) can do things that are unattainable by classic textual substitution alone. Imagine you want to translate a web shop catalog so - that it display prices in your currency. Yes, one can use calculator or estimate it, once having current ratio. - Being programmers we can do better, so let's wrap up a simple program that converts text to use correct prices everywhere. - For the sake of example let it be UK pounds and US dollars. + that it displays prices in your currency. Yes, one can use calculator or estimate it in his/her head, + once having current ratio. Being programmers we can do better, so let's wrap up a simple program that + converts text to use correct prices everywhere. For the sake of example let it be UK pounds and US dollars. ) --- import std.conv, std.regex, std.range, std.file, std.stdio; @@ -156,7 +160,7 @@ $(D_S Regular Expressions, void main(string[] argv) { immutable ratio = 1.5824; // UK pounds to US dollar as of this writing - auto to_dollars(Captures!string price) + auto toDollars(Captures!string price) { real value = to!real(price["integer"]); if (!price["fraction"].empty) @@ -164,32 +168,34 @@ $(D_S Regular Expressions, return format("$%.2f",value * ratio); } string text = std.file.readText(argv[1]); - auto converted = replace!to_dollars(text, + auto converted = replaceAll!toDollars(text, regex(r"£\s*(?P[0-9]+)(\.(?P[0-9]{2}))?","g")); write(converted); } --- $(P Getting current conversion rates and supporting more currencies is left as an exercise for the reader. - What at work here is so-called replace with delegate, analogous to a callout ability found in other languages and regex libraries. - The magic is simple: whenever replace finds a match it calls a user supplied callback on the captured piece, - then it uses the return value as replacement. And so on for other matches, given the "g" flag of regex. + What at work here is so-called replace with delegate, analogous to a callout ability found in other languages + and regex libraries. The magic is simple: whenever replace finds a match it calls a user supplied callback + on the captured piece, then it uses the return value as replacement. ) - $(P I couldn't resist to spice this example up with yet another feature - named groups. - Names work just like opaque aliases for numbers of captured subexpressions. - Meaning that with the same exact regular expression one could as well change these lines to: + $(P And I can't just resist to spice this example up with yet another feature - named groups. + Names work just like aliases for numbers of captured subexpressions. + Meaning that with the same exact regular expression one could as well change affected lines to: --- real value = to!real(price[1]); if (!price[3].empty) value += 0.01*to!real(price[3]); --- - Though that lacks readability and is not future proof. Also note that optional captures are still represented, - it's just they can be an empty string if not matched. + Though that lacks readability and is not as future proof. ) - $(H3 More features, more options) + $(P Also note that optional captures are still represented, it's just they can be an empty string if not matched. + ) + + $(H3 Split it up) - $(P As core functionallity was already presented, let's move on to extras. + $(P As core functionallity was already presented, let's move on to some extras. Sometimes it's useful to do almost the opposite of searching - split up input using regex as separator. Like the following sample, that outputs text by sentences: ) @@ -198,31 +204,33 @@ $(D_S Regular Expressions, writeln(sentence); --- $(P Again the type of splitter is range, thus allowing foreach iteration. - Notice the usage of lookaround in regex, it's a neat trick here as - stripping off final punctuation is not our intention. Breaking down this example, (?<=[.?!]) part looks behind for first ., ? or !. + Notice the usage of lookaround in regex, it's a neat trick here as stripping off final punctuation is + not our intention. Breaking down this example, (?<=[.?!]) part looks behind for first ., ? or !. This get us half way to our goal because \s* also matches between elements of punctuation like "?!", so a negative lookahead is introduced $(I inside lookbehind) to make sure we are past all of the punctuation marks. Admittedly, barrage of ? and ! makes this regex rather obscure, more then it's actually is. Observe that there are no restrictions on contents of lookaround expressions, one can go for lookahead inside lookbehind and so on. - However in general it's recommended to use them sparingly, keeping them as weapon of last resort. + However in general it's recommended to use them sparingly, keeping them as the weapon of last resort. ) - $(P Now for something completely different yet exactly the same. For one, there is an ability - to precompile constant regex at compile-time: + $(H3 Static regex) + $(P Let's stop gonig for features and think perfromance. And D has something to offer here. + For one, there is an ability to precompile constant regex at compile-time: ) --- static r = regex("Boo-hoo"); assert(match("What was that? Boo-hoo?", r)); --- $(P Importantly it's the exact same Regex object that works through all of the API we've seen so far. - It's just takes less then 1 μs of run-time to initialize. - Run-time version took around 10-20 μs on my machine, - keep in mind that it was a trivial pattern. - ) + It's just takes next to nothing to intialize, just copy over ready-made structure from the data segement. + + Roughly ~ 1 μs of run-time to initialize. Run-time version took around 10-20 μs on my machine, keep in mind that it was a trivial pattern. + $(P Now stepping futher in this direction there is an ability to construct specialized - native machine code that matches a given regex to speed up matching. - All of this at compile time, and again the module usage stays consistent! + D code code that matches a given regex and compile it instead of using default run-time engine. + Isn't it so often the case that code starts with regular expressions only to be later re-written + to heaps of scarry-looking code to match speed requirements. No need to rewrite - we got you covered. ) $(P Recalling the phone number example: ) @@ -235,17 +243,20 @@ $(D_S Regular Expressions, assert(m.captures[0] == "+31 650 903 7158"); assert(m.captures[1] == "31"); --- - $(P In this particular case it matched roughly 50% faster for me though I haven't done comprehensive analysis of this case. - Not to spoil the excitement, but as of this writing the feature is experimental and temporarily lacks some extensions. - Continuing the gloomy trend, be warned that compiler might choke on it, and it's also less tested. - That being said, there is no doubt ctRegex facility will only improve over time. - So start with run-time version, then go compile-time if willing to experiment or - when need to squeeze out some more performance. + $(P Interstingly it looks almost exactly the same (a total of 5 letters changed), yet it does all of + the hard work - generates D code for this pattern, compiles it (again) and masquarades under the same API. + Which is the key point - the API stays consistent, yet gets us the significant speed up we sought after. + This fosters quick iteration with the $(D regex) and if desired a decent speed with $(D ctRegex) in the release build (at the cost of slower builds). + ) + $(P In this particular case it matched roughly 50% faster for me though I haven't + done comprehensive analysis of this case. + That being said, there is no doubt ctRegex facility will immensly improve over time, we only scratch the surface of posibilities here. + More data on real-world patterns, performance of CT-regex and other benchmarks$(LINK2 https://github.com/DmitryOlshansky/FReD, here). ) $(H3 Conclusion) $(P The article represents a walkthrough of $(D std.regex) focused on showcasing the API. - By following a series of easy yet meaningful tasks, its features are exposed in combinations, + By following a series of easy yet meaningful tasks, its features were exposed in combination, that underline the elegance and flexibility of this library solution. The good thing that not only the API is natural, but it also follows established standards and integrates well with the rest of Phobos. @@ -254,15 +265,15 @@ $(D_S Regular Expressions, $(UL $(LI Fully Unicode-aware, qualifies to standard full level 1 Unicode support) $(LI Lots of modern extensions, including unlimited generalized lookaround. - It makes porting regexes from other libraries a breeze) - $(LI Lean API that consists of a few flexible tools: $(D match), $(D replace), $(D splitter) ) - $(LI Uniform and powerful, with unique abilities like precompiling regex or using ctRegex - specially tailored engine at compile-time. All of this fits within the common interface without a notch.) + That makes porting regexes from other libraries a breeze) + $(LI Lean API that consists of a few flexible tools: $(D matchFirst)/$(D matchAll), $(D replaceFirst)/$(D replaceAll) and $(D splitter).) + $(LI Uniform and powerful, with unique abilities like precompiling regex or generating + specially tailored engine at compile-time with ctRegex. All of this fits within the common interface without a notch.) ) - $(P The format of this article is intentionally more of an overview, as such it doesn't stop to talk in-depth about - certain capabilities like case-insensitive matching (via simple casefolding rules of Unicode), - backreferences, lazy quantifiers. - And even more features are coming, as there are interesting opportunities not exploited. + $(P The format of this article is intentionally more of an overview, it doesn't stop to talk in-depth about + certain capabilities like case-insensitive matching (simple casefolding rules of Unicode), + backreferences, lazy quantifiers. And even more features are coming to add more expressive power + and reach greater performance. ) )