Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add functional style regex pattern-matching #1241

Closed
wants to merge 8 commits into from

Conversation

idanarye
Copy link
Contributor

@idanarye idanarye commented Apr 4, 2013

Based on Scala's pattern-matching with regular expressions syntax, I created the function std.regex.regexSwitch(formerly switchRegex). It's main use-case is for reading textual input fields that can be in multiple formats:

    enum WeekDay
    {
        Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday
    }

    struct Schedule
    {
        WeekDay day;
        int hour;
    }

    assert(equal(
                [
                    Schedule(WeekDay.Sunday, 1),
                    Schedule(WeekDay.Monday, 14),
                    Schedule(WeekDay.Tuesday, 3),
                    Schedule(WeekDay.Wednesday, 4),
                    Schedule(WeekDay.Thursday, 17),
                    Schedule(WeekDay.Friday, 6),
                ],
                [
                    "Sunday 1AM",
                    "Monday 2PM",
                    "Tuesday 3",
                    "4AM Wednesday",
                    "5PM Thursday",
                    "6 Friday",
                ].map!(regexSwitch!(
                        `(\w+) (\d+)AM`, (WeekDay day, int hour) => Schedule(day, hour % 12),
                        `(\w+) (\d+)PM`, (WeekDay day, int hour) => Schedule(day, hour % 12 + 12),
                        `(\w+) (\d+)`, (WeekDay day, int hour) => Schedule(day, hour),
                        `(\d+)AM (\w+)`, (int hour, WeekDay day) => Schedule(day, hour % 12),
                        `(\d+)PM (\w+)`, (int hour, WeekDay day) => Schedule(day, hour % 12 + 12),
                        `(\d+) (\w+)`, (int hour, WeekDay day) => Schedule(day, hour),
                        ))())); 

@DmitryOlshansky
Copy link
Member

Overall - the stuff is neat and something I had in mind for a long time.

There is one big concern about the current implementation though.

Doing one by one matching is not good performance-wise.
For K patterns, M inputs: (K times start engine, K times stop engine ) * M.

We could get is just M times of start, stop. (not even speaking that patterns are quite simillar in many cases)

What I planned is the following :

  1. Create 1 big pattern by merging all of the cases in an equivalent of (a|b|c) alternative with few tweaks (internal stuff, not string concatenation, of course).
  2. The we run enigine exactly once and it tells us which pattern matched (returns number of matching sub-pattern).
  3. The merged pattern is cached on first call (just like match with plain string)

So this leaves me wondering if it's readily amendable to what I plan on the implementation side without changing the interface. Looks like it should.

I'll review this pull more throughly but of the top of my head:

check 0 patterns and 1 pattern cases in regexSwitch unittest

  • at least one failure test (argument mismatch etc)..

@idanarye
Copy link
Contributor Author

idanarye commented Apr 4, 2013

@blackwhale How should I go about combining the patterns together? I have next to zero knowledge of regex implementation, but from a quick glance at the source of std.regex I see that most of the implementation is done in Parser, and most of it's methods are for internal use, not for modification Parser structs from outside.

Also, my original design was to get patterns and handlers as template arguments, and use ctRegex to compile the patterns once, on compilation, but it turned out to have a significant compilation slowdown, so instead I moved patterns and handlers to regular arguments, allowing the user of switchRegex to choose their own way of regex compilation(ct, saving regex object as static variable, or sending string to switchRegex which will compile them on the fly). Will your implementation improvement suggestion benefit from going back to template arguments?

@DmitryOlshansky
Copy link
Member

How should I go about combining the patterns together?

Well I don't expect you since to it's not strightforward as there is (sadly) no API for constructing regex from pieces manually. Plus we need to tweak one VM instruction - IR.End to return the number of sub-pattern.

Thus let's focus on the interface and being able to do that later.

Will your implementation improvement suggestion benefit from going back to template arguments?

Hard question as by having them as template arguments we would need to use ctRegex (otherwise it's just causing tempalte bloat all over the place). If we pick them as normal arguments then we need to use normal non-CT patterns.

All in all I'd say it needs both versions if we are to support both ctRegex and Regex properly. Think of it this way - since merging patterns together into a switch produces a new pattern it has the same interface problem as the original regex vs ctRegex.

So regexSwitch and ctRegexSwitch ? (switchRegex sounds like I'm about to hot swap one pattern for another ;) )

@idanarye
Copy link
Contributor Author

idanarye commented Apr 5, 2013

@blackwhale I moved from template arguments to regular arguments to allow things like:

    auto r1=regex(`...`);
    auto r2=regex(`...`);
    auto foo(string str){
        return str.switchRegex(
            r1,(...)=>...,
            r2,(...)=>...,
        );
    }

But since the new implementation is going to create a big compound automaton for all patterns, this is probably redundant - and might even make it harder to implement. I don't want an interface that will allow things that will make it harder to implement your suggestion.

Leaving that behind, the only interface advantage the regular arguments version have is the ability to supply regular expressions and handler delegates in runtime. But this option was limited to begin with, since delegate signature must match the pattern, and anyways if a user of this function makes a mechanism for supplying patterns and handler delegates at runtime, they already loose most of the elegance switchRegex offers.

Now, the template argument version has one clear advantage in terms of elegance - if you want to use a higher order function, or to create an alias, switchRegex with all the patterns and handler delegates is already a delegate, so the example above could be replaced with:

    assert(equal(
                [
                    Schedule(WeekDay.Sunday, 1),
                    Schedule(WeekDay.Monday, 14),
                    Schedule(WeekDay.Tuesday, 3),
                    Schedule(WeekDay.Wednesday, 4),
                    Schedule(WeekDay.Thursday, 17),
                    Schedule(WeekDay.Friday, 6),
                ],
                [
                    "Sunday 1AM",
                    "Monday 2PM",
                    "Tuesday 3",
                    "4AM Wednesday",
                    "5PM Thursday",
                    "6 Friday",
                ].map!(switchRegex!(
                        `(\w+) (\d+)AM`, (WeekDay day, int hour) => Schedule(day, hour % 12),
                        `(\w+) (\d+)PM`, (WeekDay day, int hour) => Schedule(day, hour % 12 + 12),
                        `(\w+) (\d+)`, (WeekDay day, int hour) => Schedule(day, hour),
                        `(\d+)AM (\w+)`, (int hour, WeekDay day) => Schedule(day, hour % 12),
                        `(\d+)PM (\w+)`, (int hour, WeekDay day) => Schedule(day, hour % 12 + 12),
                        `(\d+) (\w+)`, (int hour, WeekDay day) => Schedule(day, hour),
                        ))()));

Notice I removed schedule => schedule. - in shorter examples it might be more noticeable.

As for your concern about template bloat - I don't think the template arguments version will be much "bloatier" than the regular argument version. The regular version already needs to instantiate for every combination of delegate argument types and return values, and I doubt there will be many cases of calls for switchRegex that share the same delegate signatures and return type but have different patterns and delegate code...

All that being said - the template argument version has a clear disadvantage that you need to pass the variadic template argument composed of the delegate directly - it cannot be inferred from the regular variadic argument like in the regular arguments version, because there is no regular variadic argument.. That means that the variadic template argument must be the first template argument. But variadic template argument must also be the last template argument. Therefore, it must be the only template argument, which means we can't have a template argument for the input text's type, so it must always be string.

@DmitryOlshansky
Copy link
Member

Leaving that behind, the only interface advantage the regular arguments version have is the ability to supply regular expressions and handler delegates in runtime. But this option was limited to begin with, since delegate signature must match the pattern, and anyways if a user of this function makes a mechanism for supplying patterns and handler delegates at runtime, they already loose most of the elegance switchRegex offers.

Then what we are trying to achive is to implement siwtch with regex patterns + conversions that some languages have as built-in. If that case then having patterns be constant strings and delegates being passed as alias parameters fits the bill just fine.

That means that the variadic template argument must be the first template argument. But variadic template argument must also be the last template argument. Therefore, it must be the only template argument, which means we can't have a template argument for the input text's type, so it must always be string.

It surely can just make your regexSwitch a template of tempalte, see also std.algortithm map for canonical example:

tempalte regexSwitch(Args...)
{
     auto regexSwitch(Input)(Input input){ ... }
}

This also makes it possible to alias regexSwitch!(...) and then use the alias as callable to simplify code structure.

@idanarye
Copy link
Contributor Author

idanarye commented Apr 6, 2013

OK, I changed it back to template argument version, and renamed it to regexSwitch. I also added those unit tests you suggested.

@DmitryOlshansky
Copy link
Member

Another idea I've got that is very closely related to this one is scanf on steriods. This is what you are already doing but as far as I can tell it's all hardcoded to to!TargetType. This more flexible way could be introduced in the other function/extension say:

auto rangeOfTuples = extract(match(...), <delegates to call on each capture> );

The same could go with UFCS style call. And the simpliest way, of course, is to just cal it with to!Type:

auto rangeOfTuple = extract(match("(\d+) (\w+)", to!int, to!string);

AFAIK to!string of string is the same string (since it's immutable there is no harm done). The good thing about this simplish API is that it doesn't add much of bloat and rides on top of what you did (a bit more generalize) so the code is basically the same and can be re-used.

P.S. Before I forget - Let's do it Walter's Style: open an ehancement, fill in the details of enhancement and include into one of the commit messages "fix issue xyz".

@idanarye
Copy link
Contributor Author

idanarye commented Apr 7, 2013

I don't really understand your "P.S." - where should I open an enhancement? And why should a new feature be labeled as an issue?

Anyways, about that extract function: I don't really see how you want to implement it on top of regexSwitch...

@DmitryOlshansky
Copy link
Member

Oh the joy of posting at 2 A.M. :)

But all in due order.

Yes, Walter wanted that any enhancements being done to be opened as issue in Bugzilla so that it's automatically shown in changelog next time around.

About extract - I've meant that your
public auto callWithCaptures(CapturesType, DelegateType)(CapturesType captures, DelegateType dlg)

is fairly close. I don't want to push you to do it, just noting a similar (in my eyes) functionality.

BTW I do NOT think callWithCaptures is good enough name for public API. Either think it through better or make it private for now.

callWithDelegate converts captures and fires up an action.

Extract just converts and let's user pick what kind of conversion to use. It's a data ---> data conversion, thus suitable as a range adapter.

static assert(__traits(compiles, (match(input, ""))),
"Unable to match on " ~ T.stringof ~ ".");
static assert(Patterns.length % 2 == 0,
"Wrong number of arguments - each pattern must have an handler.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick - 'a handler'

Let's proof read the messages as these are the things that what makes people think "fine stuff" vs "some half-baked crap".
(Sadly so the number of typos and minor mistakes is often percieved as lack of quality)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*perceived

Hihi.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't stress that better, huh ? :)

@DmitryOlshansky
Copy link
Member

Otherwise this is LGTM.

And I'm pleased to see this contribution came by.

@idanarye
Copy link
Contributor Author

idanarye commented Apr 8, 2013

Fixed that typo(found a few more in the doc along the way, fixed them too)

The function std.regex.regexSwitch matches a string on multiple regular
expressions, and runs a delegate based on which regex matched the input
string.

The function std.regex.callWithCaptures is a helper function that calls a
delegate with regex captures distributed into it's parameters. It was created
to be used in regexSwitch, but it's generic enough to have other uses so I made
it a separate function.
jmdavis and others added 5 commits June 19, 2013 08:47
This moves std.traits.unsigned to std.conv and adds std.conv.signed,
since for some reason, we have std.traits.unsigned but do not have
std.traits.signed.
Implement issue# 10314 (add std.traits.signed).
Convert C-style array declarations to D-style.
@DmitryOlshansky
Copy link
Member

Needs a rebase.

idanarye added 2 commits July 5, 2013 03:36
The function std.regex.regexSwitch matches a string on multiple regular
expressions, and runs a delegate based on which regex matched the input
string.

The function std.regex.callWithCaptures is a helper function that calls a
delegate with regex captures distributed into it's parameters. It was created
to be used in regexSwitch, but it's generic enough to have other uses so I made
it a separate function.
…bos into add-regex-pattern-matching

Conflicts:
	std/regex.d
@idanarye
Copy link
Contributor Author

idanarye commented Jul 5, 2013

@blackwhale I rebased it, but when I try to squash the commits it always breaks everything. Is it OK if I don't squash them?

@DmitryOlshansky
Copy link
Member

Mmm this is not a rebase as it shows up stray commits and the "Merge ..." stuff .

Typical setup (provided you had clean but outdated state) involves
git pull --rebase upstream master
where upstream remote is configured to point at D-Programming-Language/phobos.git

The simple way out (besides some git kung-fu) is to create a new branch drop the changes here and re-post pull request, linking ot this one.

@idanarye
Copy link
Contributor Author

idanarye commented Jul 5, 2013

OK, here is the new one - #1392

I'm closing this one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants