Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow iteration through named subgroups of Capture object #1114

Closed
wants to merge 11 commits into from

Conversation

lclarkmichalek
Copy link
Contributor

Currently there is no way to get the names of the named subgroups from a captures/match object (afaict). This patch allows the user to iterate through the captures object and get all named group names and values.

I've no idea if this is the best API, but even if this patch isn't accepted, it'd be really useful to have some way of getting the names of the named subgroups in a Capture object.

@DmitryOlshansky
Copy link
Member

Yeah, it's an overlooked piece of API, I kind of been wondering when somebody will notice it ;)

Iteration is good but some more straight-forward and direct is needed in addition to convenience.
What about adding a property: aliases or namedCaptures to struct Regex and have it forwarded by Captures.

In the end it's a property of Regex that doesn't change on each match so it's the place to put this stuff.

BTW it's seems customary now to file an enhancement request in bugzilla and reference it in commit message as "fixes Issue xxxx".

@lclarkmichalek
Copy link
Contributor Author

I agree iteration does not seem to be enough; I cringed as I wrote the code. However, I'm not particularly aquatinted with the D standards on stuff like returning a slice vs exposing a range etc, so I thought iteration would be the least offensive method of exposing this information.

You're probably correct that it should be a property of Regex and to be honest, the only reason that I added it to the Captures object is because the name Captures seems more closely related to the concept of named sub-captures.

And yes, I guess I will submit myself to the bureaucracy of bugzilla in future (or would it be better to move this discussion there now?)

@DmitryOlshansky
Copy link
Member

However, I'm not particularly aquatinted with the D standards on stuff like returning a slice vs exposing a range etc, so I thought iteration would be the least offensive method of exposing this information.

The standard is simple - expose an opaque range of the required power unless there is an advantage to expose the concrete type. The key idea is to leave future contributors some room for improvement of internals w/o breaking interface (and user code) e.g. if we expose an array we then give on any of storage tricks inside of Regex.
For instance Captures is exposed (not opaque type) because of replace with delegate that need this type. It's not an array of slice so that I (and others) may enhance its internals later on. See also std.range source for examples of opaque range types.

Bugzilla is mainly for history and discussions on bugs/enhancement until there is a pull that address it.
In that sense adding issue after pull is just leaving a short-note for future change log entry.

Here we discuss implementation of pull so I believe there is no need to move discussion to Bugzilla but an entry for change log is good to have (so the change is announced with new release).

@lclarkmichalek
Copy link
Contributor Author

I see how opaque ranges work now, very elegant. I do have some qualms with moving the named captures range to RegexMatch however:

  • The NamedGroup array that contains the actual data is in the Captures struct, meaning that implementing the named captures range on the RegexMatch struct would require either exposing that array, keeping a duplicate in RegexMatch, or moving it back to RegexMatch
  • The current API for accessing the value in a named capture is also via the Captures struct, meaning that code that only wants the named match will have to look in two different places i.e.
auto m = match(foobar, regex(reg));
foreach (name; m.namedGroups) {
     auto value = m.captures[name];
     writeln(name, " maps to ", value);
}

which is hardly elegant. As changing the existing API is not a good thing to do, I think the best thing to do is to add it to the Captures struct. This is still bad, as it means Captures is becoming significantly more than just a non opaque range, but it's already that way with the opIndex(string) method (imo).

@DmitryOlshansky
Copy link
Member

I do have some qualms with moving the named captures range to RegexMatch however:

Rightfully so what I meant was moving the functionality to Regex template not RegexMatch. Then aliases gives pairs of name --> number of match. (keep in mind named matches are an aliases to numberd ones)

Then about captures - what if we are to make:

auto m = match(foobar, regex(reg));
foreach (c; m) {
  auto withName = c.named; //gives range of pairs - (name, slice) for now Forward should do
  foreach(pair; withName)
   writeln(pair[0]," : ", pair[1]);
}

Soounds good?

UPDATE: Agree about Captures, they are somewhat more then opaque range. And I mean add both things: aliases to Regex struct itself and .named or .aliased property to Captures.

@lclarkmichalek
Copy link
Contributor Author

What type would you suggest for the pair? Tuple!(string, R)? That does particularly seem to be a candidate for opApply to me, for it is no longer adding core functionality (assuming a namedCaptures property somewhere else) and is simply acting as a short cut. Alternatively, it might be easier to just leave it for the moment and just do one thing at a time.

I agree about implementing namedGroups on Regex.

@DmitryOlshansky
Copy link
Member

What type would you suggest for the pair? Tuple!(string, R)?

Yes, that fits the bill. If adventurous enough you may try Tuple!(string, "name", R, "slice") if that works fine.

About opApply AFAICT there is auotmatic tuple unpacking (or I may be daydreaming) that mens that if range returns tuple one can do:

foreach(first, second; range){ .... }

opApply is kinda old, and was devised way before ranges with things like traversal of complex tree-like structures in mind.

That does particularly seem to be a candidate for opApply to me, for it is no longer adding core functionality (assuming a namedCaptures property somewhere else) and is simply acting as a short cut. Alternatively, it might be easier to just leave it for the moment and just do one thing at a time.

Let's settle on this tiny plan: one pull for namedCaptures and (if you are inclined) another one for Captures.named. In a sense the latter it's just a shortcut for filter + map applied to captures.

namedCaptures should go as random access range or at least forward+length (even length alone allows for the some powerful shortcuts in std.algorithm).

@JakobOvrum
Copy link
Member

About opApply AFAICT there is auotmatic tuple unpacking (or I may be daydreaming) that mens that if range returns tuple one can do:

foreach(first, second; range){ .... }

That doesn't work in 2.061, but I think there is an open pull request for it.

Either way, it can be implemented in a library like this:
https://gist.github.com/JakobOvrum/4725281

@lclarkmichalek
Copy link
Contributor Author

Ok, namedCaptures now seems to implement every range under the sun (well, RandomAccess, Forward, Length, and all other ranges implied by those). I wasn't sure 100% how to document which interfaces it implements, as it's not obvious from the method prototype, so I added a static assert(isRandomAccess!(typeof(ng))) to the example. Not sure if that's the correct way to do it, but it should get the message across. I'll open a new pull request for the .named property of the Captures object.

void popBack() { assert(!empty); end--; }
string opIndex()(size_t i)
{
assert(start + i < end, "Request named group is out of range.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: RequestED named group...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mm true. I didn't put any messages for most of the asserts in the class, is that ok?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor thing, other range are too not very verbose with asserts + the upper two won't ever trigger in user code excpet if it extracts type with typeof(namedCaptures).
Yup and to be true DbC fellow add asserts to in{ ... } contracts.

Thinking a bit about name - maybe make it namedGroups since it didn't capture anything yet as in match ;)

@DmitryOlshansky
Copy link
Member

Another minor comment is that I'm surprized that RandomAccess doesn't require slicing ... I thought there was a movement to mandate it, I'll double check std.range.

@DmitryOlshansky
Copy link
Member

Looking good. The last thing is to actually put that example into unittest so that it's ehm tested :)
And throw in some random access like [0] and [1] along the way. Add more exhaustive testing but outside of example as it should be simple.

We seem to have automatic examples in 2.062 beta but it goes the other way around - from documented unittest towards auto-generated example sections in docs, see initial pull here:
dlang/dmd#1342

Let's try the new non-verbose way of testing examples in std.regex with this pull and see how it goes.

@DmitryOlshansky
Copy link
Member

Aaaand it looks like we hit a compiler limitation (bug)? Behold the pull auto-tester in it's full glory :)

It seems like you can't have unittest inside of Regex that relies on the outside function that relies on Regex being fully analyzed ... Sounds like compiler should be able to anyalze struct before analyzing unittests (at least to me) but it doesn't.

In any case the only way out for now seems like duplicate the example in the docs and put unittests outside of Regex (say right after it). Let's do this and finally pull this in, thanks.

@lclarkmichalek
Copy link
Contributor Author

Bump :)

@DmitryOlshansky
Copy link
Member

Yes, you've got to ping reviewers once you've addressed comments :)
Looks like there is unrelated failure on win64.
Then this set of commits would be better squashed to just one.

@DmitryOlshansky
Copy link
Member

Ping !
What's left is a simple bit of work - squashing all of commits to 1 (thus removing all of noisy transitory work like correcting errors and moving tests around).

See e.g. http://blog.steveklabnik.com/posts/2012-11-08-how-to-squash-commits-in-a-github-pull-request

@andralex
Copy link
Member

OK to close this since #1385 took care of it?

@DmitryOlshansky
Copy link
Member

Yup, #1385 is exactly the same commits squashed into one.

@yebblies yebblies closed this Jul 12, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants