Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import named regex captures as fish variables #7459

Closed
wants to merge 9 commits into from

Conversation

mqudsi
Copy link
Contributor

@mqudsi mqudsi commented Nov 7, 2020

Add support for importing named regex matches

The new commandline switch string match --regex --import will import
as fish variables any named capture groups with the matched captures as
the value(s).

Example:

~/fish-shell > set spec (identify fish.png)
~/fish-shell > printf "%s\n" "$spec"
fish.png PNG 275x275 275x275+0+0 8-bit sRGB 8.43KB 0.000u 0:00.000
~/fish-shell > string match -rq --import "^(?<file>[^ ]+) (?<fmt>[^ ]+) (?<width>\d+)x(?<height>\d+)" -- "$spec"
~/fish-shell> printf "$file\n * width: $width\n * height: $height\n"
fish.png
 * width: 275
 * height: 275

Copy link
Member

@krobelus krobelus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet, this looks really powerful!
I'm not sure why the MacOS tests are failing.

Can we enable this whenever a named capture group occurs in the regex?
This way we don't need the --import flag.

If we ever want to support unnamed captures, we'd probably need the flag again.
($1, $2, ... are open though a single array variable seems more idiomatic).

src/builtin_string.cpp Outdated Show resolved Hide resolved
// We don't have a way of having empty values in the middle of a multi-entry fish array,
// so we compromise by leaving the value unset in the case of !opts.all but assign an
// empty value otherwise. (opts.all is always true if !first_time)
if (!value_found && opts.all) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what the behavior with multiple arguments should be.
If we support that we should probably clear the capture variables before any argument is processed.

Currently the behavior is consistent, but always resets to the match of the last argument:

$ build/fish -c '
  string match -rqI "(?<word>[^. ]+) ?(?<punct>[.])?" "dot1.nodot1" "nodot2 dot2"
  set -S word punct
  '
$word: set in global scope, unexported, with 1 elements
$word[1]: |nodot2|
$punct: set in global scope, unexported, with 0 elements

If I remove this part

            if (first_time) {
                // Make sure we clear out any existing values matching the named groups, because
                // we'll be appending instead of overwriting from here on out.
                vars.set_empty(name_entry.name, ENV_DEFAULT);
            }

then I get something that looks inconsistent, because I'm matching against multiple strings but without --all to trigger this insertion of emtpy strings for missing captures.

$ build/fish -c '
  set word; set punct
  string match -rqI "(?<word>[^. ]+) ?(?<punct>[.])?" "dot1.nodot1" "nodot2 dot2"
  set -S word punct
  '
$word: set in global scope, unexported, with 2 elements
$word[1]: |dot1|
$word[2]: |nodot2|
$punct: set in global scope, unexported, with 1 elements
$punct[1]: |.|

I guess we could extend this compromise to cases where multiple arguments are passed,
conceptually (opts.all || aiter.size() > 1).

src/builtin_string.cpp Outdated Show resolved Hide resolved
public:
pcre2_matcher_t(const wchar_t *argv0_, const wcstring &pattern, const options_t &opts,
io_streams_t &streams)
io_streams_t &streams, parser_t &parser_)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually member variables have the underscore. So the parameter would be parser and the member variable would be parser_.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ridiculousfish I agree, but was sticking to the local convention (see parameter argv0_). Would you like me to change them all?

@ridiculousfish
Copy link
Member

This is awesome! I can see a lot of scripts being simplified by this! I agree with krobelus that we do not need the --import option, we should just always do it!

Please update the string-match.rst docs. In particular we should clarify the behavior on multiple arguments.

An unresolved question is what scope the variables get set in. I don't think anybody wants a bunch of additional scope flags like we have for read, so maybe it always gets set as ENV_LOCAL?

@mqudsi
Copy link
Contributor Author

mqudsi commented Nov 10, 2020

Thanks, guys. I was thinking of things that could be done to improve the correctness of shell scripts, and struck upon this as low-hanging fruit that could provide some limited safety by at least reducing manual indexing and string manipulation, if not also guaranteeing the shape of the resulting variables.

I'm cautiously happy to try making it import by default. I think it could really go some ways to make regex in fish nicer, but with some caveats pertaining to scope and specific variables (more below). I did not think of doing $1, $2, etc. for unnamed groups - that's certainly a thought worth exploring!

In addition to the odd behavior that @krobelus spotted (thanks, I had focused on the behavior of single matching vs non-matching and subsequent --all matching and non-matching, but completely missed --all not matching even the first time around -- my intention is for the same behavior as the single non-matching), there are a few other issues which (in addition to updating the docs and release notes) are part of the reason I opened a PR:

  • Currently the name table is re-enumerated for each result because I was originally shoe-horning this code into report_match() which doesn't have any "big picture" access, but ended up being forced to move it to the matcher itself. Splitting this into an init function that enumerates the names and creates a lambda that can be used in case of a match and in case of no match will clean up both the repeated enumeration and provide a way to clear the vars when no matches are found.
  • I didn't do more than a cursory glance at the PCRE2 spec to see if a valid PCRE2 group name is always a valid fish variable name. I think they are, but we need to verify that (and figure out what to do if they're not).
  • What do we do when an illegal variable name is specified, e.g. version or fish_pid?

I intentionally did not introduce scoping controls and of course @ridiculousfish caught that immediately :) I think we're both in agreement about not wanting to add too many knobs here; I have been giving scopes some thought, and I think something very similar to the default set behavior (function-local unless already extant) makes the most sense as a starting point? This is definitely going off topic, but my "ideal scope" for an interactive scripting language like a shell would be "for writes, universal and global variables are never modified implicitly, and default scope is local unless a variable is previously defined within the function; for reads, all scopes are considered." e.g. very close to the current default function-local behavior, but more conservative within blocks (e.g. let instead of var, in JS-speak).

I don't think hard-coding ENV_LOCAL as the scope would work: I can easily envision cases where you want the variables to be exposed outside the block, e.g.

# consider both with and without the following two lines
set word1
set word2

if test $version -gt 2
    foo | string match -rq "(?<word1>hello) (?<word2>world)"
else
    foo | string match -rq "(?<word1>adios), (?<word2>amigo)"
end

echo $word1 $word2

With a hard-coded local scope, the function-local declarations on line 2 would not help hoist the scope of the variables, right? (Also, ENV_DEFAULT is the same behavior as an unadorned set, right?)

@mqudsi mqudsi force-pushed the regex_import branch 3 times, most recently from 45d07db to 8d0a98e Compare November 10, 2020 16:38
@mqudsi
Copy link
Contributor Author

mqudsi commented Nov 10, 2020

The CI is failing on Linux/Clang and macOS. I've tested both Linux/Clang and macOS/gcc and can't reproduce, they run fine for me. Are those environments using a different PCRE2 or perhaps the system PCRE2?

@faho
Copy link
Member

faho commented Nov 10, 2020

Are those environments using a different PCRE2 or perhaps the system PCRE2?

The second Travis build uses the vendored PCRE2, the rest, best as I can tell, uses system PCRE2. Presumably the system pcre should be the same version across the linux builds at least, so I don't think that's the issue (as there are two successes and one failure with it)

@krobelus
Copy link
Member

krobelus commented Nov 11, 2020

What do we do when an illegal variable name is specified, e.g. version or fish_pid?

Right this currently allows overriding $version!
A fairly simple solution would be to use ENV_USER on vars.set and reuse handle_env_return.
That would suggest refactoring the logic to set each captured variable at most once per string match.
(We could also try to report the error earlier, before matching anything but that's a detail.)

I did not think of doing $1, $2, etc. for unnamed groups

On a second thought, $1 and so on are not necessary at all because we already have this:

set matches (foo | string match -rq "(?<word1>hello) (?<word2>world)")
set word1 $matches[1]
set word2 $matches[2]

The scoping issue is tricky. Probably the current default does the right thing most of the time:
function-local inside functions, otherwise global.

my "ideal scope" for an interactive scripting language like a shell would be "for writes, universal and global variables are never modified implicitly, and default scope is local unless a variable is previously defined within the function; for reads, all scopes are considered."

Agreed, I would also prefer explicit writes for universals and globals - when inside a function. Outside functions, writing globals by default seems fine. Python does this well IMO.

Replace with a class-local `enum class` instance.
The new commandline switch `string match --regex --import` will import
as fish variables any named capture groups with the matched captures as
the value(s).
* Optimize regex variable imports by scanning the name table only once
  per regex.
* Unify the resulting variables in case of unmatching regex both with
  and without `--all` (always empty in case of no matches).
@mqudsi
Copy link
Contributor Author

mqudsi commented Nov 12, 2020

The little-check tests do not report stderr? It turns out I was running into an edge case of maybe_t, holding on to the (still extant) shared_ptr value after the temporary maybe_t was dropped, which triggered a fatal ASAN exception.

I know I've seen ASAN errors reported in the CI previously, but it seems they're silently dropped when little-check is executing.

I've pushed out a refactored version of the code that should address the performance concerns I had as well as fix the divergent behavior of importing variables with and without --all in case of no match. The code is in place to handle and bubble up failed variable name validation, but the validation itself is still not done; the patch still requires an explicit --import though it seems that we have consensus on dropping that so I'll include that in the next update.

set foo foo
echo "foo" | string match -rqI -- '^(?<foo>bar)$'
echo $foo
# CHECK:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect string match to not set any variable when nothing was matched.
I don't know in which situations that would make a difference.

@krobelus
Copy link
Member

The little-check tests do not report stderr

They should, I get an error if I run int foo[2]; return foo[2];

fish-shell/tests > HOME=$PWD ../build-asan/littlecheck.py -sfish=../build-asan/fish checks/regex-import.fish
Failure in checks/regex-import.fish:

  There were no remaining checks left to match stderr:1:
    ../src/fish.cpp:409:12: runtime error: index 2 out of bounds for type 'int [2]'

  when running command:
    ../build-asan/fish checks/regex-import.fish

@mqudsi
Copy link
Contributor Author

mqudsi commented Nov 13, 2020

Hmmm. What if little check errors out on an stdout mismatch first, would it also report any "incidental" stderr output that was generated simultaneously?

@faho
Copy link
Member

faho commented Nov 13, 2020

Let's test!

#RUN: %fish %s

echo foo
# CHECK: foo

echo stderr >&2
# CHECKERR: stderr
echo stderr2 >&2

echo right
# CHECK: wrong

echo stderr3 >&2

Run build_tools/littlecheck.py -s fish=/usr/bin/fish test.fish, and the output is:

Failure in test.fish:

  The CHECK on line 11 wants:
    wrong

  which failed to match line stdout:2:
    right

  additional output on stderr:2:
    stderr2

  Context:
    foo
    right <= does not match 'wrong'
    
  when running command:
    /usr/bin/fish test.fish

So it tells you that there is output on stderr, but only gives you the first line of it.

You can also see this on Travis:

additional output on stderr:1:

    =================================================================

(and yes, asan's first line of output is entirely useless)

We could probably let it show more (all?) of it, or we write it to a logfile and let you check it?

faho added a commit that referenced this pull request Nov 14, 2020
In #7459, asan printed error output. However, because we had a failure
on stdout already, littlecheck would only print the first unmatched
line from stderr, leading to output like

```
additional output on stderr:1:

    =================================================================
```

Which is of course entirely useless.

So in that case we just let it print *all* unmatched stderr lines, so
you'd get the full asan output, which presumably is of more use.

This upgrades littlecheck to 5f7deafcea4e58dd3d369eae069a3781bb6ce75e.
Block import of read-only variables.
Add tests verifying read-only variables cannot be overwritten by regex
import.
@mqudsi
Copy link
Contributor Author

mqudsi commented Nov 25, 2020

Thanks for clarifying the behavior, @faho. I think the problem is the use of line-based string comparison rather than diffing, which means there could (theoretically) be thousands of lines waiting to be matched you wouldn't want spewed to the console just because one stderr line failed to match, whereas with a diff-based approach you could be more certain that all the mismatch is relevant. The little-check diff patches will fix this, right?

I pushed out the (final?) code updates, only documentation and changelog updates remaining I believe.

@krobelus
Copy link
Member

I pushed out the (final?) code updates, only documentation and changelog updates remaining I believe.

Looks good, I don't think there's another blocker left.

If multiple arguments are specified, a named capture will be extracted from the first argument that matches.

@mqudsi mqudsi closed this Nov 26, 2020
@mqudsi mqudsi added this to the fish 3.2.0 milestone Nov 28, 2020
faho added a commit that referenced this pull request Dec 6, 2020
Since `string match` now creates variables, wrapping `string`
necessarily breaks things, so we need to disallow it.

See #7459, #7509.
@faho
Copy link
Member

faho commented Dec 6, 2020

Note: string has been made a reserved word in aa89564, which means it can no longer be used as a function name.

That's because it can't be wrapped correctly anymore after this PR - even with --no-scope-shadowing (and people thinking of using that is a stretch to begin with) it would necessarily break setting argv.

This is in line with set, read and argparse.

@zanchey
Copy link
Member

zanchey commented Jan 7, 2021

The only thing I'm struggling with here is the description of the variables being "imported". To me that sounds like the value of the variable is brought into the match, rather than the results of the match being set externally. Is there a better way of phrasing it? Maybe as "string match automatically sets the results all named capturing groups ((?<name>expression)) as the values of fish variables of the same name"?

@krobelus
Copy link
Member

krobelus commented Jan 7, 2021

"import" sort of makes sense as opposite of "exported" environment variables.
It's still weird because "exported" is used to talk about environment variables, but we import plain shell variables.

So I'm in favor of rebranding, maybe something like this (for string-match.rst and the changelog):

string match integrates with PCRE2 named capture groups: when a regex like (?<name>expression) matches, then the variable name will be set to the substrings that matched expression.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 6, 2021
@mqudsi mqudsi deleted the regex_import branch September 26, 2022 18:09
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants