
Want variable string manipulation #156

Closed
lilyball opened this Issue Jun 19, 2012 · 96 comments

@lilyball
Contributor

commented Jun 19, 2012

I'd really like to see a way to do basic string manipulation of variables, similar to how bash provides substring offsets, pruning of prefix/suffix patterns, and even regular expressions. Having to spawn a separate program just to dice up one string is rather slow and awkward.
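
To make the pain concrete, here is roughly what those bash one-liners cost in fish today (a rough sketch, not exact equivalents; the file and variable names are made up):

    # bash: ${file%.txt}  -> in fish, spawn sed
    set file report.txt
    set base (echo $file | sed 's/\.txt$//')

    # bash: ${var:0:5} (substring offset)  -> in fish, spawn cut
    set var 'hello world'
    set head (echo $var | cut -c 1-5)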

@xiaq

Contributor

commented Dec 14, 2012

+1 for this. There seems to be no way even to test whether a string matches a wildcard except by using switch ... case. Tcl's string builtin (doc) might be a useful reference.
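
The workaround I mean looks roughly like this (a sketch; the variable and patterns are arbitrary):

    set path notes.txt
    switch $path
        case '*.txt'
            echo "matches"
        case '*'
            echo "no match"
    end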

@MaskRay

commented Dec 14, 2012

+1. test(1) in coreutils cannot handle globs or regexes. Fish needs some string manipulation functions.

@merlinthered

commented Mar 21, 2013

Yes, please implement regular expression matching! It would make working with strings so much easier! Right now you have to use grep, which is kind of awkward for simple use cases like "does this string start with a number?".
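
E.g. the "starts with a number" check currently ends up as something like this (a sketch; grep -q just suppresses output):

    set str 2fast
    if printf '%s\n' $str | grep -q '^[0-9]'
        echo "starts with a number"
    end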

@fkeet

commented Jul 25, 2013

+1

@leafonsword

Contributor

commented Jul 26, 2013

+1

@ridiculousfish

Member

commented Jul 26, 2013

The main challenge here is finding a nice syntax that is compatible with fish's command-argument format. It would be nice if we could import a command set wholesale from some other language (Tcl? Lua? Windows PowerShell?). Suggestions are very welcome.

@maxfl

Contributor

commented Jul 26, 2013

A possible solution is to implement something like a member-access operator (like ':' in Lua). It could be just syntactic sugar, so that
'some string'.split(args)
calls
(split 'some string' args).

Some thoughts about it:

  1. It's simple and allows using not only builtin string manipulation functions but also dozens of installed applications.
  2. It can work with strings, variables, arrays, and command output:
    two\ words.split()
    $variable[1].split()
    $array.split() # $array[1].split() $array[2].split() ...
    (ls).join().match('pattern')
  3. The split command would also be accessible as a builtin:
    split 'some long string' | grep long

The problem, again, is to find some nice character that will not be ambiguous with the current syntax.

@xfix

Member

commented Jul 26, 2013

echo 'some string' | fish_split ' ' # or cleave, but...

Something like that, perhaps (also, perhaps involving <<< instead of piping echo, but I don't think that matters much).

Also, 'some string'.split(args) doesn't look like shell language at all. Besides, while this case is mostly unambiguous, what would you do about (some program).split? I'm sure that .split could be a file extension, or something.

@maxfl

Contributor

commented Jul 26, 2013

@glitchmr '.' was used only as an example and cannot be used in real life. I've pointed out that some other character(s) should be used to avoid ambiguity.

I like member-access-like syntax because it allows modifying the command line without moving the cursor back and forth to insert a brace:
somecommand 'some very long string'
is much easier to modify to
somecommand 'some very long string'.split()
or
somecommand 'some very long string'@split()
than to
somecommand (echo 'some very long string' | split)

@wwwjfy

Contributor

commented Jul 26, 2013

I think that just makes things complicated. I agree more with the Tcl-like proposal.
The pipe and the new syntax can be avoided, because string is just a builtin function.

@maxfl

Contributor

commented Jul 26, 2013

The behavior I've described can also be achieved with a special function 'apply':
somecommand 'some very long string'(apply split)
causing 'apply' to take the string just before the command substitution and pass it as an argument to split, as if one had called:
somecommand (split 'some very long string')

In this approach 'apply' can be aware of arrays and can call more complex commands by choosing the place where the argument is put:
somecommand 'some very long string'(apply split \$argv)
somecommand $array(apply -a join \$argv) # to swallow the whole array.

@xfix

Member

commented Jul 26, 2013

apply is an awful hack that simply couldn't work (it does stuff that user-generated functions simply cannot do). If we do that, why not go with something like (builtin join ":" $PATH)?

@maxfl

Contributor

commented Jul 26, 2013

@glitchmr, because this way doesn't require moving the cursor back and forth to insert brackets, pipes and other stuff, and it allows simple token manipulation: you only need to add code on one side of a token, not wrap it.
I personally find somecommand (echo $something | split) painful.

apply, if it's added, would not be a user-generated function but a builtin, and it's no more of a hack than the and, or and not builtins, which are also parsed in a special way.

@xfix

Member

commented Jul 26, 2013

Other than and parsing stuff like begin or if, I don't see it being parsed specially. It just makes the thing after it a command. However, your proposed apply would read the thing before the command itself.

@maxfl

Contributor

commented Jul 26, 2013

'and' and commands like it affect the color of the next token (compare 'and echo' and 'sudo echo'); that is done internally when parsing the command line. In some sense it's a hack, but a useful one.
And what is the problem with reading the token before the command substitution if it simplifies scripting while keeping the syntax clean?

@wwwjfy

Contributor

commented Jul 26, 2013

@maxfl
If it isn't necessary to hack, don't. In every language/syntax, and/or/not is a must.
And it takes extra steps to tweak code in every language; if you want to make inserting brackets faster, you really need to find a plugin for your editor.

And IMO, the .split or @split syntax is not compatible with the existing syntax style; it just doesn't fit. (No offense, but I think the (apply ...) style is ugly; no language uses it AFAIK, and I'd be glad to learn if there is one.)

Least surprise is a major principle here. If it's not a must, I think we shouldn't introduce any new syntax or variables, or even functions.

It's pointless to argue here. I think we should just propose candidates, and I believe the authors will consider them well and come up with a good implementation. Or just implement our own plans and send pull requests.

@maxfl

Contributor

commented Jul 26, 2013

@wwwjfy, that's fair. Every language has to find some balance between keeping only 'a must' and usability: adding string manipulation is not 'a must' at all; you can still use the current syntax and a set of system commands.

I'm not insisting on my proposals, but I hope someone will find something better than the ugly and painful somecommand (echo $something | split), which perfectly fits the current syntax.

@xixixao

commented Dec 20, 2013

+1 - I can't even tell if an argument starts with a dash without using grep...

@lloeki

commented Mar 4, 2014

What about new keywords? Other shells have a neat alternative we can take inspiration from that sets a variable to a new, transformed value, such as printf -v var_name ....

set-ting a var is already a keyword and there's already contains to find words in an array, so that's consistent. So we could have, say, pattern:

pattern [-v var_name] OPTIONS PATTERN VALUE [REPLACEMENT]

The -v flag is of utmost importance so that the result doesn't have to be captured with command substitution.

Without -v: returns 0 if it matches, 1 otherwise.
With -v: sets var_name and replaces the match with REPLACEMENT. Returns 1 if no replacement is made.

Matches would work like globbing wildcards for the sake of consistency. The only requirements are to specify an anchor (--start, --end, default: anywhere) and an eagerness (--longest, i.e. eager; default: lazier, i.e. shortest), working like bash variable expansion with /, //, %, %%, # and ##.

So that's it: no contrived variable expansion, since @ridiculousfish seems not to like that, but consistency with the rest of the language and simplicity, leading to least surprise, obviousness and discoverability. Also, no dependencies on an external system.

Alternatively the dual set/test functionality of pattern could be embedded in set and test builtins themselves, or split into match and replace builtins.

As for full regexes, I'd make them stand apart in a regex keyword (with the same -v flag), possibly leveraging a well-known implementation such as PCRE or Oniguruma.
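
To make that concrete, hypothetical invocations (nothing here exists yet; pattern, regex and all the flags are just the names proposed above, and $file/$input stand for any variables) could look like:

    # test only: does $file end in .txt? exit status 0 or 1
    pattern --end '*.txt' $file

    # replace and assign: strip a trailing .txt, storing the result in base
    pattern -v base --end '.txt' $file ''

    # full regexes would live in a separate keyword
    regex -v digits '[0-9]+' $input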

@lloeki

commented Mar 4, 2014

I wasn't being clear in the above comment, so I've made it more complete.

@ridiculousfish

Member

commented Mar 4, 2014

@lloeki Very interesting ideas. I like the idea of combining match with replace. It reminds me of list comprehensions from Python.

What's the advantage of passing the variable name instead of using command substitutions?

set -l var (pattern foo $var bar)

vs

pattern -v var foo $var bar

It doesn't seem to be a huge difference, but maybe I'm missing something.

I'm not opposed to passing the name of a variable to set, but one of the downsides is that you have to duplicate some of the options of set, to specify the variable scope. See for example how read duplicates set options like -l, -g, -U.
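
For example (nothing new here, just to illustrate the duplication):

    # read has to restate the scope itself, since it does the setting
    echo value | read -l local_var      # local, like set -l
    echo value | read -g global_var     # global, like set -g

    # with command substitution, the scope stays on set
    set -l local_var (echo value)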

@lloeki

commented Mar 4, 2014

@ridiculousfish Good point regarding -l, -g, -U.

I don't know how you handle this internally, but to me the most important point would be performance. Intuitively, it seems there is an intermediate step (pseudo-fork? result copying? string allocation?) that, when skipped, could give more leeway for optimisation.

Maybe I'm missing something and you're doing some magic that makes them equivalent, or I'm too used to seeing $(basename $(dirname $foo)) in bash/zsh kill performance.

A second point would be symmetry, but I'm on the fence: neither pattern nor pattern -v would output anything to be captured, similarly to contains. Indeed, if we want to use set -l var (pattern foo $var bar), then pattern -v var foo $var bar outputting nothing might feel weird, especially if one wants to use if pattern -v var foo $var bar to assign and branch in one quick swoop. Come to think of it, a possible alternative would be:

  • pattern foo $var outputs nothing and returns 0|1
  • pattern foo $var bar outputs the replaced value and returns 0|1
  • pattern -v var foo $var bar outputs nothing and returns 0|1

A bit off topic, but interestingly the first one might make for a replacement for bash's wildcard matching with =, depending on the implementation of wildcard matching in pattern. Still, I think this last one is worthwhile to have in test, for consistency with switch/case.

Well, that's some food for thought.

@mkaito

Contributor

commented Apr 6, 2014

Is there actually a way to do something like ${file%.*} in fish, right now?

@lilyball

Contributor Author

commented Apr 6, 2014

@mkaito AFAIK, not without shelling out to another program (e.g. sed or even bash).
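
E.g. something along these lines (a sketch; sed dialects may vary slightly):

    # rough stand-in for bash's ${file%.*}: strip the last extension
    set file archive.tar.gz
    set stem (echo $file | sed 's/\.[^.]*$//')
    echo $stem    # archive.tar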

@mkaito

Contributor

commented Apr 6, 2014

Alright, thanks. On a related note, where should I be asking this kind of question, so as not to pollute the issue tracker?

@lilyball

Contributor Author

commented Apr 6, 2014

@mkaito The fish-users ML would be a good place.

@ridiculousfish

Member

commented Apr 6, 2014

It's also fine to post to the issues list here.

@mamiu

commented Aug 30, 2014

If the topic is still active, when will the idea suggested by @lloeki be implemented? IMHO string manipulation is really important for scripting.

@ridiculousfish

Member

commented Jul 21, 2015

@lloeki Relying on arity is elegant but tricky, because the user may pass in an empty variable. So it would have to distinguish explicitly zero arguments from an empty argument list after expansion.

A separate possibility is for string to check whether its stdin is from a pipe, and read from stdin if so. The read builtin does this: it reads non-interactively if it's on the receiving end of a pipe. This is very simple to do.
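
For example, read already behaves this way (existing behavior, just to illustrate):

    # read consumes the pipe non-interactively when stdin is not a tty
    echo "one two" | read -l first second
    echo $first     # one
    echo $second    # two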

@msteed fish generally supports --, and it's hard to remember to use it; this is one of the longstanding warts of Unix (rm * and the like). But it's especially important with string since string will get a lot of non-literal input in its argv (if you have literal input, you don't need to join or split it).

Another possible future direction is an "argument pipe", basically xargs:

echo 1\n2\n3 $| string --join :

which would be syntactic sugar for

string --join : (echo 1\n2\n3)

sort of like Haskell's $ operator but in reverse.

@lloeki

commented Jul 22, 2015

@lloeki Relying on arity is elegant but tricky, because the user may pass in an empty variable. So it would have to distinguish explicitly zero arguments from an empty argument list after expansion.

Tricky? I think fish can manage this :-)

#!/bin/bash

function foo() {
    [[ "$#" -eq 0 ]] && echo "count zero"
    [[ -z "$1" ]] && echo "length zero"
}

echo 1
foo "42"
echo 2
foo ""
echo 3
foo
@ridiculousfish

Member

commented Jul 22, 2015

@lloeki Ah, but now try:

> foo $nope
count zero
length zero

bash cannot distinguish between a literal-no-arguments and an after-expansion-no-arguments.

@lloeki

commented Jul 22, 2015

foo "$nope" works though, and this is what one is supposed to write in bash (and is duly notified by the awesome shellcheck).

I went to bash because that's what I had right now, but I thought fish was taking care of this and that $foo and "$foo" were supposed to be the same. So you made me install fish on this machine (how sneaky of you! ;-))

#!/usr/bin/env fish

function foo
    test (count $argv) -eq 0; and echo "count zero"
    test -z $argv[1]; and echo "length zero"
end

echo 1
foo "42"
echo 2
foo ""
echo 3
foo
echo 4
foo $nope
echo 5
foo "$nope"
set yep ""
echo 6
foo $yep
echo 7
foo "$yep"

To my surprise, fish behaves like bash on test case 4. So the difference between bash and fish is only with test case 6, whereas I expected it to be different on test case 4 too. My bad. Since I only write bash scripts with set -u, I have the habit of only dealing with set variables, which muddled my thoughts on the issue, sorry (BTW, does fish have something like set -u?)

@msteed

Contributor

commented Jul 22, 2015

I am inclined to make string read its arguments from stdin if stdin is not a tty. That avoids the problem where string hangs due to an accidentally empty argv, and it is less cumbersome than requiring a switch.

Taking arguments from stdin helps a little with the ambiguity around -, since everything from stdin can be treated as a non-option argument. Until there is a more general solution to that problem, it seems like -- is the best we can do.
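
For example, with the proposed match subcommand, -- would let callers safely test data that itself begins with a dash (a sketch of the intended usage):

    set arg -r
    string match -q -- '-*' $arg
    and echo "argument looks like an option"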

@msteed

Contributor

commented Jul 23, 2015

Here is an update on the proposed syntax. Detailed docs including exit codes and reading strings from stdin at https://github.com/msteed/fish-shell/blob/string-builtin/doc_src/string.txt.

 string length [(-q | --quiet)] [--] [STRING...]
 string sub [(-s | --start) START] [(-l | --length) LENGTH] [(-q | --quiet)] [--] [STRING...]
 string split [(-m | --max) MAX] [(-r | --right)] [(-q | --quiet)] [--]
              SEP [STRING...]
 string join [(-q | --quiet)] [--] SEP [STRING...]
 string trim [(-l | --left)] [(-r | --right)] [(-c | --chars) CHARS]
             [(-q | --quiet)] [--] [STRING...]
 string escape [--] [STRING...]
 string match [(-m | --max) MAX] [(-i | --ignore-case)] [(-r | --regex)]
              [(-n | --index)] [(-q | --quiet)] [--] PATTERN [STRING...]
 string replace [(-m | --max) MAX] [(-i | --ignore-case)] [(-r | --regex)]
                [(-q | --quiet)] [--] PATTERN REPLACEMENT [STRING...]
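
For illustration, a few example invocations of the proposed subcommands (sketches of the intended behavior, not final output):

    string length 'hello'                         # 5
    string split , 'a,b,c'                        # a, b, c on separate lines
    string join : a b c                           # a:b:c
    string trim '  padded  '                      # padded
    string match -r '^[0-9]+' '42nd'              # 42
    string replace -r '\.txt$' '.md' notes.txt    # notes.md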

I appreciate any feedback.

@msteed

Contributor

commented Jul 25, 2015

@ridiculousfish, @lloeki, @kballard
Notes on regex engines:

  • re2
    • easy C++ api
    • installed size 502KB
    • supports Perl-style syntax with some features omitted
    • operates only on UTF-8 or Latin1 strings
    • BSD license
  • pcre or pcre2
    • full-featured and widely used
    • under steady development since 1997 (new development only in pcre2)
    • installed size 3.3MB (pcre) or 2.8MB (pcre2)
    • has interfaces for UTF-8, UTF-16, and UTF-32 strings
    • pcre and pcre2 APIs differ significantly
    • BSD license
  • ICU
    • full-featured and proven but big and (can be) slow
    • installed size 33.9MB (includes more than regex)
    • operates internally on UTF-16 strings but can convert to & from UTF-32
    • custom license, BSD-like
  • Oniguruma
    • deals with many encodings including UTF-16 and UTF-32
    • used in PHP for multi-byte regex
    • sporadic updates
    • installed size 543KB
    • BSD license
  • Onigmo
    • based on the Oniguruma engine
    • focus seems to be on Ruby support (used in Ruby 2.0)
    • less widely distributed than the others (e.g., I could not find an Arch Linux package)
    • BSD license
  • boost::regex
    • relies on ICU for full Unicode support
    • easy to move to C++-11
    • custom license, BSD-like

All the libraries except re2 deal with catastrophic backtracking by letting the caller limit memory usage; re2 omits the features that can lead to catastrophic backtracking.

The pcre, Oniguruma, and Onigmo engines offer the best wchar_t support, and re2 offers none.

The advantage that boost offers is the mostly C++-11-compatible API. If we want to use the full-featured ICU engine, boost is the way to go.

If size is a concern, Onigmo or Oniguruma are the clear winners, followed by pcre. (I am not considering re2 a contender because of the lack of wchar_t support.)

License is not really a differentiator.

In my opinion, pcre2 offers the best balance of Unicode friendliness, rich features, wide availability, and long-term viability.

@zanchey

Member

commented Jul 26, 2015

The only problem I can see with pcre2 is that it is so new that it's not available on many of the platforms we support - RHEL/CentOS 5/6/7, Debian 6/7/8, Fedora 21/22, openSUSE 13.1/13.2, or OS X (sans Homebrew or other third-party packages). Mind you, OS X doesn't have any of the options listed, so fish will need to vendor the source for at least some configurations.

pcre2 is available in Homebrew and Fedora 23.

@lloeki

commented Jul 26, 2015

My understanding was that things were leaning more towards vendoring the dependency so that it gets built alongside fish (whether by source or by way of git submodules, I don't venture to say).

@ridiculousfish

Member

commented Jul 26, 2015

What does "vendoring the dependency mean?" I had assumed we would integrate the source directly but maybe that was baseless.

@lloeki

commented Jul 26, 2015

What does "vendoring the dependency mean?"

Include (a version of) the third-party dependency's source code into the SCM alongside your own, typically in a dedicated directory. I don't know if the name comes from there but Rails conventionally uses a vendor directory for that, which was used to store non-app assets and gems before the advent of rubygems.org and bundler. Golang outright pushes for that and numerous debates have been had about the merits and drawbacks.

@msteed

Contributor

commented Jul 31, 2015

I am almost ready to start on the regex functionality. Any objection to using pcre2 then?

@ridiculousfish

Member

commented Jul 31, 2015

Go for it. We should measure the binary size before and after though and make sure that we're getting stripping and dead code elimination like we expect.

@msteed

Contributor

commented Aug 2, 2015

After adding the pcre2 calls to do regex string matching, plus the code for the other string subcommands, the increase in the size of the fish binary is 178KB (x86/64). Is that okay?

@ridiculousfish

Member

commented Aug 2, 2015

Yes, seems OK. I and others can put it on a diet 😀 Thank you for measuring!

@msteed

Contributor

commented Aug 4, 2015

Does fish make any assumptions about sizeof(wchar_t)? I'm looking at whether I need to make the pcre2 calls work with both 16-bit and 32-bit wide characters.

@ridiculousfish

Member

commented Aug 4, 2015

fish assumes wchar_t is 32 bits (*nix) or 16 bits (Windows).

@msteed

Contributor

commented Aug 8, 2015

Regex string match is working, leaving only regex string replace to be done.

A couple of outstanding questions:

  1. What should string split do with an empty separator? Possibilities: do no splitting, split into individual characters, or fail. Currently it does no splitting, but I'm inclined to change it to split into individual characters.
  2. In configure.ac there is a check for regex.h but that header is not used anywhere. Should I remove the check?
@ridiculousfish

Member

commented Aug 9, 2015

  1. Split to all individual characters is right
  2. Yes, thanks!
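
I.e., sketching the intended behavior:

    string split '' abc
    # a
    # b
    # c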

@faho faho modified the milestones: next-2.x, fish-future Oct 4, 2015

@faho

Member

commented Oct 4, 2015

This has been fixed via #2296, no?

@faho faho closed this Oct 4, 2015

@ridiculousfish

Member

commented Oct 4, 2015

woohoo!
