New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

git completions: Some alias names don't map to variable names #4147

Closed
faho opened this Issue Jun 20, 2017 · 18 comments

Comments

Projects
None yet
2 participants
@faho
Member

faho commented Jun 20, 2017

As reported on gitter by @zx8, git aliases can contain characters that variables can't, which causes the git completions to spew errors.

This requires us to encode the alias name. The easiest thing that I came up with is set -l escaped_alias (printf '%02X' "'"(string split '' -- $alias)), which hex-encodes the alias characters.

@faho faho added this to the fish 2.7.0 milestone Jun 20, 2017

@faho faho self-assigned this Jun 20, 2017

@krader1961

This comment has been minimized.

Show comment
Hide comment
@krader1961

krader1961 Jun 20, 2017

Contributor

FWIW, As I noted in issue #4048 to store each abbreviation as a separate variable requires the same sort of encoding/decoding to and from a valid variable name. I think we can do better than hex encoding every character. Since most of the characters will be alphanumeric it would be better to leave those alone for readability. Also, using %02X won't work if it is a non-ASCII character. Whether this can be done using fish script or will need to be done as new string subcommands is TBD.

Contributor

krader1961 commented Jun 20, 2017

FWIW, As I noted in issue #4048 to store each abbreviation as a separate variable requires the same sort of encoding/decoding to and from a valid variable name. I think we can do better than hex encoding every character. Since most of the characters will be alphanumeric it would be better to leave those alone for readability. Also, using %02X won't work if it is a non-ASCII character. Whether this can be done using fish script or will need to be done as new string subcommands is TBD.

@faho

This comment has been minimized.

Show comment
Hide comment
@faho

faho Jun 20, 2017

Member

Also, using %02X won't work if it is a non-ASCII character.

It actually works - printf '%02X' \'(string split '' aöü﷽a) prints "61F6FCFDFD61".

I've taken this from __fish_urlencode - which would basically work, if it didn't include "%" everwhere. That sets LC_ALL to C, which I so far haven't actually found a need for.

Member

faho commented Jun 20, 2017

Also, using %02X won't work if it is a non-ASCII character.

It actually works - printf '%02X' \'(string split '' aöü﷽a) prints "61F6FCFDFD61".

I've taken this from __fish_urlencode - which would basically work, if it didn't include "%" everwhere. That sets LC_ALL to C, which I so far haven't actually found a need for.

@krader1961

This comment has been minimized.

Show comment
Hide comment
@krader1961

krader1961 Jun 20, 2017

Contributor

The problem with that approach is it's encoding the individual bytes of the wide char and is thus ambiguous. Try this:

printf '%X' \'(string split '' a\u6161a)`
printf '%X' \'(string split '' aaaa)

Notice that they produce the same output. There's a reason __fish_urlencode sets the locale to C. It forces the chars to be encoded as UTF-8 so there isn't any ambiguity and no collisions.

Also, what is with the escaped single quote? Without it printf complains, as expected, that each letter is an invalid argument for %X. But I don't see where it is documented that a leading single quote causes the letter to be converted to its integer equivalent. Even if it is documented it is way too magical for my taste.

Contributor

krader1961 commented Jun 20, 2017

The problem with that approach is it's encoding the individual bytes of the wide char and is thus ambiguous. Try this:

printf '%X' \'(string split '' a\u6161a)`
printf '%X' \'(string split '' aaaa)

Notice that they produce the same output. There's a reason __fish_urlencode sets the locale to C. It forces the chars to be encoded as UTF-8 so there isn't any ambiguity and no collisions.

Also, what is with the escaped single quote? Without it printf complains, as expected, that each letter is an invalid argument for %X. But I don't see where it is documented that a leading single quote causes the letter to be converted to its integer equivalent. Even if it is documented it is way too magical for my taste.

@krader1961

This comment has been minimized.

Show comment
Hide comment
@krader1961

krader1961 Jun 20, 2017

Contributor

Compare my previous example with this version. Put the following in a file and source it:

echo -n a\u6161a | env LC_ALL=C fish -c 'read -z x; echo $x; printf "%02X" \\\'(string split \'\' $x)'
echo
echo -n aaaa | env LC_ALL=C fish -c 'read -z x; echo $x; printf "%02X" \\\'(string split \'\' $x)'
echo
Contributor

krader1961 commented Jun 20, 2017

Compare my previous example with this version. Put the following in a file and source it:

echo -n a\u6161a | env LC_ALL=C fish -c 'read -z x; echo $x; printf "%02X" \\\'(string split \'\' $x)'
echo
echo -n aaaa | env LC_ALL=C fish -c 'read -z x; echo $x; printf "%02X" \\\'(string split \'\' $x)'
echo
@faho

This comment has been minimized.

Show comment
Hide comment
@faho

faho Jun 20, 2017

Member

The problem with that approach is it's encoding the individual bytes of the wide char and is thus ambiguous. Try this:

Huh? I'm afraid I'm missing something here - \u6161 encodes as e685a1 in UTF-8, and any byte in UTF-8 cannot be confused with a single-byte codepoint simply because it would have the continuation-bits set. (Of course this would be different with other encodings)

I mean I'm seeing the same result, I just don't get why.

Of course I don't have to understand this precisely to implement it since setting LC_ALL works, but I'd still like to know.

Also, what is with the escaped single quote? Without it printf complains, as expected, that each letter is an invalid argument for %X. But I don't see where it is documented that a leading single quote causes the letter to be converted to its integer equivalent. Even if it is documented it is way too magical for my taste.

I'll have to check in the PR that introduced it in __fish_urlencode, but it's not a feature specific to our printf - my gnu printf does it as well. It seems to be a feature of the printf library function (not that I could find it in the docs for that).

Member

faho commented Jun 20, 2017

The problem with that approach is it's encoding the individual bytes of the wide char and is thus ambiguous. Try this:

Huh? I'm afraid I'm missing something here - \u6161 encodes as e685a1 in UTF-8, and any byte in UTF-8 cannot be confused with a single-byte codepoint simply because it would have the continuation-bits set. (Of course this would be different with other encodings)

I mean I'm seeing the same result, I just don't get why.

Of course I don't have to understand this precisely to implement it since setting LC_ALL works, but I'd still like to know.

Also, what is with the escaped single quote? Without it printf complains, as expected, that each letter is an invalid argument for %X. But I don't see where it is documented that a leading single quote causes the letter to be converted to its integer equivalent. Even if it is documented it is way too magical for my taste.

I'll have to check in the PR that introduced it in __fish_urlencode, but it's not a feature specific to our printf - my gnu printf does it as well. It seems to be a feature of the printf library function (not that I could find it in the docs for that).

@faho

This comment has been minimized.

Show comment
Hide comment
@faho

faho Jun 20, 2017

Member

Okay, see https://www.gnu.org/software/coreutils/manual/html_node/printf-invocation.html:

If the leading character of a numeric argument is ‘"’ or ‘'’ then its value is the numeric value of the immediately following character. Any remaining characters are silently ignored if the POSIXLY_CORRECT environment variable is set; otherwise, a warning is printed. For example, ‘printf "%d" "'a"’ outputs ‘97’ on hosts that use the ASCII character set, since ‘a’ has the numeric value 97 in ASCII.

Member

faho commented Jun 20, 2017

Okay, see https://www.gnu.org/software/coreutils/manual/html_node/printf-invocation.html:

If the leading character of a numeric argument is ‘"’ or ‘'’ then its value is the numeric value of the immediately following character. Any remaining characters are silently ignored if the POSIXLY_CORRECT environment variable is set; otherwise, a warning is printed. For example, ‘printf "%d" "'a"’ outputs ‘97’ on hosts that use the ASCII character set, since ‘a’ has the numeric value 97 in ASCII.

@krader1961

This comment has been minimized.

Show comment
Hide comment
@krader1961

krader1961 Jun 20, 2017

Contributor

I shouldn't have said "it's encoding the individual bytes of the wide char". It's encoding each wide char as an int. Since \u6161 is equivalent to \x6161 the %02X is actually treated as if it was %4X since the format specifies the minimum number of digits to print. Since the value requires more digits it prints the additional digits. Compare with this:

$ printf '%02X' \'(string split '' a\u0345a)`
6134561⏎

Notice the odd number of hex digits.

Contributor

krader1961 commented Jun 20, 2017

I shouldn't have said "it's encoding the individual bytes of the wide char". It's encoding each wide char as an int. Since \u6161 is equivalent to \x6161 the %02X is actually treated as if it was %4X since the format specifies the minimum number of digits to print. Since the value requires more digits it prints the additional digits. Compare with this:

$ printf '%02X' \'(string split '' a\u0345a)`
6134561⏎

Notice the odd number of hex digits.

@faho

This comment has been minimized.

Show comment
Hide comment
@faho

faho Jun 20, 2017

Member

Since \u6161 is equivalent to \x6161 the %02X is actually treated as if it was %4X since the format specifies the minimum number of digits to print.

That clears that up, thanks!

Soo... this should work if we just set -lx LC_ALL C for the conversion, right?

One more thing:

I think we can do better than hex encoding every character. Since most of the characters will be alphanumeric it would be better to leave those alone for readability.

How exactly do we check that? Currently I'm using string match -qr '\W' to return false for characters that can appear in a variable name, and it seems to be okay for git aliases specifically (since the rules on those seem rather strict - e.g. 'würst' doesn't work). We can't use something like [a-zA-Z0-9_] because e.g. "würst" is a valid variable name, so I'm assuming other "normal" characters work as well, and I don't think we can enumerate them all. Conveniently, '\W' seems to include "-" but not "_".

Member

faho commented Jun 20, 2017

Since \u6161 is equivalent to \x6161 the %02X is actually treated as if it was %4X since the format specifies the minimum number of digits to print.

That clears that up, thanks!

Soo... this should work if we just set -lx LC_ALL C for the conversion, right?

One more thing:

I think we can do better than hex encoding every character. Since most of the characters will be alphanumeric it would be better to leave those alone for readability.

How exactly do we check that? Currently I'm using string match -qr '\W' to return false for characters that can appear in a variable name, and it seems to be okay for git aliases specifically (since the rules on those seem rather strict - e.g. 'würst' doesn't work). We can't use something like [a-zA-Z0-9_] because e.g. "würst" is a valid variable name, so I'm assuming other "normal" characters work as well, and I don't think we can enumerate them all. Conveniently, '\W' seems to include "-" but not "_".

@krader1961

This comment has been minimized.

Show comment
Hide comment
@krader1961

krader1961 Jun 20, 2017

Contributor

Try this function:

function encode_to_var
    set -l orig_str "$argv"
    set -l new_str
    for letter in (string split '' $orig_str)
        if string match -qr '[\w\d]' -- $letter
            if test $letter = '_'
                set new_str {$new_str}__
            else
                set new_str "$new_str$letter"
            end
        else
            echo -n $letter | begin
                set -lx LC_ALL C
                read -lz utf8
                # string split '' -- $utf8 | od -tx1x >&2
                printf '_%02X' \'(string split '' -- $utf8)
            end | read letter
            set new_str {$new_str}$letter
        end
    end
    echo $new_str
end

For example:

$ encode_to_var würst
w_C3_BCrst
Contributor

krader1961 commented Jun 20, 2017

Try this function:

function encode_to_var
    set -l orig_str "$argv"
    set -l new_str
    for letter in (string split '' $orig_str)
        if string match -qr '[\w\d]' -- $letter
            if test $letter = '_'
                set new_str {$new_str}__
            else
                set new_str "$new_str$letter"
            end
        else
            echo -n $letter | begin
                set -lx LC_ALL C
                read -lz utf8
                # string split '' -- $utf8 | od -tx1x >&2
                printf '_%02X' \'(string split '' -- $utf8)
            end | read letter
            set new_str {$new_str}$letter
        end
    end
    echo $new_str
end

For example:

$ encode_to_var würst
w_C3_BCrst
@krader1961

This comment has been minimized.

Show comment
Hide comment
@krader1961

krader1961 Jun 20, 2017

Contributor

Doing the inverse of my encode_to_var function will be slightly more difficult without resorting to C++ but should be doable.

Contributor

krader1961 commented Jun 20, 2017

Doing the inverse of my encode_to_var function will be slightly more difficult without resorting to C++ but should be doable.

@faho

This comment has been minimized.

Show comment
Hide comment
@faho

faho Jun 20, 2017

Member
   if string match -qr '[\w\d]' -- $letter

\w actually includes digits - string match -r '\w' a 1.

           set new_str {$new_str}__

This $new_str actually needs to be quoted, or $new_str needs to be set to the empty string in the declaration. Otherwise the cartesian product will bite you for variable names starting with "_".

echo -n $letter | begin
          set -lx LC_ALL C
          read -lz utf8

What exactly is the purpose of these reads? Are you just trying to minimize the impact of LC_ALL?

Member

faho commented Jun 20, 2017

   if string match -qr '[\w\d]' -- $letter

\w actually includes digits - string match -r '\w' a 1.

           set new_str {$new_str}__

This $new_str actually needs to be quoted, or $new_str needs to be set to the empty string in the declaration. Otherwise the cartesian product will bite you for variable names starting with "_".

echo -n $letter | begin
          set -lx LC_ALL C
          read -lz utf8

What exactly is the purpose of these reads? Are you just trying to minimize the impact of LC_ALL?

@krader1961

This comment has been minimized.

Show comment
Hide comment
@krader1961

krader1961 Jun 20, 2017

Contributor

\w actually includes digits

Why is that a problem? Digits are legal in variable names.

or $new_str needs to be set to the empty string in the declaration

Yes, good catch. That's why unit tests that cover all the corner cases are needed for code like this.

What exactly is the purpose of these reads?

It's so that the only place we deal with the individual UTF-8 bytes is that block. Everywhere else fish works with its normal encoding. This minimizes potential problems. It's also a good argument for doing this in C++.

Contributor

krader1961 commented Jun 20, 2017

\w actually includes digits

Why is that a problem? Digits are legal in variable names.

or $new_str needs to be set to the empty string in the declaration

Yes, good catch. That's why unit tests that cover all the corner cases are needed for code like this.

What exactly is the purpose of these reads?

It's so that the only place we deal with the individual UTF-8 bytes is that block. Everywhere else fish works with its normal encoding. This minimizes potential problems. It's also a good argument for doing this in C++.

@faho

This comment has been minimized.

Show comment
Hide comment
@faho

faho Jun 20, 2017

Member

Why is that a problem? Digits are legal in variable names.

That just means the "\d" is redundant 😄.

That's why unit tests that cover all the corner cases are needed for code like this.

Yes.

It's so that the only place we deal with the individual UTF-8 bytes is that block. Everywhere else fish works with its normal encoding.

A normal begin; end block would suffice for that.

Here's what I came up with (after adding your underline bit because that's neat):

function fish_escape_varname -a name
    # Hex-encode any characters that cannot appear in a variable name
    set -l escaped_name
    for c in (string split '' -- $name)
        if string match -qr '\W' -- $c
            set -lx LC_ALL C
            # Hex-encode the character, surround with "_"
            # to improve readability.
            set escaped_name "$escaped_name"(printf '_%X_' "'"$c)
        else
            # Double any literal "_".
            if test "$c" = "_"
                set escaped_name "$escaped_name"__
            else
                set escaped_name "$escaped_name$c"
            end
        end
    end
    echo $escaped_name
end
Member

faho commented Jun 20, 2017

Why is that a problem? Digits are legal in variable names.

That just means the "\d" is redundant 😄.

That's why unit tests that cover all the corner cases are needed for code like this.

Yes.

It's so that the only place we deal with the individual UTF-8 bytes is that block. Everywhere else fish works with its normal encoding.

A normal begin; end block would suffice for that.

Here's what I came up with (after adding your underline bit because that's neat):

function fish_escape_varname -a name
    # Hex-encode any characters that cannot appear in a variable name
    set -l escaped_name
    for c in (string split '' -- $name)
        if string match -qr '\W' -- $c
            set -lx LC_ALL C
            # Hex-encode the character, surround with "_"
            # to improve readability.
            set escaped_name "$escaped_name"(printf '_%X_' "'"$c)
        else
            # Double any literal "_".
            if test "$c" = "_"
                set escaped_name "$escaped_name"__
            else
                set escaped_name "$escaped_name$c"
            end
        end
    end
    echo $escaped_name
end
@krader1961

This comment has been minimized.

Show comment
Hide comment
@krader1961

krader1961 Jun 20, 2017

Contributor

That doesn't do the right thing:

$ fish_escape_varname würst
w_FC_rst

This is a good example of why I cringe when I see us trying to deal with UTF-8 in fish script.

Contributor

krader1961 commented Jun 20, 2017

That doesn't do the right thing:

$ fish_escape_varname würst
w_FC_rst

This is a good example of why I cringe when I see us trying to deal with UTF-8 in fish script.

@faho

This comment has been minimized.

Show comment
Hide comment
@faho

faho Jun 20, 2017

Member

Huh... interesting.

This is a good example of why I cringe when I see us trying to deal with UTF-8 in fish script.

Yeah. The alternative is doing it in C++. I'd suggest an option to string escape - string escape --varname or string escape --style=varname - which could also be extended for url-encoding.

Member

faho commented Jun 20, 2017

Huh... interesting.

This is a good example of why I cringe when I see us trying to deal with UTF-8 in fish script.

Yeah. The alternative is doing it in C++. I'd suggest an option to string escape - string escape --varname or string escape --style=varname - which could also be extended for url-encoding.

@krader1961

This comment has been minimized.

Show comment
Hide comment
@krader1961

krader1961 Jun 20, 2017

Contributor

I'd suggest an option to string escape - string escape --varname or string escape --style=varname - which could also be extended for url-encoding.

Agreed. I think that's a better idea than a new string subcommand. And infinitely better than doing it in fish script. Do you want to implement it or shall I?

Contributor

krader1961 commented Jun 20, 2017

I'd suggest an option to string escape - string escape --varname or string escape --style=varname - which could also be extended for url-encoding.

Agreed. I think that's a better idea than a new string subcommand. And infinitely better than doing it in fish script. Do you want to implement it or shall I?

@faho

This comment has been minimized.

Show comment
Hide comment
@faho

faho Jun 20, 2017

Member

Do you want to implement it or shall I?

Honestly, I hate C++'s string handling, so if I may pass the buck...

Member

faho commented Jun 20, 2017

Do you want to implement it or shall I?

Honestly, I hate C++'s string handling, so if I may pass the buck...

@krader1961

This comment has been minimized.

Show comment
Hide comment
@krader1961

krader1961 Jun 23, 2017

Contributor

The PR I just created for issue #4150 will make it trivial to fix this issue.

Contributor

krader1961 commented Jun 23, 2017

The PR I just created for issue #4150 will make it trivial to fix this issue.

@faho faho closed this in 3b5fdc3 Jun 28, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment