Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Definition of a valid identifier is incomplete (or fish is accepting "invalid" identiers) #5813

Open
0ion9 opened this issue Apr 12, 2019 · 12 comments
Milestone

Comments

@0ion9
Copy link
Contributor

0ion9 commented Apr 12, 2019

The docs (scroll up a little after loading this link) say:

A variable name cannot be empty. It can contain only letters, digits, and underscores. It may begin and end with any of those characters.

From which you might guess that string match -qr "^[[:alnum:]_]+$" would validate identifiers, interpreting the meaning of 'letters' and 'digits' in a unicode-aware manner.

However, that test is too stringent; it fails to validate identifiers which fish itself accepts.
For example

set -l ⓣ 22; set -l

shows that ⓣ is a valid identifier.
Conversely,

set -l ⒯ 22; set -l
shows that ⒯ is not a valid identifier.

The mentioned 'string match' test rejects both ⓣ and ⒯.

The reason why characters such as ⓣ are accepted in the first place is not clear; they are in unicode categories Symbol/Other, but so is the rejected ⒯.

Either fish is accepting identifier names it should be rejecting, or the documentation is not an adequate description of fishes actual identifier validation.

EDIT: there appears to be a further problem, the regexp character categories are not unicode aware; for example the 'string match' test does not validate À, which fish itself rightly accepts (character is in unicode categories Letter/Uppercase). The string documentation gives no indication of the fact that these regexp character classes are not unicode aware.

@floam
Copy link
Member

floam commented Apr 12, 2019

there appears to be a further problem, the regexp character categories are not unicode aware;

We use PCRE2 which is really great at unicode. But :alnum: is a POSIX character class and by default PCRE2 follows POSIX and limits these to ASCII. https://www.pcre.org/current/doc/html/pcre2pattern.html#SEC10

You can enable Unicode properties for these generic classes with (*UCP) at the beginning of your pattern, per the PCRE2 documentation. There is a way for us to flip this on at compile time, we may never have considered it.

@0ion9
Copy link
Contributor Author

0ion9 commented Apr 12, 2019

Good to know. It seems that [:alnum:] is definitely wrong, it causes the matching of
¼ for example, fish forbids that.

The weird thing you can notice is that [:alnum:] is NOT simply equivalent to [:alpha:][:digit:]
(which correctly rejects ¼). [[:alpha:][:digit:]_] with (*UCP) appears to do the right thing in all cases I've tested.

@zanchey zanchey added this to the fish-future milestone Dec 27, 2020
@zanchey
Copy link
Member

zanchey commented Dec 27, 2020

The current implementation is basically _ and anything that iswalnum() returns true for, which is (excitingly) locale dependent.

@kopischke
Copy link

kopischke commented Mar 8, 2021

Having delved into this particular rabbit hole while developing a syntax plugin for an editor that happens to use PCRE2 internally, I have come to the reluctant conclusion that there is no way to correctly match iswalnum()’s quixotic character set with PCRE2’s Unicode facilities. The closest I have come is [_0-9\p{Ll}\p{Lu}], which, with a mere 3950 code points, is a far less expansive character set than the one suggested by @0ion9 (that one is equivalent to [_\p{Nd}\p{L}] and boasts 126,274 code points). Most importantly, that narrower set still produces false positives: for example, fish won’t accept the Cherokee characters this includes as identifiers, at least not in a German UTF-8 locale.

@ChristoferK
Copy link

The restriction placed on identifiers for variables is rather, well, restrictive. Apart from the following characters or character sets:

  • white space characters;
  • non-printing characters;
  • [, ], (, ), {, };
  • $;
  • <, >; and
  • ;

why not permit all others ?

@zanchey
Copy link
Member

zanchey commented May 5, 2021

I'm not sure what the rationale is for including other punctuation, including current and potential syntax?

@ChristoferK
Copy link

I'm not sure what the rationale is for including other punctuation, including current and potential syntax?

I'd say that was, in itself, a rationale. Unless there was very strong reasons not to widen the character set, I think it would make more sense to have the least restrictive design implementations.

From FiSH's FAQ:
🙶The fish design has three high level goals.... Everything that can be done in other shell languages should be possible to do in fish.🙸

I believe FiSH has one of the heaviest restrictions placed upon its identifiers' amongst the shell languages. It even prevents use of characters such as :, -, and .. The only non-word character to separate elements in an identifier that has any ease-of-use is _, but it gets unwieldy to organise the namespace in a manner that's effective or friendly. I've bound a key to inserting ϟ into the commandline, which provides a second delimiting symbol.

Though this will vary a little between systems, for reference, enumerating the valid character set produces:

0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T 
U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w 
x y z ª µ º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø 
Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö 
ø ù ú û ü ý þ ÿ Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď Đ đ Ē ē Ĕ ĕ 
Ė ė Ę ę Ě ě Ĝ ĝ Ğ ğ Ġ ġ Ģ ģ Ĥ ĥ Ħ ħ Ĩ ĩ Ī ī Ĭ ĭ Į į İ ı IJ ij 
Ĵ ĵ Ķ ķ ĸ Ĺ ĺ Ļ ļ Ľ ľ Ŀ ŀ Ł ł Ń ń Ņ ņ Ň ň ʼn Ŋ ŋ Ō ō Ŏ ŏ Ő ő 
Œ œ Ŕ ŕ Ŗ ŗ Ř ř Ś ś Ŝ ŝ Ş ş Š š Ţ ţ Ť ť Ŧ ŧ Ũ ũ Ū ū Ŭ ŭ Ů ů 
Ű ű Ų ų Ŵ ŵ Ŷ ŷ Ÿ Ź ź Ż ż Ž ž ſ ƀ Ɓ Ƃ ƃ Ƅ ƅ Ɔ Ƈ ƈ Ɖ Ɗ Ƌ ƌ ƍ 
Ǝ Ə Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ ƙ ƚ ƛ Ɯ Ɲ ƞ Ɵ Ơ ơ Ƣ ƣ Ƥ ƥ Ʀ Ƨ ƨ Ʃ ƪ ƫ 
Ƭ ƭ Ʈ Ư ư Ʊ Ʋ Ƴ ƴ Ƶ ƶ Ʒ Ƹ ƹ ƺ Ƽ ƽ ƾ ƿ DŽ Dž dž LJ Lj lj NJ Nj nj Ǎ ǎ 
Ǐ ǐ Ǒ ǒ Ǔ ǔ Ǖ ǖ Ǘ ǘ Ǚ ǚ Ǜ ǜ ǝ Ǟ ǟ Ǡ ǡ Ǣ ǣ Ǥ ǥ Ǧ ǧ Ǩ ǩ Ǫ ǫ Ǭ 
ǭ Ǯ ǯ ǰ DZ Dz dz Ǵ ǵ Ƕ Ƿ Ǹ ǹ Ǻ ǻ Ǽ ǽ Ǿ ǿ Ȁ ȁ Ȃ ȃ Ȅ ȅ Ȇ ȇ Ȉ ȉ Ȋ 
ȋ Ȍ ȍ Ȏ ȏ Ȑ ȑ Ȓ ȓ Ȕ ȕ Ȗ ȗ Ș ș Ț ț Ȝ ȝ Ȟ ȟ Ƞ Ȣ ȣ Ȥ ȥ Ȧ ȧ Ȩ ȩ 
Ȫ ȫ Ȭ ȭ Ȯ ȯ Ȱ ȱ Ȳ ȳ ɐ ɑ ɒ ɓ ɔ ɕ ɖ ɗ ɘ ə ɚ ɛ ɜ ɝ ɞ ɟ ɠ ɡ ɢ ɣ 
ɤ ɥ ɦ ɧ ɨ ɩ ɪ ɫ ɬ ɭ ɮ ɯ ɰ ɱ ɲ ɳ ɴ ɵ ɶ ɷ ɸ ɹ ɺ ɻ ɼ ɽ ɾ ɿ ʀ ʁ 
ʂ ʃ ʄ ʅ ʆ ʇ ʈ ʉ ʊ ʋ ʌ ʍ ʎ ʏ ʐ ʑ ʒ ʓ ʔ ʕ ʖ ʗ ʘ ʙ ʚ ʛ ʜ ʝ ʞ ʟ 
ʠ ʡ ʢ ʣ ʤ ʥ ʦ ʧ ʨ ʩ ʪ ʫ ʬ ʭ ʮ ʯ Ά Έ Ή Ί Ό Ύ Ώ ΐ Α Β Γ Δ Ε Ζ 
Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω Ϊ Ϋ ά έ ή ί ΰ α β γ δ ε 
ζ η θ ι κ λ μ ν ξ ο π ρ ς σ τ υ φ χ ψ ω ϊ ϋ ό ύ ώ ϐ ϑ ϒ ϓ ϔ 
ϕ ϖ ϗ Ϙ ϙ Ϛ ϛ Ϝ ϝ Ϟ ϟ Ϡ ϡ Ϣ ϣ Ϥ ϥ Ϧ ϧ Ϩ ϩ Ϫ ϫ Ϭ ϭ Ϯ ϯ ϰ ϱ ϲ 
ϳ ϴ ϵ Ѐ Ё Ђ Ѓ Є Ѕ І Ї Ј Љ Њ Ћ Ќ Ѝ Ў Џ А Б В Г Д Е Ж З И Й К 
Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я а б в г д е ж з и 
й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я ѐ ё ђ ѓ є ѕ і 
ї ј љ њ ћ ќ ѝ ў џ Ѡ ѡ Ѣ ѣ Ѥ ѥ Ѧ ѧ Ѩ ѩ Ѫ ѫ Ѭ ѭ Ѯ ѯ Ѱ ѱ Ѳ ѳ Ѵ 
ѵ Ѷ ѷ Ѹ ѹ Ѻ ѻ Ѽ ѽ Ѿ ѿ Ҁ ҁ Ҋ ҋ Ҍ ҍ Ҏ ҏ Ґ ґ Ғ ғ Ҕ ҕ Җ җ Ҙ ҙ Қ 
қ Ҝ ҝ Ҟ ҟ Ҡ ҡ Ң ң Ҥ ҥ Ҧ ҧ Ҩ ҩ Ҫ ҫ Ҭ ҭ Ү ү Ұ ұ Ҳ ҳ Ҵ ҵ Ҷ ҷ Ҹ 
ҹ Һ һ Ҽ ҽ Ҿ ҿ Ӏ Ӂ ӂ Ӄ ӄ Ӆ ӆ Ӈ ӈ Ӊ ӊ Ӌ ӌ Ӎ ӎ Ӑ ӑ Ӓ ӓ Ӕ ӕ Ӗ ӗ 
Ә ә Ӛ ӛ Ӝ ӝ Ӟ ӟ Ӡ ӡ Ӣ ӣ Ӥ ӥ Ӧ ӧ Ө ө Ӫ ӫ Ӭ ӭ Ӯ ӯ Ӱ ӱ Ӳ ӳ Ӵ ӵ 
Ӹ ӹ Ԁ ԁ Ԃ ԃ Ԅ ԅ Ԇ ԇ Ԉ ԉ Ԋ ԋ Ԍ ԍ Ԏ ԏ Ա Բ Գ Դ Ե Զ Է Ը Թ Ժ Ի Լ 
Խ Ծ Կ Հ Ձ Ղ Ճ Մ Յ Ն Շ Ո Չ Պ Ջ Ռ Ս Վ Տ Ր Ց Ւ Փ Ք Օ Ֆ ա բ գ դ 
ե զ է ը թ ժ ի լ խ ծ կ հ ձ ղ ճ մ յ ն շ ո չ պ ջ ռ ս վ տ ր ց ւ 
փ ք օ ֆ և Ⴀ Ⴁ Ⴂ Ⴃ Ⴄ Ⴅ Ⴆ Ⴇ Ⴈ Ⴉ Ⴊ Ⴋ Ⴌ Ⴍ Ⴎ Ⴏ Ⴐ Ⴑ Ⴒ Ⴓ Ⴔ Ⴕ Ⴖ Ⴗ Ⴘ 
Ⴙ Ⴚ Ⴛ Ⴜ Ⴝ Ⴞ Ⴟ Ⴠ Ⴡ Ⴢ Ⴣ Ⴤ Ⴥ Ḁ ḁ Ḃ ḃ Ḅ ḅ Ḇ ḇ Ḉ ḉ Ḋ ḋ Ḍ ḍ Ḏ ḏ Ḑ 
ḑ Ḓ ḓ Ḕ ḕ Ḗ ḗ Ḙ ḙ Ḛ ḛ Ḝ ḝ Ḟ ḟ Ḡ ḡ Ḣ ḣ Ḥ ḥ Ḧ ḧ Ḩ ḩ Ḫ ḫ Ḭ ḭ Ḯ 
ḯ Ḱ ḱ Ḳ ḳ Ḵ ḵ Ḷ ḷ Ḹ ḹ Ḻ ḻ Ḽ ḽ Ḿ ḿ Ṁ ṁ Ṃ ṃ Ṅ ṅ Ṇ ṇ Ṉ ṉ Ṋ ṋ Ṍ 
ṍ Ṏ ṏ Ṑ ṑ Ṓ ṓ Ṕ ṕ Ṗ ṗ Ṙ ṙ Ṛ ṛ Ṝ ṝ Ṟ ṟ Ṡ ṡ Ṣ ṣ Ṥ ṥ Ṧ ṧ Ṩ ṩ Ṫ 
ṫ Ṭ ṭ Ṯ ṯ Ṱ ṱ Ṳ ṳ Ṵ ṵ Ṷ ṷ Ṹ ṹ Ṻ ṻ Ṽ ṽ Ṿ ṿ Ẁ ẁ Ẃ ẃ Ẅ ẅ Ẇ ẇ Ẉ 
ẉ Ẋ ẋ Ẍ ẍ Ẏ ẏ Ẑ ẑ Ẓ ẓ Ẕ ẕ ẖ ẗ ẘ ẙ ẚ ẛ Ạ ạ Ả ả Ấ ấ Ầ ầ Ẩ ẩ Ẫ 
ẫ Ậ ậ Ắ ắ Ằ ằ Ẳ ẳ Ẵ ẵ Ặ ặ Ẹ ẹ Ẻ ẻ Ẽ ẽ Ế ế Ề ề Ể ể Ễ ễ Ệ ệ Ỉ 
ỉ Ị ị Ọ ọ Ỏ ỏ Ố ố Ồ ồ Ổ ổ Ỗ ỗ Ộ ộ Ớ ớ Ờ ờ Ở ở Ỡ ỡ Ợ ợ Ụ ụ Ủ 
ủ Ứ ứ Ừ ừ Ử ử Ữ ữ Ự ự Ỳ ỳ Ỵ ỵ Ỷ ỷ Ỹ ỹ ἀ ἁ ἂ ἃ ἄ ἅ ἆ ἇ Ἀ Ἁ Ἂ 
Ἃ Ἄ Ἅ Ἆ Ἇ ἐ ἑ ἒ ἓ ἔ ἕ Ἐ Ἑ Ἒ Ἓ Ἔ Ἕ ἠ ἡ ἢ ἣ ἤ ἥ ἦ ἧ Ἠ Ἡ Ἢ Ἣ Ἤ 
Ἥ Ἦ Ἧ ἰ ἱ ἲ ἳ ἴ ἵ ἶ ἷ Ἰ Ἱ Ἲ Ἳ Ἴ Ἵ Ἶ Ἷ ὀ ὁ ὂ ὃ ὄ ὅ Ὀ Ὁ Ὂ Ὃ Ὄ 
Ὅ ὐ ὑ ὒ ὓ ὔ ὕ ὖ ὗ Ὑ Ὓ Ὕ Ὗ ὠ ὡ ὢ ὣ ὤ ὥ ὦ ὧ Ὠ Ὡ Ὢ Ὣ Ὤ Ὥ Ὦ Ὧ ὰ 
ά ὲ έ ὴ ή ὶ ί ὸ ό ὺ ύ ὼ ώ ᾀ ᾁ ᾂ ᾃ ᾄ ᾅ ᾆ ᾇ ᾈ ᾉ ᾊ ᾋ ᾌ ᾍ ᾎ ᾏ ᾐ 
ᾑ ᾒ ᾓ ᾔ ᾕ ᾖ ᾗ ᾘ ᾙ ᾚ ᾛ ᾜ ᾝ ᾞ ᾟ ᾠ ᾡ ᾢ ᾣ ᾤ ᾥ ᾦ ᾧ ᾨ ᾩ ᾪ ᾫ ᾬ ᾭ ᾮ 
ᾯ ᾰ ᾱ ᾲ ᾳ ᾴ ᾶ ᾷ Ᾰ Ᾱ Ὰ Ά ᾼ ι ῂ ῃ ῄ ῆ ῇ Ὲ Έ Ὴ Ή ῌ ῐ ῑ ῒ ΐ ῖ ῗ 
Ῐ Ῑ Ὶ Ί ῠ ῡ ῢ ΰ ῤ ῥ ῦ ῧ Ῠ Ῡ Ὺ Ύ Ῥ ῲ ῳ ῴ ῶ ῷ Ὸ Ό Ὼ Ώ ῼ ⁱ ⁿ ℂ 
ℇ ℊ ℋ ℌ ℍ ℎ ℏ ℐ ℑ ℒ ℓ ℕ ℙ ℚ ℛ ℜ ℝ ℤ Ω ℨ K Å ℬ ℭ ℯ ℰ ℱ ℳ ℴ ℹ 

@faho
Copy link
Member

faho commented May 17, 2021

From FiSH's FAQ:

It's spelled "fish", or at the beginning of a sentence "Fish". I have no idea where that capitalization comes from, we've never used it. Not doing that makes it both easier on the eyes and easier to write.

Though this will vary a little between systems, for reference, enumerating the valid character set produces:

That seems to be incomplete, and it will vary a lot between systems. For example I don't see here, and that's just the example from the opening post.

We use "iswalnum" to validate identifiers, and that both depends on the locale (if you have LC_ALL=C anywhere you're gonna have a bad time - tho you're gonna have a bad time in many ways if you do. We might even want to warn if you do because it's basically broken) and the supported unicode version.

Since we're not going to implement a table of allowed codepoints ourselves, we could change to iswgraph here - no control characters, unprintable characters or whitespace, minus the characters special to fish - no $, *, {}, () or [].

HOWEVER: That's a breaking change.

If you had this:

set foo 1
echo $foo-bar

This currently prints "1-bar". If we added - to the allowed characters in variables, that would try to print the variable called "foo-bar".

I'm not sure that's worth it.

@faho
Copy link
Member

faho commented May 19, 2021

Okay, here's what is needed to get our test suite to pass with iswgraph:

bool valid_var_name_char(wchar_t chr) { return fish_iswgraph(chr) && !wcschr(L"{}\"'\\()|><$[];#*&?^:-,%./@=", chr); }

Some of these are characters special to fish itself - like $ or ". These would have to be excluded. Some are explicitly tested in our test suite like , (and we could remove them, but would have to adjust the test).

Some of these are just characters that happen to be used often after a variable like /. E.g.

export PATH="$PATH:$HOME/.local/bin" # surely in thousands of config.fish
echo $var-thing
printf "%s$format%s" # from fish_git_prompt
set opt_spec "$opt_spec=+" # from fish_opt
for file in $__fish_config_dir/conf.d/*.fish  # share/config.fish
echo $USER@$hostname # in a bunch of prompts
cp $file $file.mp4 # something I write quite often

are completely normal things to write! Something like that will be used in a lot of existing fish script, and requiring a change to

echo {$USER}@$hostname

would be a pain in the gluteals for all our users. At that point I'm not sure what is gained here? You can now name a variable en-dash or , I guess?

The current logic is "alphanumeric and _", which might be viewed as overly restrictive, but the most common characters we could add to that are already in use after variables and there is no good way of allowing them without breaking any code that relies on them not being allowed in variable names. The rest aren't as common or as useful - and note that we already allow maßeinheit and 测量单位 because those are alphanumeric (with the correct locale), so the internationalization argument doesn't apply!

So: Sorry, but that's not worth it. We might want to look into why string match rejects what we have now instead.

EDIT: Okay, from the pcre2 docs:

In PCRE2, POSIX character set names recognize only ASCII characters by default, but some of them use Unicode properties if PCRE2_UCP is set. You can use \Q...\E inside a character class.

The current code has a scare comment about using "PCRE2_NEVER_UTF" because of "security", I'm not sure what that's referring to.

@krobelus
Copy link
Member

Yeah I don't see a compelling reason to stop using iswalnum

@zanchey
Copy link
Member

zanchey commented May 21, 2021

So the behaviour is what we want; does the documentation need improving? Instead of "A variable name can not be empty and can contain only letters, digits, and underscores.", what about "can contain only alphanumeric characters (as defined by the current locale) and underscores"? Personally, I think that's worse and I would be happy to leave it as is.

@faho
Copy link
Member

faho commented May 25, 2021

Instead of "A variable name can not be empty and can contain only letters, digits, and underscores.", what about "can contain only alphanumeric characters (as defined by the current locale) and underscores"?

We might want to add a note about "letter" and "digit" being locale-dependent, but I'm not sure how locale-dependent that is.

Like I've said above, my de_DE.UTF-8 locale already says "位" is alnum, so it's apparently just C locales being different. I don't think we need to document that - if your locale is C, non-ASCII things will be broken. (and yes, if C.UTF-8 was a thing everywhere, I would absolutely be for hardcoding that as LC_CTYPE)

So the only remaining thing to do is to figure out what that scare comment is about and whether it affects PCRE2_UCP as well, and if possible set it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants