New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Definition of a valid identifier is incomplete (or fish is accepting "invalid" identiers) #5813
Comments
We use PCRE2 which is really great at unicode. But You can enable Unicode properties for these generic classes with |
Good to know. It seems that The weird thing you can notice is that |
The current implementation is basically |
Having delved into this particular rabbit hole while developing a syntax plugin for an editor that happens to use PCRE2 internally, I have come to the reluctant conclusion that there is no way to correctly match |
The restriction placed on identifiers for variables is rather, well, restrictive. Apart from the following characters or character sets:
why not permit all others ? |
I'm not sure what the rationale is for including other punctuation, including current and potential syntax? |
I'd say that was, in itself, a rationale. Unless there was very strong reasons not to widen the character set, I think it would make more sense to have the least restrictive design implementations. From I believe Though this will vary a little between systems, for reference, enumerating the valid character set produces: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T
U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w
x y z ª µ º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø
Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö
ø ù ú û ü ý þ ÿ Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď Đ đ Ē ē Ĕ ĕ
Ė ė Ę ę Ě ě Ĝ ĝ Ğ ğ Ġ ġ Ģ ģ Ĥ ĥ Ħ ħ Ĩ ĩ Ī ī Ĭ ĭ Į į İ ı IJ ij
Ĵ ĵ Ķ ķ ĸ Ĺ ĺ Ļ ļ Ľ ľ Ŀ ŀ Ł ł Ń ń Ņ ņ Ň ň ʼn Ŋ ŋ Ō ō Ŏ ŏ Ő ő
Œ œ Ŕ ŕ Ŗ ŗ Ř ř Ś ś Ŝ ŝ Ş ş Š š Ţ ţ Ť ť Ŧ ŧ Ũ ũ Ū ū Ŭ ŭ Ů ů
Ű ű Ų ų Ŵ ŵ Ŷ ŷ Ÿ Ź ź Ż ż Ž ž ſ ƀ Ɓ Ƃ ƃ Ƅ ƅ Ɔ Ƈ ƈ Ɖ Ɗ Ƌ ƌ ƍ
Ǝ Ə Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ ƙ ƚ ƛ Ɯ Ɲ ƞ Ɵ Ơ ơ Ƣ ƣ Ƥ ƥ Ʀ Ƨ ƨ Ʃ ƪ ƫ
Ƭ ƭ Ʈ Ư ư Ʊ Ʋ Ƴ ƴ Ƶ ƶ Ʒ Ƹ ƹ ƺ Ƽ ƽ ƾ ƿ DŽ Dž dž LJ Lj lj NJ Nj nj Ǎ ǎ
Ǐ ǐ Ǒ ǒ Ǔ ǔ Ǖ ǖ Ǘ ǘ Ǚ ǚ Ǜ ǜ ǝ Ǟ ǟ Ǡ ǡ Ǣ ǣ Ǥ ǥ Ǧ ǧ Ǩ ǩ Ǫ ǫ Ǭ
ǭ Ǯ ǯ ǰ DZ Dz dz Ǵ ǵ Ƕ Ƿ Ǹ ǹ Ǻ ǻ Ǽ ǽ Ǿ ǿ Ȁ ȁ Ȃ ȃ Ȅ ȅ Ȇ ȇ Ȉ ȉ Ȋ
ȋ Ȍ ȍ Ȏ ȏ Ȑ ȑ Ȓ ȓ Ȕ ȕ Ȗ ȗ Ș ș Ț ț Ȝ ȝ Ȟ ȟ Ƞ Ȣ ȣ Ȥ ȥ Ȧ ȧ Ȩ ȩ
Ȫ ȫ Ȭ ȭ Ȯ ȯ Ȱ ȱ Ȳ ȳ ɐ ɑ ɒ ɓ ɔ ɕ ɖ ɗ ɘ ə ɚ ɛ ɜ ɝ ɞ ɟ ɠ ɡ ɢ ɣ
ɤ ɥ ɦ ɧ ɨ ɩ ɪ ɫ ɬ ɭ ɮ ɯ ɰ ɱ ɲ ɳ ɴ ɵ ɶ ɷ ɸ ɹ ɺ ɻ ɼ ɽ ɾ ɿ ʀ ʁ
ʂ ʃ ʄ ʅ ʆ ʇ ʈ ʉ ʊ ʋ ʌ ʍ ʎ ʏ ʐ ʑ ʒ ʓ ʔ ʕ ʖ ʗ ʘ ʙ ʚ ʛ ʜ ʝ ʞ ʟ
ʠ ʡ ʢ ʣ ʤ ʥ ʦ ʧ ʨ ʩ ʪ ʫ ʬ ʭ ʮ ʯ Ά Έ Ή Ί Ό Ύ Ώ ΐ Α Β Γ Δ Ε Ζ
Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ Σ Τ Υ Φ Χ Ψ Ω Ϊ Ϋ ά έ ή ί ΰ α β γ δ ε
ζ η θ ι κ λ μ ν ξ ο π ρ ς σ τ υ φ χ ψ ω ϊ ϋ ό ύ ώ ϐ ϑ ϒ ϓ ϔ
ϕ ϖ ϗ Ϙ ϙ Ϛ ϛ Ϝ ϝ Ϟ ϟ Ϡ ϡ Ϣ ϣ Ϥ ϥ Ϧ ϧ Ϩ ϩ Ϫ ϫ Ϭ ϭ Ϯ ϯ ϰ ϱ ϲ
ϳ ϴ ϵ Ѐ Ё Ђ Ѓ Є Ѕ І Ї Ј Љ Њ Ћ Ќ Ѝ Ў Џ А Б В Г Д Е Ж З И Й К
Л М Н О П Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я а б в г д е ж з и
й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я ѐ ё ђ ѓ є ѕ і
ї ј љ њ ћ ќ ѝ ў џ Ѡ ѡ Ѣ ѣ Ѥ ѥ Ѧ ѧ Ѩ ѩ Ѫ ѫ Ѭ ѭ Ѯ ѯ Ѱ ѱ Ѳ ѳ Ѵ
ѵ Ѷ ѷ Ѹ ѹ Ѻ ѻ Ѽ ѽ Ѿ ѿ Ҁ ҁ Ҋ ҋ Ҍ ҍ Ҏ ҏ Ґ ґ Ғ ғ Ҕ ҕ Җ җ Ҙ ҙ Қ
қ Ҝ ҝ Ҟ ҟ Ҡ ҡ Ң ң Ҥ ҥ Ҧ ҧ Ҩ ҩ Ҫ ҫ Ҭ ҭ Ү ү Ұ ұ Ҳ ҳ Ҵ ҵ Ҷ ҷ Ҹ
ҹ Һ һ Ҽ ҽ Ҿ ҿ Ӏ Ӂ ӂ Ӄ ӄ Ӆ ӆ Ӈ ӈ Ӊ ӊ Ӌ ӌ Ӎ ӎ Ӑ ӑ Ӓ ӓ Ӕ ӕ Ӗ ӗ
Ә ә Ӛ ӛ Ӝ ӝ Ӟ ӟ Ӡ ӡ Ӣ ӣ Ӥ ӥ Ӧ ӧ Ө ө Ӫ ӫ Ӭ ӭ Ӯ ӯ Ӱ ӱ Ӳ ӳ Ӵ ӵ
Ӹ ӹ Ԁ ԁ Ԃ ԃ Ԅ ԅ Ԇ ԇ Ԉ ԉ Ԋ ԋ Ԍ ԍ Ԏ ԏ Ա Բ Գ Դ Ե Զ Է Ը Թ Ժ Ի Լ
Խ Ծ Կ Հ Ձ Ղ Ճ Մ Յ Ն Շ Ո Չ Պ Ջ Ռ Ս Վ Տ Ր Ց Ւ Փ Ք Օ Ֆ ա բ գ դ
ե զ է ը թ ժ ի լ խ ծ կ հ ձ ղ ճ մ յ ն շ ո չ պ ջ ռ ս վ տ ր ց ւ
փ ք օ ֆ և Ⴀ Ⴁ Ⴂ Ⴃ Ⴄ Ⴅ Ⴆ Ⴇ Ⴈ Ⴉ Ⴊ Ⴋ Ⴌ Ⴍ Ⴎ Ⴏ Ⴐ Ⴑ Ⴒ Ⴓ Ⴔ Ⴕ Ⴖ Ⴗ Ⴘ
Ⴙ Ⴚ Ⴛ Ⴜ Ⴝ Ⴞ Ⴟ Ⴠ Ⴡ Ⴢ Ⴣ Ⴤ Ⴥ Ḁ ḁ Ḃ ḃ Ḅ ḅ Ḇ ḇ Ḉ ḉ Ḋ ḋ Ḍ ḍ Ḏ ḏ Ḑ
ḑ Ḓ ḓ Ḕ ḕ Ḗ ḗ Ḙ ḙ Ḛ ḛ Ḝ ḝ Ḟ ḟ Ḡ ḡ Ḣ ḣ Ḥ ḥ Ḧ ḧ Ḩ ḩ Ḫ ḫ Ḭ ḭ Ḯ
ḯ Ḱ ḱ Ḳ ḳ Ḵ ḵ Ḷ ḷ Ḹ ḹ Ḻ ḻ Ḽ ḽ Ḿ ḿ Ṁ ṁ Ṃ ṃ Ṅ ṅ Ṇ ṇ Ṉ ṉ Ṋ ṋ Ṍ
ṍ Ṏ ṏ Ṑ ṑ Ṓ ṓ Ṕ ṕ Ṗ ṗ Ṙ ṙ Ṛ ṛ Ṝ ṝ Ṟ ṟ Ṡ ṡ Ṣ ṣ Ṥ ṥ Ṧ ṧ Ṩ ṩ Ṫ
ṫ Ṭ ṭ Ṯ ṯ Ṱ ṱ Ṳ ṳ Ṵ ṵ Ṷ ṷ Ṹ ṹ Ṻ ṻ Ṽ ṽ Ṿ ṿ Ẁ ẁ Ẃ ẃ Ẅ ẅ Ẇ ẇ Ẉ
ẉ Ẋ ẋ Ẍ ẍ Ẏ ẏ Ẑ ẑ Ẓ ẓ Ẕ ẕ ẖ ẗ ẘ ẙ ẚ ẛ Ạ ạ Ả ả Ấ ấ Ầ ầ Ẩ ẩ Ẫ
ẫ Ậ ậ Ắ ắ Ằ ằ Ẳ ẳ Ẵ ẵ Ặ ặ Ẹ ẹ Ẻ ẻ Ẽ ẽ Ế ế Ề ề Ể ể Ễ ễ Ệ ệ Ỉ
ỉ Ị ị Ọ ọ Ỏ ỏ Ố ố Ồ ồ Ổ ổ Ỗ ỗ Ộ ộ Ớ ớ Ờ ờ Ở ở Ỡ ỡ Ợ ợ Ụ ụ Ủ
ủ Ứ ứ Ừ ừ Ử ử Ữ ữ Ự ự Ỳ ỳ Ỵ ỵ Ỷ ỷ Ỹ ỹ ἀ ἁ ἂ ἃ ἄ ἅ ἆ ἇ Ἀ Ἁ Ἂ
Ἃ Ἄ Ἅ Ἆ Ἇ ἐ ἑ ἒ ἓ ἔ ἕ Ἐ Ἑ Ἒ Ἓ Ἔ Ἕ ἠ ἡ ἢ ἣ ἤ ἥ ἦ ἧ Ἠ Ἡ Ἢ Ἣ Ἤ
Ἥ Ἦ Ἧ ἰ ἱ ἲ ἳ ἴ ἵ ἶ ἷ Ἰ Ἱ Ἲ Ἳ Ἴ Ἵ Ἶ Ἷ ὀ ὁ ὂ ὃ ὄ ὅ Ὀ Ὁ Ὂ Ὃ Ὄ
Ὅ ὐ ὑ ὒ ὓ ὔ ὕ ὖ ὗ Ὑ Ὓ Ὕ Ὗ ὠ ὡ ὢ ὣ ὤ ὥ ὦ ὧ Ὠ Ὡ Ὢ Ὣ Ὤ Ὥ Ὦ Ὧ ὰ
ά ὲ έ ὴ ή ὶ ί ὸ ό ὺ ύ ὼ ώ ᾀ ᾁ ᾂ ᾃ ᾄ ᾅ ᾆ ᾇ ᾈ ᾉ ᾊ ᾋ ᾌ ᾍ ᾎ ᾏ ᾐ
ᾑ ᾒ ᾓ ᾔ ᾕ ᾖ ᾗ ᾘ ᾙ ᾚ ᾛ ᾜ ᾝ ᾞ ᾟ ᾠ ᾡ ᾢ ᾣ ᾤ ᾥ ᾦ ᾧ ᾨ ᾩ ᾪ ᾫ ᾬ ᾭ ᾮ
ᾯ ᾰ ᾱ ᾲ ᾳ ᾴ ᾶ ᾷ Ᾰ Ᾱ Ὰ Ά ᾼ ι ῂ ῃ ῄ ῆ ῇ Ὲ Έ Ὴ Ή ῌ ῐ ῑ ῒ ΐ ῖ ῗ
Ῐ Ῑ Ὶ Ί ῠ ῡ ῢ ΰ ῤ ῥ ῦ ῧ Ῠ Ῡ Ὺ Ύ Ῥ ῲ ῳ ῴ ῶ ῷ Ὸ Ό Ὼ Ώ ῼ ⁱ ⁿ ℂ
ℇ ℊ ℋ ℌ ℍ ℎ ℏ ℐ ℑ ℒ ℓ ℕ ℙ ℚ ℛ ℜ ℝ ℤ Ω ℨ K Å ℬ ℭ ℯ ℰ ℱ ℳ ℴ ℹ |
It's spelled "fish", or at the beginning of a sentence "Fish". I have no idea where that capitalization comes from, we've never used it. Not doing that makes it both easier on the eyes and easier to write.
That seems to be incomplete, and it will vary a lot between systems. For example I don't see We use "iswalnum" to validate identifiers, and that both depends on the locale (if you have LC_ALL=C anywhere you're gonna have a bad time - tho you're gonna have a bad time in many ways if you do. We might even want to warn if you do because it's basically broken) and the supported unicode version. Since we're not going to implement a table of allowed codepoints ourselves, we could change to iswgraph here - no control characters, unprintable characters or whitespace, minus the characters special to fish - no $, HOWEVER: That's a breaking change. If you had this: set foo 1
echo $foo-bar This currently prints "1-bar". If we added I'm not sure that's worth it. |
Okay, here's what is needed to get our test suite to pass with iswgraph: bool valid_var_name_char(wchar_t chr) { return fish_iswgraph(chr) && !wcschr(L"{}\"'\\()|><$[];#*&?^:-,%./@=", chr); } Some of these are characters special to fish itself - like Some of these are just characters that happen to be used often after a variable like export PATH="$PATH:$HOME/.local/bin" # surely in thousands of config.fish
echo $var-thing
printf "%s$format%s" # from fish_git_prompt
set opt_spec "$opt_spec=+" # from fish_opt
for file in $__fish_config_dir/conf.d/*.fish # share/config.fish
echo $USER@$hostname # in a bunch of prompts
cp $file $file.mp4 # something I write quite often are completely normal things to write! Something like that will be used in a lot of existing fish script, and requiring a change to echo {$USER}@$hostname would be a pain in the gluteals for all our users. At that point I'm not sure what is gained here? You can now name a variable en-dash or The current logic is "alphanumeric and _", which might be viewed as overly restrictive, but the most common characters we could add to that are already in use after variables and there is no good way of allowing them without breaking any code that relies on them not being allowed in variable names. The rest aren't as common or as useful - and note that we already allow So: Sorry, but that's not worth it. We might want to look into why EDIT: Okay, from the pcre2 docs:
The current code has a scare comment about using "PCRE2_NEVER_UTF" because of "security", I'm not sure what that's referring to. |
Yeah I don't see a compelling reason to stop using |
So the behaviour is what we want; does the documentation need improving? Instead of "A variable name can not be empty and can contain only letters, digits, and underscores.", what about "can contain only alphanumeric characters (as defined by the current locale) and underscores"? Personally, I think that's worse and I would be happy to leave it as is. |
We might want to add a note about "letter" and "digit" being locale-dependent, but I'm not sure how locale-dependent that is. Like I've said above, my de_DE.UTF-8 locale already says "位" is alnum, so it's apparently just C locales being different. I don't think we need to document that - if your locale is C, non-ASCII things will be broken. (and yes, if C.UTF-8 was a thing everywhere, I would absolutely be for hardcoding that as LC_CTYPE) So the only remaining thing to do is to figure out what that scare comment is about and whether it affects PCRE2_UCP as well, and if possible set it. |
The docs (scroll up a little after loading this link) say:
From which you might guess that
string match -qr "^[[:alnum:]_]+$"
would validate identifiers, interpreting the meaning of 'letters' and 'digits' in a unicode-aware manner.However, that test is too stringent; it fails to validate identifiers which fish itself accepts.
For example
set -l ⓣ 22; set -l
shows that ⓣ is a valid identifier.
Conversely,
set -l ⒯ 22; set -l
shows that ⒯ is not a valid identifier.
The mentioned 'string match' test rejects both ⓣ and ⒯.
The reason why characters such as ⓣ are accepted in the first place is not clear; they are in unicode categories Symbol/Other, but so is the rejected ⒯.
Either fish is accepting identifier names it should be rejecting, or the documentation is not an adequate description of fishes actual identifier validation.
EDIT: there appears to be a further problem, the regexp character categories are not unicode aware; for example the 'string match' test does not validate
À
, which fish itself rightly accepts (character is in unicode categories Letter/Uppercase). Thestring
documentation gives no indication of the fact that these regexp character classes are not unicode aware.The text was updated successfully, but these errors were encountered: