feat: support unicode for `subject-case` rule #3574

amariq · 2023-03-29T07:26:58Z

Expected Behavior

Would it be okay to allow the subject-case rule handle not only latin but any other language alphabet?

Current Behavior

Currently, the subject-case rule skips validation if message doesn't start with a latin alphabet character (only a to z with case irrelevant, as of now).

It became like that 5 years ago with the following change made — @commitlint/rules/src/subject-case.ts#L14.

Affected packages

cli
core
prompt
config-angular

Possible Solution

The solution, as I see it, would only require to change the RegExp that is used to filter invalid subject messages.

So, change what is used now (which only includes latin alphabet):

subject.match(/^[a-z]/i)

to a pattern with unicode support:

subject.match(/^\p{L}/ui)

where \p{L} means any kind of letter from any language and the u flag enables it.

Context

Basically, if a commit subject is written in non-English language then subject-case rule is just skipped.

This feature would benefit someone who (for one reason or another) writes commit messages (or, rather, subject) using different language than English but still wants to follow the commit convention.

The text was updated successfully, but these errors were encountered:

escapedcat · 2023-03-29T07:44:05Z

Sure, sounds reasonable. Wanna create a PR and extend the test with a non-English language?

amariq · 2023-03-29T07:55:59Z

Oh yeah, gladly!

I'd do it right away, actually, but decided to make an issue first to ask for opinions in case someone would be against modifying the old behaviour 😇. Besides, if I understood it right, the rules can be added/overridden via plugins, so it could be an alternative way of doing it.

amariq · 2023-03-30T09:35:55Z

@escapedcat
Btw, I wonder what such change should really be — feat or fix?

It seems like a feature since it adds what never was there (support for unicode).

But it also seems to be a fix (and breaking at that, probably?) since it "patches" overly-strict RegExp, allowing to process more use-cases again as it was before that alphabetic symbol restriction was introduced (which purpose was mainly on skipping semver string patterns if I catched it right).

What do you think?

escapedcat · 2023-03-30T11:53:00Z

Hm, yeah. Hard to decide for me as well. I feel like as long as all current tests are passing it's non-breaking, does that make sense? We could add it as feature I guess.

amariq · 2023-03-30T12:14:02Z

Yup, tests are passing. But those future changes might affect some scenarios that could be relying on fact that messages starting with some arbitrary symbols always considered valid. Or maybe I'm just over-thinking it... 😅

Can we ping someone who manages releases or something, maybe? To help sort this out.

I'll mark it as a feature in the meantime: commit is here.

PS. What I can think of:

English (Latin alphabet) letters: no changes
Numeric or other actually special symbols: no changes
non-English (non-Latin alphabet) letters: are now not preventing the rule from running, thus check results may vary

escapedcat · 2023-03-30T12:18:38Z

True, I get your point.
Let's try to summon @knocte or @dangreen

amariq · 2023-03-31T04:33:29Z

I dug some more while we waiting and looked at the commit that added [a-z] check.

The reasoning was just to prevent case check fail on version-like strings (ie. 1.0.0), not to limit the character set it seems. But that blocked any strings which start with non-Latin letter from actually being processed anyway.

I think the strings like what would fail the check not because they don't start with a letter but because they only consist of symbols that do not have a case, so to speak (like numbers, punctuation symbols and even 'letters' in some languages, etc).

So, instead of messing with Unicode, we could probably simply replace RegExp with subject.toLowerCase() === subject.toUpperCase() to filter out 'case-less' strings that mess up the check result. But it would have even more impact on backward compatibility than Unicode-solution since then even strings like 1.0.0-beta.1 would've get checked and may lead to a different result... 🤔

escapedcat · 2023-03-31T16:18:08Z

Also tagging @gilmoreorless maybe you have some opinion as well.

gilmoreorless · 2023-04-08T11:50:43Z

Ha, that's what I get for having opinions on breaking changes; suddenly people think I have more opinions about breaking changes! 😆

So, keeping in mind that I'm not a maintainer of this project (and have only really looked at the code that relates to plugins, not the core rules themselves), here are my thoughts:

First up, yay for proper Unicode support — nice work @amariq!
I see the supporting of extended Unicode character ranges as an enhancement, so a feat scope. I get the concerns about possible breakages, but I think it's unlikely in reality. If someone's commit format does get broken by this change, it indicates that their config probably wasn't right in the first place, and it wasn't doing what they thought it was.
1. Case in point: I've worked at several places that enforce an issue tracker ticket at the start of every commit, in the form of feat: [ABC-123] Actual message. This format completely bypasses the subject-case rule because the first character is a symbol [. So the config is kind of useless and varies across projects. If commitlint was ever updated to use only the text after the issue reference, some projects would break — but it would be because their config was already incorrect and no-one noticed. 😉
Do any of the other *-case rules support extended Unicode characters? It could be confusing to have support only in subject-case but nothing else.

amariq · 2023-04-08T15:27:02Z

It could be confusing to have support only in subject-case but nothing else

Yep, I agree.
I checked other case-related rules and found that only header-case rule has the same restriction as the subject-case had... 🤔

This format completely bypasses the subject-case rule

Yes, I also been thinking about that.

And I think it could be fixed by moving always/never condition handling inside the ensure-case function.

That way, we would be able to check if a message can even be validated properly in the first place — that is, it does not consist of case-less symbols only nor contain one at the positions to be transformed — because comparing '123' with never upperFirst('123') otherwise will obviously never pass anyway. Or even try to strip/replace sequences of case-less symbols out of original message to only check the transformation of actual letters.

Plus, as a bonus, no more worrying about Unicode here and there. But since it would be better to refactor all case-related rules at once and me having missed header-case back then, I decided to start with 'fixing' the subject-case. 😇

escapedcat · 2023-04-10T09:47:25Z

Ha, that's what I get for having opinions on breaking changes; suddenly people think I have more opinions about breaking changes! 😆

😆 It's because you do it so well! (btw can I quote this in a tweet?)

[...] I'm not a maintainer of this project [...]

Wanna talk?

Jokes aside, thanks for your feedback, apprecaited and helpful! ❤️

@amariq I guess then we could continue and merge your fetaure, right?

gilmoreorless · 2023-04-10T11:23:51Z

It's because you do it so well! (btw can I quote this in a tweet?)

Haha go for it.

Wanna talk?

I've already fallen for that trap once; not again! 😆

escapedcat · 2023-04-10T11:25:24Z

I've already fallen for that trap once; not again! 😆

amariq · 2023-04-10T11:33:00Z

I guess then we could continue and merge your fetaure, right?

If everything seems fine to you then yeah, I think its good to go 😀👌

amariq added the feature label Mar 29, 2023

amariq mentioned this issue Mar 31, 2023

feat(rules): expand Latin-only characters limitation for subject-case rule with Unicode support #3575

Merged

7 tasks

escapedcat closed this as completed Apr 13, 2023

amariq mentioned this issue Apr 14, 2023

fix(rules): avoid processing strings with case-less Letter category symbols in subject-case #3586

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: support unicode for `subject-case` rule #3574

feat: support unicode for `subject-case` rule #3574

amariq commented Mar 29, 2023

escapedcat commented Mar 29, 2023

amariq commented Mar 29, 2023

amariq commented Mar 30, 2023

escapedcat commented Mar 30, 2023

amariq commented Mar 30, 2023 •

edited

escapedcat commented Mar 30, 2023

amariq commented Mar 31, 2023

escapedcat commented Mar 31, 2023

gilmoreorless commented Apr 8, 2023

amariq commented Apr 8, 2023

escapedcat commented Apr 10, 2023

gilmoreorless commented Apr 10, 2023

escapedcat commented Apr 10, 2023

amariq commented Apr 10, 2023

feat: support unicode for subject-case rule #3574

feat: support unicode for subject-case rule #3574

Comments

amariq commented Mar 29, 2023

Expected Behavior

Current Behavior

Affected packages

Possible Solution

Context

escapedcat commented Mar 29, 2023

amariq commented Mar 29, 2023

amariq commented Mar 30, 2023

escapedcat commented Mar 30, 2023

amariq commented Mar 30, 2023 • edited

escapedcat commented Mar 30, 2023

amariq commented Mar 31, 2023

escapedcat commented Mar 31, 2023

gilmoreorless commented Apr 8, 2023

amariq commented Apr 8, 2023

escapedcat commented Apr 10, 2023

gilmoreorless commented Apr 10, 2023

escapedcat commented Apr 10, 2023

amariq commented Apr 10, 2023

feat: support unicode for `subject-case` rule #3574

feat: support unicode for `subject-case` rule #3574

amariq commented Mar 30, 2023 •

edited