This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

limit allowed characters to UTF-8 range (0x10FFFF) #444

Merged: 2 commits, Jan 13, 2022


IonBazan
Contributor

@IonBazan IonBazan commented Oct 21, 2021

Requirements

This change limits the characters recognized as valid to code points up to 0x10FFFF, the maximum allowed by the UTF-8 specification.

Description of the Change

This makes the regular expressions PCRE-compliant: in PCRE, matched characters must not fall outside the UTF-8 bounds (see https://www.pcre.org/original/doc/html/pcreunicode.html).
Since UTF-8 is the standard encoding for PHP files according to PSR-1, I don't see a point in supporting any invalid Unicode characters.
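
To illustrate the bounded range, here is a minimal Python sketch of an identifier pattern whose upper bound is 0x10FFFF. It is not the grammar's actual regex: `VARIABLE` and `variable_name` are hypothetical names, and the character classes are only an assumption about what a PHP-style variable pattern might look like.

```python
import re

# Highest code point that UTF-8 can encode.
MAX_CODE_POINT = 0x10FFFF

# Hypothetical pattern (an assumption, not copied from the grammar):
# "$", then a letter, underscore, or any code point from 0x7F up to
# 0x10FFFF, followed by the same set plus digits.
VARIABLE = re.compile(
    r"\$[a-zA-Z_\x7f-\U0010FFFF][a-zA-Z0-9_\x7f-\U0010FFFF]*"
)


def variable_name(source: str):
    """Return the variable matched at the start of source, or None."""
    match = VARIABLE.match(source)
    return match.group(0) if match else None
```

Non-ASCII names such as `$café` still match, because everything from 0x7F to 0x10FFFF stays inside the class; only the out-of-range upper bound is gone.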

Alternate Designs

Benefits

The actual motivation for this change is to let GitHub Linguist support this grammar, since Linguist sticks to strict PCRE rules.

Before: [image]

After: [image]

Possible Drawbacks

Any character above 0x10FFFF will no longer be recognized as part of a variable name, but that shouldn't occur in real-world code.
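
As a rough check of why this drawback should be theoretical, the Python sketch below tests whether a code point can exist in UTF-8 text at all. `is_encodable` is a hypothetical helper written for this illustration; it relies only on the built-in `chr` and `str.encode`.

```python
# Highest code point that UTF-8 can encode.
MAX_CODE_POINT = 0x10FFFF


def is_encodable(code_point: int) -> bool:
    """Report whether code_point can be represented as UTF-8 text."""
    try:
        # chr() rejects values above 0x10FFFF; encode() rejects surrogates.
        chr(code_point).encode("utf-8")
    except (ValueError, OverflowError, UnicodeEncodeError):
        return False
    return True
```

Under this check, `is_encodable(0x10FFFF)` holds while `is_encodable(0x110000)` and `is_encodable(0x7FFFFFFF)` do not, so a valid UTF-8 source file cannot contain the characters that stop matching.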

Applicable Issues

github-linguist/linguist#5522

While awaiting workflow run approval, let me just confirm that tests are passing locally.

@KapitanOczywisty
Contributor

@sadick254 This is ready to be merged. Other PRs could introduce more occurrences of 7fffffff, so this should probably be merged after them, and a "replace all" might be needed afterwards.
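
A "replace all" of that kind could be sketched in Python as follows. This assumes the bound appears in the grammar text as `\x{7fffffff}`, which is a guess at the spelling; `clamp_upper_bound` is a hypothetical helper, not part of this repository.

```python
import re
from typing import Tuple

# Assumed spelling of the old and new bounds in the grammar source.
OLD_BOUND = r"\x{7fffffff}"
NEW_BOUND = r"\x{10ffff}"


def clamp_upper_bound(grammar_text: str) -> Tuple[str, int]:
    """Replace every old bound; return the new text and the replacement count."""
    pattern = re.compile(re.escape(OLD_BOUND), re.IGNORECASE)
    # A function replacement keeps the backslash in NEW_BOUND literal.
    return pattern.subn(lambda _match: NEW_BOUND, grammar_text)
```

A non-zero count after rebasing would show that another PR reintroduced the old bound.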

@darangi
Contributor

darangi commented Jan 13, 2022

@IonBazan could you take a look at the conflicts?

@IonBazan
Contributor Author

@darangi fixed 😉

@darangi
Contributor

darangi commented Jan 13, 2022

Thanks for the contribution @IonBazan 🙇🏾

@darangi darangi merged commit b029889 into atom:master Jan 13, 2022
@IonBazan IonBazan deleted the utf-8-compliance branch January 13, 2022 10:50
3 participants