
Ensure that non-ASCII strings match themselves #16

Merged: 1 commit into git-lfs:master from unicode-matching on Aug 30, 2019

bk2204 (Member) commented Aug 28, 2019

In the wildmatch code, we escape the strings in a pattern by iterating over each byte and then appending the result of each iteration of the loop to a new string. If the character is not part of an escape sequence, we append it by casting it to a string.

However, this does not produce the expected results. In Go, converting an integral value, such as a byte, to a string interprets that value as a rune (a Unicode code point) and yields its UTF-8 encoding. Consequently, each byte with a value greater than 127 in the original string was being turned into the multi-byte UTF-8 sequence for the Latin-1 character with that value.
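A minimal sketch of the conversion behaviour described above. The helper names `viaRune` and `viaSlice` are hypothetical and exist only for this illustration; they are not part of the wildmatch code.

```go
package main

import "fmt"

// viaRune mirrors the problematic conversion: the byte's numeric value is
// treated as a Unicode code point and UTF-8 encoded. string(rune(b)) behaves
// the same as string(b) for a byte b. (Hypothetical helper.)
func viaRune(b byte) string {
	return string(rune(b))
}

// viaSlice keeps the byte as-is by converting a one-element byte slice.
// (Hypothetical helper.)
func viaSlice(b byte) string {
	return string([]byte{b})
}

func main() {
	b := byte(0xe4) // 228, which is "ä" in Latin-1

	fmt.Printf("% x\n", viaRune(b))  // c3 a4 (UTF-8 encoding of U+00E4)
	fmt.Printf("% x\n", viaSlice(b)) // e4 (the original byte, unchanged)
}
```

For bytes below 128 the two conversions agree, which is why ASCII-only patterns were unaffected.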

Because the over-encoded string does not match the original, any attempt to have a non-ASCII string match itself would fail. Solve this by first wrapping the byte value in a one-element byte slice and then converting that slice to a string.
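The effect on a whole string can be sketched as follows. This is a simplified stand-in, not the actual wildmatch escaping loop: `copyBuggy` and `copyFixed` are hypothetical names, escape handling is omitted, and the input is an arbitrary non-Latin-1 string rather than the one used in the tests.

```go
package main

import "fmt"

// copyBuggy rebuilds s byte by byte using the integer-to-string conversion,
// so every byte above 127 is expanded into a two-byte UTF-8 sequence.
// (Hypothetical helper; escape handling omitted.)
func copyBuggy(s string) string {
	out := ""
	for i := 0; i < len(s); i++ {
		out += string(rune(s[i]))
	}
	return out
}

// copyFixed rebuilds s byte by byte via a one-element byte slice, so each
// byte is appended unchanged. (Hypothetical helper; escape handling omitted.)
func copyFixed(s string) string {
	out := ""
	for i := 0; i < len(s); i++ {
		out += string([]byte{s[i]})
	}
	return out
}

func main() {
	s := "世界" // illustrative input: six bytes, all above 127

	fmt.Println(copyBuggy(s) == s) // false
	fmt.Println(copyFixed(s) == s) // true
	fmt.Println(len(s), len(copyBuggy(s)), len(copyFixed(s))) // 6 12 6
}
```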

For the curious, Google Translate reports that the string used in the tests is Chinese for "hello world". Using a string that is not in Latin-1 is preferable to a Latin-1 string because it makes it less likely that we're getting things right by accident.

/cc @yunshan as reporter
Fixes git-lfs/git-lfs#3794.

@bk2204 bk2204 requested a review from a team August 28, 2019 16:42
Review thread on wildmatch.go (resolved)
@bk2204 bk2204 merged commit 97e6412 into git-lfs:master Aug 30, 2019
@bk2204 bk2204 deleted the unicode-matching branch August 30, 2019 13:41
bk2204 added a commit that referenced this pull request Sep 10, 2019
Ensure that non-ASCII strings match themselves
Development

Successfully merging this pull request may close these issues.

git lfs pull skip download file