Ensure that non-ASCII strings match themselves #16

bk2204 · 2019-08-28T16:42:10Z

In the wildmatch code, we escape the strings in a pattern by iterating over each byte and appending the result of each loop iteration to a new string. If the byte is not part of an escape sequence, we append it by converting it to a string.

However, this does not produce the expected results. When an integral value, such as a byte, is converted to a string, the value is interpreted as a rune (a Unicode code point). Consequently, each byte with a value greater than 127 in the original string was turned into the UTF-8 sequence for the Latin-1 character with that value.
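To make the overencoding concrete, here is a minimal sketch of the conversion behavior described above. The helper name `viaRune` and the sample string are illustrative assumptions, not code from wildmatch.go.

```go
package main

import "fmt"

// viaRune is a hypothetical helper that converts a single byte to a string
// with a plain conversion. Go treats the byte value as a Unicode code point
// and UTF-8 encodes it, so values above 127 become two bytes.
func viaRune(b byte) string {
	return string(b)
}

func main() {
	s := "\u00e9" // "é", stored as the UTF-8 bytes 0xc3 0xa9
	out := ""
	for i := 0; i < len(s); i++ {
		out += viaRune(s[i])
	}
	fmt.Printf("% x\n", s)   // c3 a9
	fmt.Printf("% x\n", out) // c3 83 c2 a9
	fmt.Println(out == s)    // false
}
```

Each of the two original bytes is re-encoded as its own two-byte sequence, so the rebuilt string is twice as long and no longer equals the input.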

Because an overencoded string does not match the original string, any attempt to have a non-ASCII string match itself would fail. Solve this by first placing the byte value in a byte slice and then converting that slice to a string.
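The fix can be sketched as follows. This is an illustration under assumptions rather than the actual patch: `appendBytes` is a hypothetical stand-in for the escaping loop, and the sample string is an arbitrary non-Latin-1 value, not the one used in the tests.

```go
package main

import "fmt"

// appendBytes rebuilds s one byte at a time, as a hypothetical escaping
// loop might. Wrapping each byte in a one-element byte slice before the
// conversion copies the byte verbatim instead of encoding it as a rune.
func appendBytes(s string) string {
	out := ""
	for i := 0; i < len(s); i++ {
		out += string([]byte{s[i]})
	}
	return out
}

func main() {
	s := "\u4e16\u754c" // arbitrary sample outside Latin-1
	fmt.Println(appendBytes(s) == s) // true
}
```

Converting a byte slice to a string copies the bytes as they are, so multi-byte UTF-8 sequences survive the round trip.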

For the curious, Google Translate reports that the string used in the tests is Chinese for "hello world". Using a string that is not in Latin-1 is preferable to a Latin-1 string because it makes it less likely that we're getting things right by accident.

/cc @yunshan as reporter
Fixes git-lfs/git-lfs#3794.


wildmatch.go


bk2204 requested a review from a team August 28, 2019 16:42

ttaylorr reviewed Aug 28, 2019


wildmatch.go (review comment resolved)

ttaylorr approved these changes Aug 30, 2019


bk2204 merged commit 97e6412 into git-lfs:master Aug 30, 2019

bk2204 deleted the unicode-matching branch August 30, 2019 13:41

bk2204 added a commit that referenced this pull request Sep 10, 2019

Merge pull request #16 from bk2204/unicode-matching

87c0f52

