ISXB-449 fix for turkish letter i #1664

Merged
ritamerkl merged 4 commits into develop from UUM-31521-fix-case-insensitive-reformatting on Apr 18, 2023
Conversation

Collaborator

@ritamerkl ritamerkl commented Mar 29, 2023

Description

Peter Pimley observed this bug. He suggested that someone else on the team look over the fix, because there could be more places where similar bugs appear. It would be great if the reviewer could validate that this solution is appropriate for all cases that may come up.

Changes made

FIXED: Changed char comparison to use ToLower(CultureInfo) or ToLowerInvariant() to avoid problems with special characters like the Turkish letter i.
ADDED: Test for Turkish culture.
Link to the Bug
Link to the user Ticket
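
As an illustration of the FIXED entry above, here is a minimal sketch of a case-insensitive char comparison that does not depend on the current culture. The class and method names are hypothetical and are not taken from the InputSystem source; only `char.ToLowerInvariant` and `char.ToLower(char, CultureInfo)` are real .NET calls.

```csharp
using System.Globalization;

// Hypothetical helper for illustration; not part of the InputSystem source.
public static class CharComparisonSketch
{
    // Compares two chars case-insensitively without consulting the current culture.
    public static bool EqualsIgnoreCase(char a, char b)
    {
        // Fast path for identical chars, then fold both sides with invariant rules.
        return a == b || char.ToLowerInvariant(a) == char.ToLowerInvariant(b);
    }

    // Equivalent form that passes the invariant culture explicitly.
    public static bool EqualsIgnoreCaseExplicit(char a, char b)
    {
        return char.ToLower(a, CultureInfo.InvariantCulture)
            == char.ToLower(b, CultureInfo.InvariantCulture);
    }
}
```

Both forms should give the same answer regardless of the thread's current culture, which is the property the fix relies on.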

Notes

Because James is unfortunately off sick, this PR is being merged without his approval.

Checklist

Before review:

  • Changelog entry added.
    • Explains the change in Changed, Fixed, Added sections.
    • For API change contains an example snippet and/or migration example.
    • FogBugz ticket attached, example ([case %number%](https://issuetracker.unity3d.com/issues/...)).
    • FogBugz is marked as "Resolved" with next release version correctly set.
  • Tests added/changed, if applicable.
    • Functional tests Area_CanDoX, Area_CanDoX_EvenIfYIsTheCase, Area_WhenIDoX_AndYHappens_ThisIsTheResult.
    • Performance tests.
    • Integration tests.
  • Docs for new/changed APIs.
    • Xmldoc cross references are set correctly.
    • Added explanation how the API works.
    • Usage code examples added.
    • The manual is updated, if needed.

During merge:

  • Commit message for squash-merge is prefixed with one of the list:
    • NEW: ___.
    • FIX: ___.
    • DOCS: ___.
    • CHANGE: ___.
    • RELEASE: 1.1.0-preview.3.


unity-cla-assistant commented Mar 29, 2023

CLA assistant check
All committers have signed the CLA.

@peter-pimley-unity

There are many places in the InputSystem source code where either string.ToLower or char.ToLower (or ToUpper) are called. I guess that many (perhaps all) of them are done with the end goal of performing case-insensitive comparisons. For example here is one that attempts to do a case-insensitive match against the string "ignore". Will that suffer from the same problem?

Ideally we would do an audit of all occurrences of ToLower or ToUpper on strings and chars in the source, and establish whether they should be culture-aware. I would guess that nearly all of them should not be.

There are some tips for string comparisons here: https://learn.microsoft.com/en-us/dotnet/standard/base-types/best-practices-strings
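
For a whole-string match such as the "ignore" case mentioned above, one option in line with the linked guidance is an ordinal, case-insensitive comparison instead of lowercasing first. This is only a sketch: the helper name `IsIgnoreKeyword` is hypothetical, and the actual call site in the InputSystem source is not shown in this thread.

```csharp
using System;

// Hypothetical helper for illustration; not part of the InputSystem source.
public static class IgnoreKeywordSketch
{
    // Returns true if value equals "ignore" in any letter casing.
    // OrdinalIgnoreCase does not consult the current culture.
    public static bool IsIgnoreKeyword(string value)
    {
        return string.Equals(value, "ignore", StringComparison.OrdinalIgnoreCase);
    }
}
```

Besides being culture-independent, this avoids creating a temporary lowercased string for each comparison.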

@ritamerkl ritamerkl requested a review from jimon March 29, 2023 15:52
Contributor

jimon commented Mar 29, 2023

There are definitely a lot of places, and this fix alone is likely not sufficient, but I'm struggling to find a good strategy for fixing all of them. Maybe just by using the IDE's "find all" tool?

@peter-pimley-unity

Yes I think simply a search in Visual Studio etc. for "ToUpper" and "ToLower". The number of occurrences is high enough that it's more than just a 5 minute job but not so large as to be infeasible.

Comment thread Packages/com.unity.inputsystem/InputSystem/Plugins/HID/HID.cs Outdated
Comment thread Packages/com.unity.inputsystem/InputSystem/Controls/InputControlPath.cs Outdated
@ritamerkl ritamerkl requested review from jamesmcgill and jimon April 6, 2023 10:56
// thus be shorter than matchTo and still match.

var matchToLowerCase = matchTo.ToLower();
var matchToStringUpper = matchTo.ToString().ToUpperInvariant();
Collaborator

The ToString() will cost a GC allocation as it needs to create a new temporary string.

I looked at matchTo and why you couldn't just do ToUpperInvariant, and I found it's because it's an InternedString. The docs say this is a special string class designed exactly for what you are doing here: case-insensitive comparisons.

However, that class looks like it needs to be updated too, as it might also be broken for Turkish letters. Maybe it makes sense for it to store its value internally in uppercase instead, and then you wouldn't need ToString() here either. I'm not sure how much InternedString is used, so it might be quite a bit of work.

Collaborator Author

This is a lot of work. Changing InternedString is a big change because it entails changes in many places.

Collaborator

@lyndon-unity lyndon-unity left a comment

Added suggestions on comparison without moving to uppercase or altering InternedString

return true; // Wildcard at end of string so rest is matched.

++posInStr;
nextChar = char.ToLower(str[posInStr]);
Collaborator

Could a fix be to pass CultureInfo.InvariantCulture?

E.g.
nextChar = char.ToLower(str[posInStr], CultureInfo.InvariantCulture);
or (if version allowed) :
nextChar = char.ToLowerInvariant(str[posInStr]);

and a few lines down

Collaborator

ToUpperInvariant is recommended here: https://learn.microsoft.com/en-us/dotnet/standard/base-types/best-practices-strings
I believe going in the other direction (lower) creates issues.
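
The linked guidance recommends normalizing to uppercase rather than lowercase when strings are normalized for comparison. A minimal sketch of that approach follows; the class and method names are hypothetical, and non-null input is assumed.

```csharp
// Hypothetical helper for illustration; not part of the InputSystem source.
public static class UpperInvariantKeySketch
{
    // Normalizes a name to a comparison key using invariant uppercase rules.
    // Assumes name is not null.
    public static string NormalizeKey(string name)
    {
        return name.ToUpperInvariant();
    }

    // Two names are treated as equal if their normalized keys are identical.
    public static bool SameKey(string a, string b)
    {
        return NormalizeKey(a) == NormalizeKey(b);
    }
}
```

Note that each call allocates a new string, so this form suits keys that are normalized once and stored rather than hot-path comparisons.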

Collaborator

Thanks James. I missed that comment in the Microsoft documents. I mainly suggested this as it seemed we were missing the invariantCulture altogether but agree we may need the upper case for a more complete fix.

Collaborator

ToUpperInvariant is using Invariant Culture rules so that should be enough without needing to pass CultureInfo.

Collaborator Author

@lyndon-unity @jamesmcgill I just updated the whole branch to use only culture-invariant char comparisons; this fixes the bug without changing to ToUpperInvariant(). I would suggest closing the ticket at this stage (as the bug is fixed) and creating a new refactoring ticket to clean up the string ToLowerInvariant -> ToUpperInvariant situation. Maybe it even makes sense to do it on a bigger scale, where we look at all of our string storage and comparison code.

Collaborator

I agree this is a good step and fixes the current bug. We have captured the future step in a task here:
https://jira.unity3d.com/browse/ISX-1380

if (posInMatchTo == matchToLength)
return false; // Matched all the way to end of matchTo but there's more in str after the wildcard.
}
else if (char.ToLower(nextChar) != matchToLowerCase[posInMatchTo])
Collaborator

as above comment

}

var charInComponent = component[indexInComponent];
if (charInComponent == nextCharInPath || char.ToLower(charInComponent) == char.ToLower(nextCharInPath))
Collaborator

and again as above

var first = firstList[startIndexInFirst + i];
var second = secondList[startIndexInSecond + i];

if (char.ToLower(first) != char.ToLower(second))
Collaborator

and one more as above

// thus be shorter than matchTo and still match.

var matchToLowerCase = matchTo.ToLower();
var matchToStringUpper = matchTo.ToString().ToUpperInvariant();
Collaborator

I'm wondering why InternedString returns str.ToString() rather than str.m_StringOriginalCase.
Could the latter avoid the GC allocation?

https://sourcegraph.com/github.com/Unity-Technologies/InputSystem@30277637e5e9986c8cad3be939ad16185ba6114a/-/blob/Packages/com.unity.inputsystem/InputSystem/Utilities/InternedString.cs?L27:19-27:33

public static implicit operator string(InternedString str)

Collaborator

But agree switching to uppercase internals in InternedString seems the right thing to do

@ritamerkl ritamerkl force-pushed the UUM-31521-fix-case-insensitive-reformatting branch from 3027763 to 1f76800 Compare April 12, 2023 08:47
@ritamerkl ritamerkl requested review from jamesmcgill, jimon and lyndon-unity and removed request for jimon April 12, 2023 08:53
Collaborator

@lyndon-unity lyndon-unity left a comment

I'm happy with this as a fix for the immediate issue, followed by a separate task (not a bug) on the backlog to refactor more widely to use the uppercase version. That is a better long-term fix, but it is a bigger change that we don't have time for this sprint.

However there is a test failure outstanding

IntegrationTests.Integration_CanSendAndReceiveEvents

@ritamerkl ritamerkl requested a review from lyndon-unity April 12, 2023 14:59
Collaborator

@lyndon-unity lyndon-unity left a comment

Overall I'm happy to proceed with this fix but ideally would like to see a new test case added to the input system test suite to make sure we do not regress with future changes.

Collaborator Author

ritamerkl commented Apr 14, 2023

> Overall I'm happy to proceed with this fix but ideally would like to see a new test case added to the input system test suite to make sure we do not regress with future changes.

After writing the test for this issue, I came across the "real problem":
The previous ToLower() call converted the ordinary uppercase 'I' into the lowercase Turkish dotless 'ı'. I found this in the Microsoft documentation:

"The casing operation that results from calling the ToLower() method takes the casing conventions of the current culture into account." See here

So the wrong conversion happens only for the uppercase 'I'. An ordinary dotted 'i' exists in the Turkish alphabet as well. The uppercase form of the dotless 'ı' is the same character as the ordinary uppercase 'I', and that is where the problem arises.
Calling the ToLower() overload that takes a CultureInfo argument prevents this, so we are all good now.
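
The behaviour described above can be sketched as follows. The class and method names are hypothetical, and `Describe` assumes that Turkish ("tr-TR") culture data is available on the platform; on a runtime configured without culture data, the constructor may throw or fall back to invariant behaviour.

```csharp
using System.Globalization;

// Hypothetical demo class; names are for illustration only.
public static class TurkishCasingSketch
{
    // Lowercases a char using the casing rules of the given culture.
    public static char Lower(char c, CultureInfo culture)
    {
        return char.ToLower(c, culture);
    }

    // Mimics a comparison that lowercases the input with the given culture
    // and compares it against an already lowercase expected char.
    public static bool MatchesLowercase(char input, char expectedLower, CultureInfo culture)
    {
        return Lower(input, culture) == expectedLower;
    }

    // Contrasts Turkish and invariant rules for the uppercase 'I'.
    public static string Describe()
    {
        // Assumption: Turkish culture data is installed.
        var turkish = new CultureInfo("tr-TR");

        // Under Turkish rules 'I' is expected to fold to dotless 'ı' (U+0131),
        // so a match against 'i' is expected to fail.
        bool turkishMatch = MatchesLowercase('I', 'i', turkish);

        // Under invariant rules 'I' folds to 'i'.
        bool invariantMatch = MatchesLowercase('I', 'i', CultureInfo.InvariantCulture);

        return $"tr-TR: {turkishMatch}, invariant: {invariantMatch}";
    }
}
```

Passing the culture as a parameter makes the difference explicit: the same input char gives different results depending only on which casing rules are applied.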

@ritamerkl ritamerkl requested a review from lyndon-unity April 14, 2023 10:11
@ritamerkl ritamerkl changed the title UUM-31521 fix for turkish letter i ISXB-449 fix for turkish letter i Apr 14, 2023
@ritamerkl ritamerkl removed the request for review from jamesmcgill April 18, 2023 13:19
@ritamerkl ritamerkl dismissed jamesmcgill’s stale review April 18, 2023 13:21

James is off sick and we agreed to land this PR to follow up on the discussed points in another ticket later

@ritamerkl ritamerkl merged commit 811e5c5 into develop Apr 18, 2023
@ritamerkl ritamerkl deleted the UUM-31521-fix-case-insensitive-reformatting branch April 18, 2023 15:27
