Incorrect Regex matching in Turkish culture when ignoring case #58958

veanes · 2021-09-10T17:51:32Z

Description

The combination of ignoring case and using intervals that involve \u0130 (Turkish capital I with dot) and \u0131 (Turkish lowercase i without dot) gives wrong matching results, as the repro shows.

Configuration

.NET 6.0 preview

Regression?

Seems so. At least the below code works correctly in .NET 5.0.

Other information

Expected behavior is that the following code prints True, but it prints False.
The pattern below must trivially match the input because every letter of the input falls in the corresponding interval.
IgnoreCase can only add letters (not remove letters), so the match must hold.
If the IgnoreCase option is omitted, the code works correctly.

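using System;
using System.Globalization;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Each input character lies inside the interval at the same position in the pattern.
            string input = "I\u0131\u0130i";
            string pattern = "[H-J][\u0131-\u0140][\u0120-\u0130][h-j]";

            // Construct the regex while tr-TR is the current culture, then restore the original culture.
            var culture = CultureInfo.CurrentCulture;
            CultureInfo.CurrentCulture = new CultureInfo("tr-TR");
            Regex re = new Regex(pattern, RegexOptions.IgnoreCase);
            CultureInfo.CurrentCulture = culture;

            // Expected: True. Observed on .NET 6.0 preview: False.
            Console.WriteLine(re.IsMatch(input));
        }
    }
}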
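
The claim that every input letter falls in its interval can be checked without any regex or culture involved, using plain ordinal comparison of the char values. The following is a minimal sketch of that check; `InRange` and `AllInIntervals` are hypothetical helper names introduced here for illustration and are not part of the original report.

```csharp
using System;

// Hypothetical helper: true when c lies in the inclusive ordinal range [lo, hi].
static bool InRange(char c, char lo, char hi) => lo <= c && c <= hi;

// Hypothetical helper: checks each input char against the interval at the same position.
static bool AllInIntervals(string input, (char Lo, char Hi)[] intervals)
{
    if (input.Length != intervals.Length) return false;
    for (int i = 0; i < input.Length; i++)
    {
        if (!InRange(input[i], intervals[i].Lo, intervals[i].Hi)) return false;
    }
    return true;
}

// The four intervals of the pattern "[H-J][\u0131-\u0140][\u0120-\u0130][h-j]".
var intervals = new[]
{
    ('H', 'J'),
    ('\u0131', '\u0140'),
    ('\u0120', '\u0130'),
    ('h', 'j'),
};

// Prints True: each char of the input sits inside its interval by code point alone.
Console.WriteLine(AllInIntervals("I\u0131\u0130i", intervals));
```

Because the check uses only code point order, it supports the argument that a case-sensitive match already holds, so adding case-insensitive alternatives should not be able to turn the match into a failure.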

ghost · 2021-09-10T17:51:36Z

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Author:	veanes
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`, `untriaged`
Milestone:	-

GrabYourPitchforks · 2021-09-10T18:05:30Z

Appears to be a legit regression. Per https://docs.microsoft.com/dotnet/standard/base-types/regular-expression-options#default-options, regex matching uses the current culture by default unless RegexOptions.CultureInvariant is passed as an argument.

If this change was intentional, we should add an entry to the breaking changes doc.

stephentoub · 2021-09-21T16:25:59Z

There's a simpler repro that doesn't involve changing cultures:

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        var r = new Regex("[\u0120-\u0130]", RegexOptions.IgnoreCase);
        bool result = r.IsMatch("\u0130");
        Console.WriteLine(result);
    }
}

That prints true on .NET Framework 4.8 and .NET 5.0, but false on .NET 6.0.

This is due to #42282. With that change, the character class created for [\u0120-\u0130] includes only that one range and no lowercase versions of the characters in it. However, when the matcher then does char.ToLower('\u0130') as part of the match, it produces a value outside of that range: every culture except the invariant culture lowercases \u0130 to \u0069. Before #42282, the set was also augmented to include \u0069.
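
To see the lowercasing behavior described above on a given machine, a small sketch like the following prints what char.ToLower produces for 'I' and '\u0130' under a few cultures. `CodePoint` and `LowerIn` are hypothetical helper names used only for this illustration. The actual output depends on the runtime version and its globalization data, and the culture lookups assume the runtime is not in invariant-globalization mode, so no expected output is listed here.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: format a char as U+XXXX so dotted and dotless forms are distinguishable.
static string CodePoint(char c) => $"U+{(int)c:X4}";

// Hypothetical helper: lowercase c under an explicit culture instead of the current one.
static char LowerIn(CultureInfo culture, char c) => char.ToLower(c, culture);

// Assumption: these cultures are available on the machine running the sketch.
var cultures = new[]
{
    CultureInfo.InvariantCulture,
    CultureInfo.GetCultureInfo("en-US"),
    CultureInfo.GetCultureInfo("tr-TR"),
};

foreach (var culture in cultures)
{
    // The invariant culture has an empty name, so give it a readable label.
    string name = culture.Name.Length == 0 ? "invariant" : culture.Name;
    foreach (char c in new[] { 'I', '\u0130' })
    {
        Console.WriteLine($"{name}: {CodePoint(c)} -> {CodePoint(LowerIn(culture, c))}");
    }
}
```

If a culture maps \u0130 to a code point outside [\u0120-\u0130], a set that contains only that range cannot match the lowercased input, which is the mismatch described in this comment.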

This is complicated code. #42282 was fixing #36149, which has been around forever. I suggest we revert #42282 for .NET 6, as that PR was just trading off one set of bugs for another, and instead stick with the bugs we've had there since the dawn of this codebase. We can revisit for .NET 7.

cc: @jeffhandley, @pgovind, @danmoseley

danmoseley · 2021-09-21T17:18:46Z

Seems reasonable.

jeffhandley · 2021-09-21T17:18:59Z

Thanks for the investigation, @stephentoub! I agree with the recommendation.

@pgovind -- Can you take the assignment of reverting #42282? I recommend making 2 PRs: 1 that is a simple revert, and one that includes:

Tests that had been added to assert that "Regex is incorrectly handling casing of some ranges" (#36149) was fixed, but with the tests marked as ignored, referencing the issue
Tests that illustrate this issue using the simple repro, asserting that those tests pass with the revert in place

I imagine we'd only port the clean revert PR into 6.0 GA, without porting the extra tests.

pgovind · 2021-09-21T18:20:00Z

I just looked at this locally too and happily found this comment: #42282 (comment) discussing this exact behavior. Without going too deep, it stems from the way we populate the s_lcTable table. Essentially s_lcTable is pre-populated with the en-US culture, but the regex constructor calls culture.char.ToLower() and AddToLowercase which could use a different culture setting than s_lcTable. We could say this bug is a dupe (at the least, related) of #36147.

I'm still ok reverting #42282 if that's what we want to do. Technically this is a regression, yea, but we could reasonably make the case that is an acceptable breaking change and that this will get fixed when #36147 gets fixed.

jeffhandley · 2021-09-21T18:25:14Z

Thanks, @pgovind; that's really helpful info. At this stage in 6.0, I'd prefer to revert, get back to the preexisting behavior, and try again in 7.0.

stephentoub · 2021-09-21T21:45:27Z

> we could reasonably make the case that is an acceptable breaking change

FWIW, I don't think it's defensible that these don't all produce the same result:

using System;
using System.Text.RegularExpressions;

Console.WriteLine(new Regex("\u0130", RegexOptions.IgnoreCase).IsMatch("\u0130"));
Console.WriteLine(new Regex("[\u012F-\u0130]", RegexOptions.IgnoreCase).IsMatch("\u0130"));
Console.WriteLine(new Regex("[\u012F\u0130]", RegexOptions.IgnoreCase).IsMatch("\u0130"));

dotnet-issue-labeler bot added the area-System.Text.RegularExpressions and untriaged labels Sep 10, 2021

danmoseley added this to the 6.0.0 milestone Sep 10, 2021

danmoseley added the regression-from-last-release label Sep 10, 2021

GrabYourPitchforks mentioned this issue Sep 10, 2021

Inconsistent Regex matching behavior in InvariantCulture #58956

Closed

jeffschwMSFT removed the untriaged label Sep 10, 2021

jeffhandley assigned pgovind Sep 15, 2021

stephentoub assigned stephentoub and unassigned pgovind Sep 21, 2021

stephentoub assigned jeffhandley and unassigned stephentoub Sep 21, 2021

jeffhandley assigned pgovind Sep 21, 2021

jeffhandley added the blocking-RTM label Sep 21, 2021

pgovind mentioned this issue Sep 21, 2021

Revert "Fix incorrect handling of character range and capitalization … #59425

Merged

ghost added the in-pr label Sep 21, 2021

danmoseley added blocking-release and removed blocking-RTM labels Sep 22, 2021

stephentoub mentioned this issue Sep 22, 2021

[API Proposal] Add cultureName constructors to GeneratedRegex #59492

Closed

pgovind closed this as completed in #59425 Sep 23, 2021

ghost removed the in-pr label Sep 23, 2021

stephentoub mentioned this issue Sep 28, 2021

Merge main into feature/regexsrm dotnet/runtimelab#1603

Merged

ghost locked as resolved and limited conversation to collaborators Nov 3, 2021
