[Documentation] Add docs explaining the differences between Regex Source generator vs the inbox engines when using RegexOptions.IgnoreCase #70214

Closed
joperezr opened this issue Jun 3, 2022 · 1 comment · Fixed by #73814
Labels
area-System.Text.RegularExpressions documentation Documentation bug or enhancement, does not impact product or test code

joperezr commented Jun 3, 2022

cc: @GrabYourPitchforks @stephentoub

Today, when using RegexOptions.IgnoreCase, Regex uses the regex case equivalence table to transform the pattern into an equivalent case-sensitive pattern so that searches can be performed more efficiently. For example, it converts the case-insensitive pattern A|B into the case-sensitive pattern [Aa]|[Bb]. When using the source generator, this transformation happens at build time, and the resulting case-sensitive pattern is embedded into the compiled assembly/program. That leads to a couple of behavioral differences, compared to using a non-source-generator engine in Regex, that are worth calling out:

  • The transformation of the case-insensitive pattern into a case-sensitive pattern happens at build time, so the regex casing table used is the one that was present at build time. If the data in the casing table changes between the time the assembly was compiled and the time it runs, the source-generated engine still uses the compile-time data rather than the runtime data. For example: imagine that you are a library developer targeting .NET 7, and your code uses the Regex source generator to perform some case-insensitive searches. The casing transformations of your pattern then use the .NET 7 regex casing table data. Now imagine that an application running on a future version of .NET (.NET 8, .NET 9, etc.) consumes your library, and that one of the letters in your pattern has a new case mapping in that version. Because your library was built against the .NET 7 regex casing table data, it isn't aware of the new mapping, so at runtime the engine will fail to match input that relies on it. If you had instead used one of the built-in engines, the new mapping would be honored, because the transformation of the case-insensitive pattern into the case-sensitive pattern happens at runtime using the newer regex casing table.
  • As explained in issue [API Proposal] Add cultureName constructors to GeneratedRegex #59492, CurrentCulture matters when determining which case mappings are used for the transformation, since there are some (very few) cases where the mappings change depending on the culture. With the source generator, the culture used to determine the mappings is selected at build time, whereas with one of the built-in engines it is selected at runtime.
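
To make the two points above concrete, here is a minimal sketch rather than proposed docs content. It assumes a .NET 7+ project with the Regex source generator enabled and full (non-invariant) globalization data. The class name `CasingDemo` and the method names `TurkishI` and `EnglishI` are made up for illustration. The Turkish "i" is used because it is one of the few letters whose case mappings depend on the culture.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static partial class CasingDemo
{
    // Source generator: the case equivalences for "i" are computed when this
    // project is compiled. The third argument pins the culture explicitly
    // instead of relying on whatever culture the build happens to run under.
    [GeneratedRegex("i", RegexOptions.IgnoreCase, "tr-TR")]
    public static partial Regex TurkishI();

    [GeneratedRegex("i", RegexOptions.IgnoreCase, "en-US")]
    public static partial Regex EnglishI();

    public static void Main()
    {
        // The rewrite described in this issue: both regexes should agree on these inputs.
        var ignoreCase = new Regex("A|B", RegexOptions.IgnoreCase);
        var expanded = new Regex("[Aa]|[Bb]");
        foreach (string input in new[] { "a", "B", "c" })
        {
            Console.WriteLine($"{input}: {ignoreCase.IsMatch(input)} / {expanded.IsMatch(input)}");
        }

        // Built-in engine: the equivalences are computed when the Regex is
        // constructed, using CultureInfo.CurrentCulture at that moment.
        CultureInfo original = CultureInfo.CurrentCulture;
        try
        {
            CultureInfo.CurrentCulture = new CultureInfo("tr-TR");
            var runtimeTurkish = new Regex("i", RegexOptions.IgnoreCase);

            CultureInfo.CurrentCulture = new CultureInfo("en-US");
            var runtimeEnglish = new Regex("i", RegexOptions.IgnoreCase);

            // Probe with 'I' (U+0049) and 'İ' (U+0130), the two candidates for uppercase "i".
            foreach (string input in new[] { "I", "\u0130" })
            {
                Console.WriteLine($"U+{(int)input[0]:X4}: " +
                    $"runtime tr-TR={runtimeTurkish.IsMatch(input)}, " +
                    $"runtime en-US={runtimeEnglish.IsMatch(input)}, " +
                    $"generated tr-TR={TurkishI().IsMatch(input)}, " +
                    $"generated en-US={EnglishI().IsMatch(input)}");
            }
        }
        finally
        {
            CultureInfo.CurrentCulture = original;
        }
    }
}
```

The generated regexes give the same answers no matter what CurrentCulture is when the program runs, while the constructor-based ones depend on the culture that was current when each Regex was created.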

This issue tracks the work of adding official documentation that calls out these differences between the source-generated engine and the built-in engines.

Related: #69039

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 3, 2022

ghost commented Jun 3, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.


@joperezr joperezr added the documentation Documentation bug or enhancement, does not impact product or test code label Jun 3, 2022
@joperezr joperezr added this to the 7.0.0 milestone Jun 3, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jun 3, 2022
@joperezr joperezr self-assigned this Jun 3, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Aug 11, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Aug 12, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Sep 11, 2022