@axunonb
Member

@axunonb axunonb commented Nov 6, 2025

Changes

Selectors in Placeholders may now contain most Unicode characters when ParserSettings.SelectorCharFilter = SelectorFilterType.VisualUnicodeChars.

Disallowed characters are:

  • Characters with special functions: {}[]()\.?,
  • 68 non-visual Unicode characters: Control Characters (U+0000–U+001F, U+007F), Format Characters (Category: Cf), Directional Formatting (Category: Cf), Invisible Separator, Common Combining Marks (Category: Mn), Whitespace Characters (non-glyph spacing). All other Unicode characters are allowed in a selector.

Merged from "feat: Filter Selector chars by allowlist or blocklist" (#511):
  • Add class CharSet. It represents a set of characters that supports efficient storage and lookup for both ASCII and non-ASCII characters. The Parser uses it as an allow list or a block list. Parsing time for placeholders decreases by ~25% compared to v3.2.0 through v3.6.1 (see Benchmark).
  • Update Parser to use CharSet and handle the defined FilterType
  • Refactor ParserSettings: Re-order members, update internal properties to better align with class CharSet.
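
For readers unfamiliar with the idea, here is a minimal sketch of how a character set with a fast ASCII path and a fallback for non-ASCII characters could work as an allow list or a block list. The type `CharSetSketch` and its members are hypothetical names chosen for illustration; this is not the `CharSet` class added by this PR, whose actual layout and API may differ.

```csharp
using System.Collections.Generic;

// Hypothetical illustration only; NOT the CharSet class from this PR.
public sealed class CharSetSketch
{
    // ASCII characters are looked up in a fixed table (no hashing needed).
    private readonly bool[] _ascii = new bool[128];

    // Non-ASCII characters fall back to a hash set.
    private readonly HashSet<char> _nonAscii = new HashSet<char>();

    // true: listed characters are the only ones allowed (allow list).
    // false: listed characters are the only ones rejected (block list).
    private readonly bool _isAllowList;

    public CharSetSketch(IEnumerable<char> chars, bool isAllowList)
    {
        _isAllowList = isAllowList;
        foreach (var c in chars)
        {
            if (c < 128) _ascii[c] = true;
            else _nonAscii.Add(c);
        }
    }

    // True if the character was listed in the set.
    public bool Contains(char c) => c < 128 ? _ascii[c] : _nonAscii.Contains(c);

    // Applies the allow-list or block-list meaning to a single character.
    public bool IsAllowed(char c) => _isAllowList ? Contains(c) : !Contains(c);
}
```

With this sketch, `new CharSetSketch("abc_-", isAllowList: true)` accepts only the listed characters, while `new CharSetSketch("{}\u200B", isAllowList: false)` accepts everything except the listed ones.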

Example:

const string expected = "The Value";
var settings = new SmartSettings 
    { Parser = new ParserSettings { SelectorCharFilter = SelectorFilterType.VisualUnicodeChars } };
var smart = Smart.CreateDefaultSmartFormat(settings);
// Use the Unicode string as a selector of the placeholder
var template = "{Chinese 汉字测试}";
// Instead of the Dictionary, any other type supporting Unicode characters can be used
var result = smart.Format(template, new Dictionary<string, string> { { "Chinese 汉字测试", expected } });
Assert.That(result, Is.EqualTo(expected));

ParserSettings.SelectorCharFilter = SelectorFilterType.Alphanumeric is the default and allows alphanumeric characters plus _ and -.

Benchmark

after implementing class CharSet in Parser
Parser.ParseFormat("{SomePlaceholder1}{SomePlaceholder2}{SomePlaceholder3}{SomePlaceholder4}{SomePlaceholder5}");

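| Method  | N    | Mean     | Error  | StdDev  | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---------|------|----------|--------|---------|-------|---------|------|-----------|-------------|
| v3.6.1  | 1000 | 1,065 us | 231 us | 12.7 us | 7.48  | 0.02    | -    | 406.25 KB | 3.25        |
| This PR | 1000 | 776 us   | 81 us  | 4.5 us  | 5.65  | 0.03    | -    | 406.25 KB | 3.25        |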

This PR is about 27% faster than v3.6.1 (mean 776 us vs. 1,065 us).

Resolves #454

@axunonb axunonb requested a review from karljj1 November 6, 2025 15:15
@codecov

codecov bot commented Nov 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98%. Comparing base (058f615) to head (ab623c1).

Additional details and impacted files
@@         Coverage Diff          @@
##           main   #510    +/-   ##
====================================
+ Coverage    97%    98%    +1%     
====================================
  Files        99    100     +1     
  Lines      3431   3558   +127     
====================================
+ Hits       3339   3484   +145     
+ Misses       92     74    -18     


Refactored internal `ParserSettings` to convert instance-level properties and methods to static or const members.
@axunonb axunonb force-pushed the pr/unicode-in-selectors branch from a9ef0c5 to 6b1dbec Compare November 6, 2025 17:21
karljj1
karljj1 previously approved these changes Nov 7, 2025
Collaborator

@karljj1 karljj1 left a comment

Looks great :)

@axunonb
Member Author

axunonb commented Nov 7, 2025

Thanks for the review.

The PR in its present form is a major breaking change because it fundamentally alters how format strings are parsed:
Any existing application that uses format strings with previously illegal characters (such as spaces or non-ASCII Unicode) and relies on the parser to throw an error or fail will now find that those format strings parse successfully.

[Edit]
Resolved by #511, which was merged into this PR.

Implement proposals from review:

* SelectorFilterType.Alphanumeric: alphanumeric characters (upper and lower case), plus '_' and '-'
* SelectorFilterType.VisualUnicodeChars: All Unicode characters are allowed in a selector, except 68 non-visual characters: Control Characters (U+0000–U+001F, U+007F), Format Characters (Category: Cf), Directional Formatting (Category: Cf), Invisible Separator, Common Combining Marks (Category: Mn), Whitespace Characters (non-glyph spacing).
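
The two filter types can be pictured as per-character predicates. The sketch below is an approximation under stated assumptions, not the PR's implementation: `SelectorFilterSketch` and its methods are hypothetical names, "alphanumeric" is assumed to mean ASCII letters and digits, and the block list of 68 specific characters is approximated by whole Unicode categories (Control, Format, NonSpacingMark) plus whitespace other than the plain space. The plain space is treated as allowed because the PR's example uses one inside a selector.

```csharp
using System.Globalization;

// Hypothetical approximation; NOT the SmartFormat parser logic.
public static class SelectorFilterSketch
{
    // Characters with special functions, as listed in the PR description.
    private const string Reserved = "{}[]()\\.?,";

    // Assumption: "alphanumeric" means ASCII letters and digits, plus '_' and '-'.
    public static bool IsAlphanumericSelectorChar(char c) =>
        (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
        (c >= '0' && c <= '9') || c == '_' || c == '-';

    // Assumption: whole categories stand in for the 68 blocked characters.
    public static bool IsVisualSelectorChar(char c)
    {
        if (Reserved.IndexOf(c) >= 0) return false;

        // The PR example uses a plain space in a selector, so U+0020 passes here.
        if (c != ' ' && char.IsWhiteSpace(c)) return false;

        switch (char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.Control:        // U+0000–U+001F, U+007F, ...
            case UnicodeCategory.Format:         // Cf, e.g. U+200B, U+200E
            case UnicodeCategory.NonSpacingMark: // Mn, e.g. U+0301
                return false;
            default:
                return true;
        }
    }
}
```

Because the real block list names individual characters, this category-based sketch rejects more characters than the PR does (for example, every Mn character rather than only the common ones).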
@axunonb axunonb requested a review from imprima November 11, 2025 09:58
karljj1
karljj1 previously approved these changes Nov 11, 2025
Member

@imprima imprima left a comment

Excellent

@axunonb axunonb merged commit 3a4ad6f into main Nov 11, 2025
3 of 5 checks passed
@axunonb axunonb deleted the pr/unicode-in-selectors branch November 11, 2025 12:17


Development

Successfully merging this pull request may close these issues.

Extend the characters allowed for selectors

4 participants