Investigate more regex reduction improvements #126104

@danmoseley

Analyzed all 17,434 real-world patterns from Regex_RealWorldPatterns.json across three tracks:

  • Track A (Tree analysis): Ran smell-detection on reduced trees; found 6 categories of potential improvements. Confirmed existing reductions (Concat-with-Empty, Single-child-Concat/Alternate) are already fully effective. FinalReduce() (PR #125289, "Improve regex optimizer through investigation of regex optimizer passes") correctly simplifies 195 patterns (1.1%).
  • Track B (Source generator codegen): Sampled 872 patterns then analyzed all 17,434. Most codegen "smells" (large code, many gotos) are inherent to pattern complexity; found 639 capture-free patterns that generate unnecessary backtracking infra.
  • Track C (Cross-engine comparison with Rust): Compared literal extraction strategies against Rust's regex-cli. Discovered .NET misses suffix-based literal search — Rust extracts suffixes that could enable SIMD-accelerated IndexOf on 374+ patterns with no usable prefix.
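
To make the Track C idea concrete, here is a minimal sketch of suffix-based literal search. It is written in Python purely for illustration and is not the .NET or Rust implementation. The helper name `find_with_suffix`, the sample pattern, and the hand-supplied suffix literal are all hypothetical; a real engine would extract the suffix from the pattern tree itself and would verify candidates with a reverse scan rather than the brute-force loop used here. The sketch returns the match ending at the earliest suffix occurrence, which in general need not be the leftmost match a conventional search would report.

```python
import re

def find_with_suffix(pattern, suffix, text):
    """Find a match of `pattern` by first locating a literal every match must end with.

    `suffix` is supplied by hand here (assumption); an engine would derive it
    from the pattern. Returns (start, end) or None.
    """
    search_from = 0
    while True:
        # Fast literal scan: this is the step an engine could vectorize.
        i = text.find(suffix, search_from)
        if i < 0:
            return None  # the required literal is absent, so no match is possible
        end = i + len(suffix)
        # Verify the candidate: look for a start such that the pattern matches
        # exactly text[start:end]. Brute force, for clarity only.
        for start in range(0, i + 1):
            if pattern.fullmatch(text, start, end):
                return (start, end)
        search_from = i + 1  # false positive; move on to the next occurrence

# Usage: the pattern starts with a character class, so there is no literal
# prefix to search for, but every match ends with "@example.com".
pat = re.compile(r"[a-z]+@example\.com")
text = "contact: alice@example.com, bob@example.com"
span = find_with_suffix(pat, "@example.com", text)
print(text[span[0]:span[1]])
```

The point of the design is that text containing no occurrence of the suffix is rejected by the literal scan alone, without ever running the regex engine.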

Priority Ranking (impact × feasibility)

| # | Finding | Perf Win | Cost | Risk | Verdict |
|---|---------|----------|------|------|---------|
| 1 | Suffix search | ★★★★★ (5-30x for FindFirstChar on 374+ patterns) | Medium (200-400 lines, new feature) | Moderate | Best ROI: significant win, manageable scope |
| 2 | Shared-prefix bailout | ★☆☆☆☆ (negligible for 80%, modest for ~50) | Trivial (1 line) | Near zero | Just do it: trivial fix for a clear bug |
| 3 | Redundant-Atomic | ☆☆☆☆☆ (zero measurable) | Trivial (4 lines) | Zero | Just do it: free cleanup |
| 4 | Better atomicity | ★★★☆☆ (potentially large but hard to quantify) | Hard (100-300 lines, complex analysis) | High | Defer: high risk, uncertain reward |
