RegexUtility: miscellaneous string manipulation and regex operations

Ahmad Mageed edited this page Mar 10, 2014 · 11 revisions

The static RegexUtility class features a number of useful methods. A summary of categories and methods appears below. Visit each section for further details and examples.

  • Split Methods

    • Split
    • SplitRemoveEmptyEntries
    • SplitIncludeDelimiters
    • SplitMatchWholeWords
    • SplitTrimWhitespace
  • Formatting Methods

    • TrimWhitespace
    • FormatCamelCase
  • Named Groups Conversion Methods

    • MatchesToNamedGroupsDictionaries
    • MatchesToNamedGroupsLookup

Split Methods

  • Split: performs a split on the given delimiters and accepts flag enum options that can be combined to perform specific actions:
    • SplitOptions.IncludeDelimiters: includes the delimiters in the split result
    • SplitOptions.MatchWholeWords: splits the input by matching whole words based on the delimiters
    • SplitOptions.TrimWhitespace: trims leading and trailing whitespace in split results
    • SplitOptions.RemoveEmptyEntries: removes empty split result entries
    • SplitOptions.All: splits using all of the above SplitOptions
  • SplitRemoveEmptyEntries: this method has 2 overloads
    • convenience method to Split with SplitOptions.RemoveEmptyEntries
    • accepts a regex pattern, performs a Regex.Split, then removes any empty split result entries
  • SplitIncludeDelimiters: convenience method to Split with SplitOptions.IncludeDelimiters
  • SplitMatchWholeWords: convenience method to Split with SplitOptions.MatchWholeWords
  • SplitTrimWhitespace: convenience method to Split with SplitOptions.TrimWhitespace

Split Signature and Note

The Split method's signature is:

string[] Split(string input, string[] delimiters, RegexOptions regexOptions = RegexOptions.None, SplitOptions splitOptions = SplitOptions.None)

Note: all delimiters are escaped. In other words, regex metacharacters are ignored.
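To illustrate what escaping means in practice, the sketch below builds a split pattern from literal delimiters with the standard Regex.Escape method. It is an assumption about how such a pattern could be built, not the library's actual implementation, and BuildDelimiterPattern is a hypothetical helper. It is written as C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: escapes each delimiter so that regex metacharacters
// such as "." and "(" are matched literally, then joins them with "|".
string BuildDelimiterPattern(string[] delimiters) =>
    string.Join("|", delimiters.Select(d => Regex.Escape(d)));

string pattern = BuildDelimiterPattern(new[] { "()", ".", "?" });
Console.WriteLine(pattern); // \(\)|\.|\?

// Unescaped, "." would match any character; escaped, it matches only a literal dot.
string[] parts = Regex.Split("a.b", pattern);
Console.WriteLine(string.Join(",", parts)); // a,b
```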

The following examples will focus on the various SplitOptions.

Split with SplitOptions.IncludeDelimiters

Splitting usually excludes the delimiters. This option uses a pattern that includes them in the result.

string input = "123xx456yy789";
string[] delimiters = { "xx", "yy" };
var result = RegexUtility.Split(input, delimiters, splitOptions: SplitOptions.IncludeDelimiters);
// { "123", "xx", "456", "yy", "789" }
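One way to get this behavior with plain .NET is to wrap the delimiter alternation in a capturing group, because Regex.Split adds captured text to its result. The following is a minimal sketch under that assumption; the pattern RegexUtility actually generates may differ.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "123xx456yy789";
string[] delimiters = { "xx", "yy" };

// Escape each delimiter, then wrap the alternation in a capturing group.
// Regex.Split includes text captured by groups in the returned array.
string pattern = "(" + string.Join("|", delimiters.Select(d => Regex.Escape(d))) + ")";

string[] result = Regex.Split(input, pattern);
Console.WriteLine(string.Join(", ", result)); // 123, xx, 456, yy, 789
```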

Split with SplitOptions.MatchWholeWords

With this option, a delimiter splits the input only where it appears as a whole word. Words that merely contain the delimiter, such as "StackOverflow" and "OverStack" below, are left intact.

string input = "StackOverflow Stack OverStack";
string[] delimiters = { "Stack" };
var result = RegexUtility.Split(input, delimiters, splitOptions: SplitOptions.MatchWholeWords);
// { "StackOverflow ", " OverStack" }
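Whole-word matching can be approximated with the regex word-boundary anchor \b. The sketch below reproduces the result above with plain Regex.Split; using \b is an assumption about the approach, not a statement of how the library builds its pattern.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "StackOverflow Stack OverStack";
string[] delimiters = { "Stack" };

// \b asserts a word boundary on both sides, so "Stack" inside
// "StackOverflow" or "OverStack" does not match.
string pattern = @"\b(?:" + string.Join("|", delimiters.Select(d => Regex.Escape(d))) + @")\b";

string[] result = Regex.Split(input, pattern);
Console.WriteLine(string.Join("|", result)); // StackOverflow | OverStack
```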

Split with SplitOptions.TrimWhitespace

Without TrimWhitespace, the result below would have been { "Hello ", " World" } (note the trailing and leading whitespace). TrimWhitespace removes it.

string input = "Hello . World";
string[] delimiters = { "." };
var result = RegexUtility.Split(input, delimiters, splitOptions: SplitOptions.TrimWhitespace);
// { "Hello", "World" }

Split with SplitOptions.RemoveEmptyEntries

Splitting can produce empty entries (""), typically when the input starts or ends with a delimiter. This option removes them. Without it, the result below would have been: { "", " Hello ", " World", "" }.

string input = "() Hello . World?";
string[] delimiters = { "()", ".", "?" };
var result = RegexUtility.Split(input, delimiters, splitOptions: SplitOptions.RemoveEmptyEntries);
// { " Hello ", " World" }

Split with SplitOptions.All and Custom Combinations

SplitOptions can be combined using the OR | operator. SplitOptions.All combines all the options: IncludeDelimiters | MatchWholeWords | TrimWhitespace | RemoveEmptyEntries.

string input = "Stack StackOverflow Stack OverStack Stack";
string[] delimiters = { "Stack" };
var result = RegexUtility.Split(input, delimiters, splitOptions: SplitOptions.All);
// { "Stack", "StackOverflow", "Stack", "OverStack", "Stack" }
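A custom combination is written the same way, for example SplitOptions.TrimWhitespace | SplitOptions.RemoveEmptyEntries. As a rough illustration of what that pair would be expected to produce, the sketch below emulates both steps with plain .NET calls. The order of the steps (trim first, then drop empty entries) is an assumption, and this is not the library's code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "() Hello . World?";
string[] delimiters = { "()", ".", "?" };
string pattern = string.Join("|", delimiters.Select(d => Regex.Escape(d)));

string[] result = Regex.Split(input, pattern)   // { "", " Hello ", " World", "" }
    .Select(s => s.Trim())                      // emulates TrimWhitespace
    .Where(s => s.Length > 0)                   // emulates RemoveEmptyEntries
    .ToArray();

Console.WriteLine(string.Join(", ", result)); // Hello, World
```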

SplitRemoveEmptyEntries

Takes a regex pattern, performs a Regex.Split, and removes empty entries. For the input below, a plain Regex.Split would have returned: { "", "hello", "world", "goodbye", "", "world", "" }

var input = "x hello x world x goodbye !x world!";
var pattern = @"\s*[x!]\s*";
var result = RegexUtility.SplitRemoveEmptyEntries(input, pattern);
// { "hello", "world", "goodbye", "world" }
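The pattern overload is described as a Regex.Split followed by removal of empty entries. A minimal equivalent in plain .NET might look like the following; SplitRemoveEmptyEntriesSketch is a hypothetical name used only for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical stand-in: split on the pattern, then drop empty strings.
string[] SplitRemoveEmptyEntriesSketch(string input, string pattern) =>
    Regex.Split(input, pattern).Where(s => s.Length > 0).ToArray();

var input = "x hello x world x goodbye !x world!";
var pattern = @"\s*[x!]\s*";

string[] raw = Regex.Split(input, pattern);
string[] cleaned = SplitRemoveEmptyEntriesSketch(input, pattern);

Console.WriteLine(raw.Length);                 // 7 (three of them empty)
Console.WriteLine(string.Join(", ", cleaned)); // hello, world, goodbye, world
```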

Formatting Methods

  • TrimWhitespace
  • FormatCamelCase

TrimWhitespace

Removes leading, trailing, and duplicate whitespace (consecutive whitespace in the middle of inputs).

var result = RegexUtility.TrimWhitespace("   Hello    World   ");
// "Hello World"
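For comparison, the same effect can be sketched with standard calls: trim the ends, then collapse each run of whitespace into a single space. TrimWhitespaceSketch is a hypothetical name, and treating tabs and newlines like spaces (via \s) is an assumption about the intended behavior.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in: Trim() removes leading/trailing whitespace,
// then every run of whitespace characters is replaced by one space.
string TrimWhitespaceSketch(string input) =>
    Regex.Replace(input.Trim(), @"\s+", " ");

string result = TrimWhitespaceSketch("   Hello    World   ");
Console.WriteLine(result); // Hello World
```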

FormatCamelCase

Formats PascalCase (upper CamelCase) and (lower) camelCase words to a friendly format separated by the given delimiter (space by default). It also accepts a CamelCaseOptions enum.

It properly handles acronyms too. For example "XML" is properly preserved when given an input of "PickUpXMLInFiveDays". The result is "Pick Up XML In Five Days".

CamelCaseOptions:

  • CapitalizeFirstCharacter: capitalizes the first character of camelCase inputs
  • CapitalizeFirstCharacterInvariantCulture: same as above, using the invariant culture

FormatCamelCase Examples

RegexUtility.FormatCamelCase("PascalCase")        // Pascal Case
RegexUtility.FormatCamelCase("camelCase42", "_")  // camel_Case_42

// Returns "Camel Case" (first C is now uppercase)
RegexUtility.FormatCamelCase("camelCase", camelCaseOptions: CamelCaseOptions.CapitalizeFirstCharacter);
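To show how the word boundaries, including the acronym case, can be detected, here is a sketch based on zero-width lookarounds. It is one possible approach and not necessarily the pattern the library uses; FormatCamelCaseSketch is a hypothetical helper and does not implement CamelCaseOptions.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not the library's code. Inserts the delimiter at three
// kinds of zero-width boundaries:
//   1. lowercase letter or digit followed by an uppercase letter ("camel|Case")
//   2. uppercase letter followed by uppercase + lowercase ("XML|In")
//   3. letter followed by a digit ("Case|42")
string FormatCamelCaseSketch(string input, string delimiter = " ")
{
    const string boundary =
        @"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?<=[A-Za-z])(?=[0-9])";
    // A MatchEvaluator is used so the delimiter is inserted literally.
    return Regex.Replace(input, boundary, _ => delimiter);
}

Console.WriteLine(FormatCamelCaseSketch("PascalCase"));          // Pascal Case
Console.WriteLine(FormatCamelCaseSketch("camelCase42", "_"));    // camel_Case_42
Console.WriteLine(FormatCamelCaseSketch("PickUpXMLInFiveDays")); // Pick Up XML In Five Days
```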

Named Groups Conversion Methods

These methods expect a pattern with named groups and will convert the named groups to specific collections.

  • MatchesToNamedGroupsDictionaries
  • MatchesToNamedGroupsLookup

Matches To Named Groups Dictionaries

Returns an array of Dictionary<string, string>, one dictionary per match, with the named groups as keys and each group's matched text as the value.

var input = "123-456-7890 hello 098-765-4321";
var pattern = @"(?<AreaCode>\d{3})-(?<First>\d{3})-(?<Last>\d{4})";
var results = RegexUtility.MatchesToNamedGroupsDictionaries(input, pattern);

This code returns the following result:

[Image: Named Groups Dictionaries]
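A comparable structure can be built with the standard Regex API, which helps show the expected shape of the data. This is a sketch and not the library's implementation. Filtering out numbered groups such as "0" is an assumption based on the description "named groups as the keys".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

var input = "123-456-7890 hello 098-765-4321";
var regex = new Regex(@"(?<AreaCode>\d{3})-(?<First>\d{3})-(?<Last>\d{4})");

// GetGroupNames also returns numbered groups such as "0" (the whole match),
// so keep only names that are not plain integers.
string[] names = regex.GetGroupNames().Where(n => !int.TryParse(n, out _)).ToArray();

// One dictionary per match: group name -> matched text.
Dictionary<string, string>[] results = regex.Matches(input)
    .Cast<Match>()
    .Select(m => names.ToDictionary(n => n, n => m.Groups[n].Value))
    .ToArray();

foreach (var dict in results)
    Console.WriteLine(string.Join(", ", names.Select(n => n + "=" + dict[n])));
// AreaCode=123, First=456, Last=7890
// AreaCode=098, First=765, Last=4321
```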

Matches To Named Groups Lookup

Returns an ILookup<string, string> keyed by named group, where each key maps to that group's values across all matches.

var input = "123-456-7890 hello 098-765-4321";
var pattern = @"(?<AreaCode>\d{3})-(?<First>\d{3})-(?<Last>\d{4})";
var result = RegexUtility.MatchesToNamedGroupsLookup(input, pattern);

This code returns the following result:

[Image: Named Groups Lookup]
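The lookup shape can likewise be sketched with standard LINQ: flatten every (group name, value) pair across all matches, then group by name with ToLookup. As before, this is an illustration under the assumption that numbered groups are excluded, not the library's own code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var input = "123-456-7890 hello 098-765-4321";
var regex = new Regex(@"(?<AreaCode>\d{3})-(?<First>\d{3})-(?<Last>\d{4})");

// Skip numbered groups such as "0" (the whole match).
string[] names = regex.GetGroupNames().Where(n => !int.TryParse(n, out _)).ToArray();

// Flatten to (name, value) pairs, then group the values by group name.
ILookup<string, string> lookup = regex.Matches(input)
    .Cast<Match>()
    .SelectMany(m => names.Select(n => new { Name = n, Value = m.Groups[n].Value }))
    .ToLookup(x => x.Name, x => x.Value);

Console.WriteLine(string.Join(", ", lookup["AreaCode"])); // 123, 098
Console.WriteLine(string.Join(", ", lookup["Last"]));     // 7890, 4321
```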