Basic Regex Parser #139

esthicodes · 2023-02-08T11:08:51Z

esthicodes
Feb 8, 2023
Maintainer

The solution for this kind of pattern matching is usually done by recursion. Review first the algorithm below and then read the explanation that follows.

Pseudocode:

function isMatch(text, pattern):
return isMatchHelper(text, pattern, 0, 0)

Input:

text - the text to check,

pattern - the regular expression to be checked,

textIndex - the index the text is checked from

patIndex - the index the pattern is checked from

Output:

true if the text from the index textIndex matches

the regular expression pattern from the pattern Index.

E.g. isMatchHelper(“aaabb”,”cab*”,2, 1) since the text

from index 2 (“abb”) matches the pattern from index 1 (“ab*”)

function isMatchHelper(text, pattern, textIndex, patIndex):
# base cases - one of the indexes reached the end of text or pattern
if (textIndex >= text.length):
if (patIndex >= pattern.length):
return true
else:
if (patIndex+1 < pattern.length) AND (pattern[patIndex+1] == '*'):
return isMatchHelper(text, pattern, textIndex, patIndex + 2)
else:
return false

else if (patIndex >= pattern.length) AND (textIndex < text.length):
    return false

# string matching for character followed by '*'
else if (patIndex+1 < pattern.length) AND (pattern[patIndex+1] == '*'):
    if (pattern[patIndex] == '.') OR (text[textIndex] == pattern[patIndex]):
        return (isMatchHelper(text, pattern, textIndex, patIndex + 2) OR
                isMatchHelper(text, pattern, textIndex + 1, patIndex))
    else:
        return isMatchHelper(text, pattern, textIndex, patIndex + 2)

# string matching for '.' or ordinary char.
else if (pattern[patIndex] == '.') OR
        (pattern[patIndex] == text[textIndex]):
    return  isMatchHelper(text, pattern, textIndex + 1, patIndex + 1)
else:
    return false

Note: in the code, a text of length N begins in position 0 and ends in position N-1.

As we can see, this is a recursive function that has many cases to cover: First the base cases - if textIndex points to the end of the text:

If the pattern also points to the end of the text, then both are the empty string and both match.
If the pattern has some character followed by '', then the only possibility that the text matches is if the character before the '' is matched exactly 0 times, thus incrementing the pattern index forward by two (after the '' symbol).
Otherwise, the text has ended but the pattern has not, thus the strings don’t match. Another similar base case is if the pattern has ended but the text hasn’t, which also means that the two don’t match.
Next we skip to the last four lines in the code, to the cases where pattern[patIndex + 1] is not '', i.e. the current character is not followed by '*':

If the pattern[patIndex] is '.', or the (pattern[patIndex] == text[textIndex]) we proceed to check the next indexes, as the current indexes match.
Otherwise, the text and pattern don’t match.
Finally the case where the current character in the pattern is followed by the '*' sign:

If the current characters match or the pattern in the current index is '.' then:

The 'x*' in the pattern (where x is the current character in the pattern) is matched as a sequence of zero characters - in which case we simply increment patIndex by 2.
The 'x*' is matched to the current character along with the next character - in which case we increment the textIndex only.
Time Complexity: in the worst case, the solution above runs in time exponential in the size of the pattern. Patterns that take the most time, are the ones with multiple '' symbols, that may provoke an exponential number of recursion calls: For example, for the text “bbbbbbbb” and the pattern “....a”, this function will open a new recursive call to itself for every single way to divide the text in four (the number of “.”).

Space Complexity: the space required is also exponential because of the number of recursion calls filling up the stack. There are, in fact, algorithms that solve the matching problem in polynomial time and space. They are based on Nondeterministic Finite State Machines, which we didn’t provide here due to the fact that it requires more knowledge in theoretical computer science.

def is_match(text, pattern):
  if not text:
    return True
  i = 0
  j = 0
  while i < len(text) and j < len(pattern):
    if pattern[j] == "*":
      # while loop
      char = pattern[j-1]
      # text abbb: pattern ab*
      #text = "abbb", pattern = "ab*" -> True
      # text = "abbb", pattern = "a.*" -> True
      while i+1 < len(text) and (char == '.' or text[i+1] == char): # 
        i += 1
    # text = "aa", pattern = "ab*"
    elif pattern[j] != "." and pattern[j] != text[i] and j+1 < len(pattern) and pattern[j+1] != "*":
        return False
    i += 1
    j += 1
  return i == len(text) 
# .*.*.* 

# runtime complexity O(n+m) len(text) = n len(pattern) = m
# space complexity O(1)

print(is_match(text = "", pattern = "a"))
#print(is_match(text = "aa", pattern = "aa"))
#print(is_match(text = "abc", pattern = "a.c"))
#print(is_match(text = "abbb", pattern = "ab*"))
#print(is_match(text = "acd", pattern = "ab*c."))

  

# iterate over text
# 
  
  # input:  text = "aa", pattern = "a" -> False
  # input:  text = "aa", pattern = "a." -> True

  # text = "aa", pattern = "aa" -> True
  
  # text = "abbb", pattern = "ab*" -> True

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Basic Regex Parser #139

{{title}}

Replies: 0 comments

Select a reply

Basic Regex Parser #139

esthicodes Feb 8, 2023 Maintainer

Input:

text - the text to check,

pattern - the regular expression to be checked,

textIndex - the index the text is checked from

patIndex - the index the pattern is checked from

Output:

true if the text from the index textIndex matches

the regular expression pattern from the pattern Index.

E.g. isMatchHelper(“aaabb”,”cab*”,2, 1) since the text

from index 2 (“abb”) matches the pattern from index 1 (“ab*”)

Replies: 0 comments

esthicodes
Feb 8, 2023
Maintainer