# Regular Expression Notes



Regular expressions are text-matching patterns described with a formal syntax. We will often hear regular expressions referred to as __regex__ or __regexp__ in conversation. Regular expressions can express a variety of rules, from literal text matching to repetition and much more. As we advance in Python, we'll see that many of our parsing problems can be solved with regular expressions (they are also a __common interview question__).

### Searching for patterns in Text:

One of the most common uses for the __re module__ is finding patterns in text.

Example: use the search method in the re module to find some text.

In [28]:
import re

# List of patterns to search for
patterns = ['term1','term2']

# Text to parse
text = 'This is a string with term1, but it does not have the other term.'

for pattern in patterns:
    print ('Searching for "%s" in: \n"%s"'%(pattern,text))

    # Check for match
    if re.search(pattern,text):
        print ('\n')
        print ('Match was found. \n')

    else:
        print ('\n')
        print ('No Match was found. \n')

Searching for "term1" in: 
"This is a string with term1, but it does not have the other term."


Match was found. 

Searching for "term2" in: 
"This is a string with term1, but it does not have the other term."


No Match was found. 



As we have seen, __re.search()__ takes the pattern, scans the text, and returns a Match object. If the pattern is not found, it returns None.
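Because the result can be None, it is usually safest to test it before calling any method on it. The sketch below shows one way to do that; the helper name `describe_search` is made up for illustration and is not part of the re module.

```python
import re

def describe_search(pattern, text):
    """Return a short message saying whether pattern occurs in text.

    Hypothetical helper, used only to illustrate the None check.
    """
    match = re.search(pattern, text)
    if match is None:
        # re.search found nothing, so there is no Match object to inspect
        return 'no match'
    # group() returns the part of the text that matched
    return 'matched %r' % match.group()

text = 'This is a string with term1, but it does not have the other term.'
print(describe_search('term1', text))
print(describe_search('term2', text))
```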

In [30]:
# Example

# Pattern to search for
pattern = 'term1'

# Text to parse
text = 'This is a string with term1, but it does not have the other term'

match = re.search(pattern,text)

type(match)

_sre.SRE_Match

The Match object returned by the search() method above is more than just a Boolean or None: it contains information about the match, including the original input string, the regular expression that was used, and the location of the match.

### Methods we can use on the match object:

In [31]:
# Show start of match
match.start()

22

In [32]:
# Show end
match.end()

27
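Besides start() and end(), the Match object exposes the other pieces of information mentioned earlier. The following is a small sketch using the same text and pattern as above; it only uses standard attributes of the Match object.

```python
import re

text = 'This is a string with term1, but it does not have the other term'
match = re.search('term1', text)

# span() returns start and end together as a tuple
print(match.span())

# group() returns the text that actually matched
print(match.group())

# match.re is the compiled pattern; its .pattern attribute is the pattern string
print(match.re.pattern)

# match.string is the original input string that was searched
print(match.string == text)

# Slicing the input with start() and end() gives back the matched text
print(text[match.start():match.end()])
```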

## Split with regular expressions

__re.split()__ works much like the split() method on strings, except that the term to split on is treated as a regular expression.

In [23]:
# Term to split on
split_term = '@'

phrase = 'What is the domain name of someone with the email: hello@gmail.com'

# Split the phrase
re.split(split_term,phrase)

['What is the domain name of someone with the email: hello', 'gmail.com']

Note: 
__re.split()__ returns a list of the pieces of the string, with the term that was split on removed.
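Where re.split() differs from the string method is that the separator can be a pattern rather than fixed text. The sketch below uses a made-up phrase to show splitting on several delimiters at once, and how wrapping the separator in a capturing group keeps it in the result.

```python
import re

# Illustrative phrase with inconsistent separators
phrase = 'apples, oranges;bananas ,  pears'

# Split on a comma or semicolon, swallowing any whitespace around it
parts = re.split(r'\s*[,;]\s*', phrase)
print(parts)

# With a capturing group, the separator itself is kept in the list
kept = re.split('(@)', 'hello@gmail.com')
print(kept)
```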

### Finding all instances of a pattern

We can use __re.findall()__ to find all the instances of a pattern in a string.

In [24]:
# Returns a list of all matches
re.findall('match','test phrase match is in middle')

['match']
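The example above has only one occurrence, so here is a short sketch with a made-up phrase that contains several. It also shows re.finditer(), which yields Match objects instead of strings when the positions are needed.

```python
import re

phrase = 'match one, match two, and a final match'

# findall returns every non-overlapping occurrence as a list of strings
found = re.findall('match', phrase)
print(found)

# When nothing matches, the result is an empty list rather than None
print(re.findall('absent', phrase))

# finditer yields Match objects, so start() and end() are available
positions = [(m.start(), m.end()) for m in re.finditer('match', phrase)]
print(positions)
```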

### Pattern re Syntax

Regular expressions support a huge variety of patterns.
We can use __meta characters__ along with re to find specific types of patterns.

Example: 
Let's create a function that takes a list of regular expressions and a phrase to parse, and prints the matches found for each pattern:

In [25]:
# Example

def multi_re_find(patterns,phrase):
    '''
    Takes in a list of regex patterns
    Prints a list of all matches
    '''
    
    for pattern in patterns:
        print ('Searching the phrase using the re check: %r'%pattern)
        print (re.findall(pattern,phrase))
        print ('\n')

### Repetition Syntax

There are five ways to express repetition in a pattern:

* A pattern followed by the metacharacter `*` is repeated zero or more times.
* Replace the `*` with `+` and the pattern must appear at least once.
* Using `?` means the pattern appears zero or one time.
* For a specific number of occurrences, use `{m}` after the pattern, where m is replaced with the number of times the pattern should repeat.
* Use `{m,n}`, where m is the minimum number of repetitions and n is the maximum. Leaving out n (`{m,}`) means the pattern appears at least m times, with no maximum.

In [34]:
# Example

test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sddd'

test_patterns = ['sd*',       # s followed by zero or more d's
                 'sd+',       # s followed by one or more d's
                 'sd?',       # s followed by zero or one d's
                 'sd{3}',     # s followed by three d's
                 'sd{2,3}',   # s followed by two or three d's
                 'sd{2,3}?',  # non-greedy: s followed by only two d's
                ]

multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: 'sd*'
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sddd']


Searching the phrase using the re check: 'sd+'
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sddd']


Searching the phrase using the re check: 'sd?'
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']


Searching the phrase using the re check: 'sd{3}'
['sddd', 'sddd', 'sddd', 'sddd']


Searching the phrase using the re check: 'sd{2,3}'
['sddd', 'sddd', 'sddd', 'sddd']


Searching the phrase using the re check: 'sd{2,3}?'
['sdd', 'sdd', 'sdd', 'sdd']
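The open-ended form `{m,}` is not covered by the patterns above, so here is a short sketch on a smaller, made-up phrase. It also puts the greedy and non-greedy versions of `{2,3}` side by side: the greedy form takes as many d's as it can (up to three), while adding `?` makes it stop at the minimum.

```python
import re

# Illustrative phrase: s followed by one, two, three, and four d's
short_phrase = 'sd..sdd..sddd..sdddd'

# At least two d's, no upper limit
at_least_two = re.findall(r'sd{2,}', short_phrase)
print(at_least_two)

# Greedy: two or three d's, preferring three
greedy = re.findall(r'sd{2,3}', short_phrase)
print(greedy)

# Non-greedy: two or three d's, preferring two
lazy = re.findall(r'sd{2,3}?', short_phrase)
print(lazy)
```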




## Regular expression operations

Regular expressions use the backslash character `('\')` to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write `'\\\\'` as the pattern string, because the regular expression must be `\\`, and each backslash must be expressed as `\\` inside a regular Python string literal.

The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with `'r'`. So `r"\n"` is a two-character string containing `'\'` and `'n'`, while `"\n"` is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
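A quick way to convince yourself of this is to compare the two notations directly. The sketch below uses a made-up sample string and checks the lengths described above, then matches a literal backslash using both spellings of the pattern.

```python
import re

# "\n" is a single newline character; r"\n" is a backslash followed by 'n'
print(len("\n"))
print(len(r"\n"))

# Both spellings produce the same two-character pattern string
print(r'\\' == '\\\\')

# Illustrative text containing two literal backslashes
sample = r'a\b\c'

# Match a literal backslash: the raw string is easier to read
backslashes = re.findall(r'\\', sample)
print(len(backslashes))
```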

## Regular Expression Syntax

