## Regular Expressions

Regular expressions are text-matching patterns described with a formal syntax. A pattern can express many kinds of rules, from literal text to repetition and sets of allowed characters.

### Searching for Patterns in Text
One of the most common uses of the <code>re</code> module is finding patterns in text:

In [1]:
import re

In [2]:
# List of patterns to search for
patterns = ['term1', 'term2']

In [3]:
# Text to parse
text = 'This is a string with term1 but not the other term'

In [4]:
for pattern in patterns:
    print(f'Searching for {pattern} in:\n {text}\n')
    
    # Check for a match
    if re.search(pattern, text):
        print('Match was found. \n')
    else:
        print('No Match was found.\n')

Searching for term1 in:
 This is a string with term1 but not the other term

Match was found. 

Searching for term2 in:
 This is a string with term1 but not the other term

No Match was found.



<code>re.search</code> returns a <code>Match</code> object if the pattern is found. If the pattern is not found, it returns <code>None</code>, whose type is <code>NoneType</code>:

In [5]:
type(re.search('h', 'w'))

NoneType

In [6]:
type(re.search('hello', 'hello world'))

_sre.SRE_Match

The <code>Match</code> object returned by <code>re.search()</code> is more than a Boolean flag: it contains information about the match, including the original input string, the regular expression that was used, and the location of the match:

In [7]:
match = re.search(patterns[0], text)

In [8]:
match.start()

22

In [9]:
match.end()

27
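
Beyond <code>start()</code> and <code>end()</code>, the other pieces of information mentioned above can be read from the same object. The following is a minimal sketch that reuses the text from the earlier cells; the <code>is not None</code> guard is an addition for illustration, since <code>re.search</code> returns <code>None</code> when nothing matches.

```python
import re

text = 'This is a string with term1 but not the other term'
match = re.search('term1', text)

# Guard against None before touching the match object
if match is not None:
    print(match.group())      # the substring that matched
    print(match.span())       # (start, end) as one tuple
    print(match.re.pattern)   # the pattern string that was used
    print(match.string)       # the original input string
```

Slicing the input with the reported positions, <code>text[match.start():match.end()]</code>, gives back the same substring as <code>match.group()</code>.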

### Splitting with a Regular Expression

In [10]:
split_term = '@'

In [11]:
phrase = 'What is your email, is it hello@gmail.com?'

In [12]:
re.split(split_term, phrase)

['What is your email, is it hello', 'gmail.com?']
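
The split term above is a plain character, but <code>re.split</code> accepts any pattern. As a small sketch, the variant below splits on either <code>@</code> or a comma by using the bracket syntax covered under Character Sets; the variable name <code>parts</code> is chosen here only for illustration.

```python
import re

phrase = 'What is your email, is it hello@gmail.com?'

# Split wherever an '@' or a ',' appears
parts = re.split('[@,]', phrase)
print(parts)
# ['What is your email', ' is it hello', 'gmail.com?']
```

Note that the space after the comma stays attached to the second piece, because only the comma itself is treated as the separator.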

### Finding All Instances of a Pattern

In [13]:
re.findall('match', 'Here is one match, here is another match')

['match', 'match']

In [14]:
def multi_re_find(patterns, phrase):
    '''
    Takes in a list of regex patterns
    Prints a list of all matches
    '''
    for pattern in patterns:
        print('Searching the phrase using the re check: %r' %(pattern))
        print(re.findall(pattern, phrase))
        print('\n')

In [15]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd...sdd'

test_patterns = ['sd*',          # s followed by zero or more d's
                 'sd+',          # s followed by one or more d's
                 'sd?',          # s followed by zero or one d's
                 'sd{3}',        # s followed by three d's
                 'sd{2,3}',      # s followed by two to three d's
                ]

multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: 'sd*'
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd', 'sdd']


Searching the phrase using the re check: 'sd+'
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd', 'sdd']


Searching the phrase using the re check: 'sd?'
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd', 'sd']


Searching the phrase using the re check: 'sd{3}'
['sddd', 'sddd', 'sddd', 'sddd']


Searching the phrase using the re check: 'sd{2,3}'
['sddd', 'sddd', 'sddd', 'sddd', 'sdd']




### Character Sets

Character sets are used when you wish to match any one of a group of characters at a point in the input. Brackets are used to construct a character set. For example, the pattern <code>[ab]</code> matches either <b>a</b> or <b>b</b>.

In [16]:
test_patterns2 = ['[sd]',    # either s or d
                  's[sd]+']   # s followed by one or more s or d

multi_re_find(test_patterns2, test_phrase)

Searching the phrase using the re check: '[sd]'
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd', 's', 'd', 'd']


Searching the phrase using the re check: 's[sd]+'
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd', 'sdd']




The first pattern, <code>[sd]</code>, returns every individual s or d. The second pattern, <code>s[sd]+</code>, returns each run that begins with an s and continues with s or d characters until some other character is reached.

### Exclusion

We can use <code>^</code> to exclude characters by placing it at the start of a bracketed set. For example, <code>[^...]</code> matches any single character that is not listed in the brackets.

In [17]:
test_phrase2 = 'This is a string! But it has punctuation. How can we remove it?'

Use <code>[^!.? ]</code> to match any character that is not <code>!</code>, <code>.</code>, <code>?</code>, or a space. Adding a <code>+</code> matches one or more such characters in a row, which in effect finds the words.

In [18]:
re.findall('[^!.? ]+', test_phrase2)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']
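
The test phrase asks how the punctuation can be removed. One possible answer, sketched here under the assumption that single spaces between words are acceptable, is to join the words found by the exclusion pattern; the names <code>words</code> and <code>cleaned</code> are illustrative.

```python
import re

test_phrase2 = 'This is a string! But it has punctuation. How can we remove it?'

# Collect the runs of characters that are not punctuation or spaces,
# then glue them back together with single spaces
words = re.findall('[^!.? ]+', test_phrase2)
cleaned = ' '.join(words)
print(cleaned)
# This is a string But it has punctuation How can we remove it
```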

### Character Ranges

A common use case is to search for a specific range of letters in the alphabet. For instance, <code>[a-f]</code> matches any single lowercase letter from a through f.



In [19]:
test_phrase3 = 'This is an example sentence. Lets see if we can find some letters.'

test_patterns3 = ['[a-z]+',      # sequences of lower case letters
                  '[A-Z]+',      # sequences of upper case letters
                  '[a-zA-Z]+',   # sequences of lower or upper case letters
                  '[A-Z][a-z]+'] # one upper case letter followed by lower case letters
                
multi_re_find(test_patterns3, test_phrase3)

Searching the phrase using the re check: '[a-z]+'
['his', 'is', 'an', 'example', 'sentence', 'ets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']


Searching the phrase using the re check: '[A-Z]+'
['T', 'L']


Searching the phrase using the re check: '[a-zA-Z]+'
['This', 'is', 'an', 'example', 'sentence', 'Lets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']


Searching the phrase using the re check: '[A-Z][a-z]+'
['This', 'Lets']
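
The <code>[a-f]</code> range mentioned at the start of this section is not exercised by the patterns above. As a short sketch, it can be tried on the same phrase; the name <code>found</code> is chosen here for illustration.

```python
import re

test_phrase3 = 'This is an example sentence. Lets see if we can find some letters.'

# Without a + each match is a single character in the range a through f
found = re.findall('[a-f]', test_phrase3)
print(found)

# Every result is exactly one lowercase letter from the range
print(all(len(ch) == 1 and 'a' <= ch <= 'f' for ch in found))
```

Because the range is lowercase only, letters outside it such as <code>h</code> or <code>s</code>, and any uppercase letters, are skipped.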




### Escape Codes

You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. 

Escapes are indicated by prefixing the character with a backslash <code>\</code>. Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read. Using raw strings, created by prefixing the literal value with <code>r</code>, eliminates this problem and maintains readability.
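
To make the difference concrete, the sketch below writes the same pattern both ways and confirms that the two spellings produce an identical string. The sample sentence and the variable names are made up for illustration.

```python
import re

escaped = '\\d+'   # normal string: the backslash is doubled
raw = r'\d+'       # raw string: written exactly as the regex reads

print(escaped == raw)   # True
print(len(raw))         # 3, the characters \ d +

print(re.findall(raw, 'Order 66 shipped in 3 boxes'))
# ['66', '3']
```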

In [20]:
test_phrase4 = 'This is a string with some numbers 1233 and a symbol #hashtag'

test_patterns4 = [r'\d+', # sequence of digits
                  r'\D+', # sequence of non-digits
                  r'\s+', # sequence of whitespace
                  r'\S+', # sequence of non-whitespace
                  r'\w+', # sequence of word characters (letters, digits, underscore)
                  r'\W+', # sequence of non-word characters
                 ]

multi_re_find(test_patterns4, test_phrase4)

Searching the phrase using the re check: '\\d+'
['1233']


Searching the phrase using the re check: '\\D+'
['This is a string with some numbers ', ' and a symbol #hashtag']


Searching the phrase using the re check: '\\s+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


Searching the phrase using the re check: '\\S+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', '#hashtag']


Searching the phrase using the re check: '\\w+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', 'hashtag']


Searching the phrase using the re check: '\\W+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #']
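
Escape codes can also be combined with literal characters. As a small sketch on the same phrase, placing a literal <code>#</code> in front of <code>\w+</code> picks out the hashtag together with its symbol; the name <code>tags</code> is illustrative.

```python
import re

test_phrase4 = 'This is a string with some numbers 1233 and a symbol #hashtag'

# A literal '#' followed by one or more word characters
tags = re.findall(r'#\w+', test_phrase4)
print(tags)
# ['#hashtag']
```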


