Regular expressions are text-matching patterns described with a formal syntax.

They are good for finding repetition, matching text between files, etc.

A lot of parsing problems can be solved with regular expressions.

Python's regex module is re.

This lecture covers the fundamentals. Refer to the bootcamp notebook on regex and the re documentation for more examples.

In [1]:
# import the regex module so we can search for patterns in text

import re

In [2]:
# make list of patterns

patterns = ['term1', 'term2']

In [8]:
# make string object for analysis

text = 'this is a string with term1 but not the other term'

In [4]:
# re.search takes a pattern and a string to scan for that pattern

re.search(pattern = 'hello', string = 'hello_world') # returns a match object

<_sre.SRE_Match object; span=(0, 5), match='hello'>
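As a small illustrative sketch (the variable names found and missing are not from the original notebook), the snippet below shows the other possible outcome of re.search. When the pattern is absent it returns None rather than a match object, which is why the result can be tested directly in an if statement.

```python
import re

# A successful search returns a match object, which is truthy
found = re.search('hello', 'hello_world')

# An unsuccessful search returns None, which is falsy
missing = re.search('goodbye', 'hello_world')

print(found is not None)
print(missing is None)
```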

In [11]:
# see if the patterns were in the text

for pattern in patterns:
    
    print(f"\nSearching for '{pattern}' in: \n'{text}'")
    
    if re.search(pattern, text):
        print('\n')
        print("Match found!")
    else:
        print('\n')
        print("No match")


Searching for 'term1' in: 
'this is a string with term1 but not the other term'


Match found!

Searching for 'term2' in: 
'this is a string with term1 but not the other term'


No match


In [12]:
# let's take a closer look at match objects using term1 and text

match = re.search(patterns[0], text)

In [13]:
type(match)

_sre.SRE_Match

In [14]:
# match objects are more complex than a boolean

# they also contain info about the match: the regular expression used, the location of the match, and the original input string

match.start() # index of starting position

22

In [15]:
match.end() # index just past the last matched character

27

In [16]:
match.span() # tuple of start and end positions

(22, 27)
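Beyond positions, a match object also exposes the matched text, the input string, and the compiled pattern. The following is a minimal sketch using the standard group(), string, and re attributes; the variable names are chosen here for illustration only.

```python
import re

text = 'this is a string with term1 but not the other term'
match = re.search('term1', text)

start, end = match.span()

matched_text = match.group()     # the substring that matched
sliced_text = text[start:end]    # the same substring, recovered by slicing with the span
original = match.string          # the input string that was searched
used_pattern = match.re.pattern  # the pattern text of the compiled regular expression

print(matched_text, sliced_text, used_pattern)
```

Because end() points just past the match, text[start:end] reproduces the matched text exactly.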

In [17]:
# splitting with regular expressions (very useful)

split_term = '@'

phrase = 'What is your email? Is it hello@gmail.com?'

In [18]:
# re.split works much like the str.split() we've used before, but the separator is a regex pattern

re.split(split_term, phrase)

['What is your email? Is it hello', 'gmail.com?']
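Where re.split goes beyond str.split is that the separator can be a pattern. The sketch below is an illustrative extension, not part of the original lecture: it splits the same phrase on either '@' or '.' by using a character set, inside which '.' is treated as a literal dot.

```python
import re

phrase = 'What is your email? Is it hello@gmail.com?'

# With a plain one-character separator, re.split and str.split agree
same_as_str_split = re.split('@', phrase) == phrase.split('@')

# A character set lets us split on several different separators at once
parts = re.split('[@.]', phrase)

print(same_as_str_split)
print(parts)
```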

In [19]:
# re.findall() returns a list of all non-overlapping matches of a pattern

re.findall('match', 'here is one match and also another match')

['match', 'match']
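re.findall returns only the matched strings. If the positions are needed as well, one option is re.finditer, which yields a match object per hit. This is a short sketch, with the list name spans chosen here for illustration.

```python
import re

sentence = 'here is one match and also another match'

# finditer yields one match object per non-overlapping match, in order
spans = [m.span() for m in re.finditer('match', sentence)]

print(spans)
```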

In [20]:
# we will now use metacharacters to help us find patterns

# let's create a function that prints the results for each regular expression in a list, given a phrase

def multi_re_find(patterns, phrase):
    '''
    takes in list of all regex patterns and prints a list of all matches
    '''
    for pattern in patterns:
        
        print("Searching for the phrase: \n{}".format(pattern))
        
        print(re.findall(pattern, phrase))
        
        print('\n')

**from bootcamp regex file:**

### Repetition Syntax

There are five ways to express repetition in a pattern:

   1. A pattern followed by the meta-character <code>*</code> is repeated zero or more times. 
   2. Replace the <code>*</code> with <code>+</code> and the pattern must appear at least once. 
   3. Using <code>?</code> means the pattern appears zero or one time. 
   4. For a specific number of occurrences, use <code>{m}</code> after the pattern, where **m** is replaced with the number of times the pattern should repeat. 
   5. Use <code>{m,n}</code> where **m** is the minimum number of repetitions and **n** is the maximum. Leaving out **n** <code>{m,}</code> means the value appears at least **m** times, with no maximum.
    
Now we will see an example of each of these using our multi_re_find function:

In [22]:
test_phrase = 'sdsd..sssddd...sdddsdd...dsds...dsssss...sdddd'

test_patterns = ['sd*',         # s followed by zero or more d's - also matches a lone 's'
                'sd+',          # s followed by one or more d's - matches 'sd', 'sdd', 'sddd', ...
                'sd?',          # s followed by zero or one d - matches only 's' or 'sd'
                'sd{3}',        # s followed by exactly three d's - matches 'sddd'
                'sd{2,3}',      # s followed by two to three d's - matches 'sdd' or 'sddd'
                ]

multi_re_find(test_patterns,test_phrase)

Searching for the phrase: 
sd*
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sdd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']


Searching for the phrase: 
sd+
['sd', 'sd', 'sddd', 'sddd', 'sdd', 'sd', 'sdddd']


Searching for the phrase: 
sd?
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']


Searching for the phrase: 
sd{3}
['sddd', 'sddd', 'sddd']


Searching for the phrase: 
sd{2,3}
['sddd', 'sddd', 'sdd', 'sddd']
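
The list of repetition forms also mentions the open-ended {m,}, which the patterns above do not exercise. Here is a minimal sketch that reuses the same test phrase and calls re.findall directly, so that it stands on its own:

```python
import re

test_phrase = 'sdsd..sssddd...sdddsdd...dsds...dsssss...sdddd'

# s followed by two or more d's, with no upper limit
at_least_two = re.findall('sd{2,}', test_phrase)
print(at_least_two)

# compare with the bounded form, which stops after three d's
two_to_three = re.findall('sd{2,3}', test_phrase)
print(two_to_three)
```

The two results differ only in the final match: with no maximum, the quantifier consumes all four d's of 'sdddd', while {2,3} stops after three.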




**Character Sets**

Character sets are used when you wish to match any one of a group of characters at a point in the input.

Brackets are used to construct character sets.

Example: the pattern [ab] matches a single occurrence of either a or b.

In [23]:
test_phrase = 'sdsd..sssddd...sdddsdd...dsds...dsssss...sdddd'

test_patterns = ['[sd]',  # either s or d
                 's[sd]+' # s followed by one or more s or d (adding repetition syntax to character sets)
                ]

multi_re_find(test_patterns, test_phrase)

Searching for the phrase: 
[sd]
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']


Searching for the phrase: 
s[sd]+
['sdsd', 'sssddd', 'sdddsdd', 'sds', 'sssss', 'sdddd']




In [24]:
# moving onto Exclusion

# we can use ^ to exclude characters by placing it at the start of the bracket notation

# example: [^...] will match any single character not in the brackets

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [25]:
# we use [^ ?!.] to match any character that is not a space, ?, ! or . and + to require one or more such characters in a row

re.findall('[^ ?!.]+', test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']
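
The test phrase asks how the punctuation can be removed. One possible answer, sketched here under the assumption that only '?', '!' and '.' count as punctuation, is re.sub, a standard re function that this lecture does not otherwise cover. Joining the findall tokens with spaces gives the same string for this phrase.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace every ?, ! or . with an empty string
cleaned = re.sub('[?!.]', '', test_phrase)

# Alternative: keep the runs of allowed characters and join them back together
rejoined = ' '.join(re.findall('[^ ?!.]+', test_phrase))

print(cleaned)
print(cleaned == rejoined)
```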

**Character ranges**

As character sets grow larger, typing all characters that should or should not be matched can be tedious.

A more compact format using character ranges lets you define a character set to include all contiguous characters between a start and stop point.

The format used is [start-end].

e.g. [a-f] matches any single lowercase letter from a through f.

In [26]:
test_phrase = 'This is an example sentence. Lets see if we can find some letters.'

test_patterns=['[a-z]+',      # sequences of lower case letters
               '[A-Z]+',      # sequences of upper case letters
               '[a-zA-Z]+',   # sequences of lower or upper case letters
               '[A-Z][a-z]+'] # one upper case letter followed by lower case letters
                
multi_re_find(test_patterns,test_phrase)

Searching for the phrase: 
[a-z]+
['his', 'is', 'an', 'example', 'sentence', 'ets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']


Searching for the phrase: 
[A-Z]+
['T', 'L']


Searching for the phrase: 
[a-zA-Z]+
['This', 'is', 'an', 'example', 'sentence', 'Lets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']


Searching for the phrase: 
[A-Z][a-z]+
['This', 'Lets']
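
Ranges are not limited to letters. As a brief sketch with a made-up phrase (not from the original notebook), the range [0-9] picks out digits, and several ranges can be combined in one set:

```python
import re

phrase = 'Order 66 shipped on day 12 of month 9'

# sequences of digits
numbers = re.findall('[0-9]+', phrase)

# sequences of letters or digits, combining three ranges in one set
tokens = re.findall('[A-Za-z0-9]+', phrase)

print(numbers)
print(tokens)
```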




**Escape codes**

You can use escape codes to find specific types of patterns in data (e.g. digits, whitespace, non-digits, and more)

Escapes are prefixed with a backslash \

digit = \d

non digit = \D

whitespace = \s

non whitespace = \S

alphanumeric (letters, digits, and underscore) = \w

non alphanumeric = \W

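We add an r before these patterns to make them raw strings, so Python passes the backslashes through to the regex engine unchanged instead of treating them as string escapes (without it, backslashes must be doubled, which is hard to read)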
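
To make the effect of the r prefix concrete, here is a small sketch (variable names are illustrative). A raw string keeps each backslash as a literal character, so r'\d+' is the same pattern as the doubled-backslash form '\\d+'.

```python
import re

raw_pattern = r'\d+'      # raw string: the backslash is kept as-is
escaped_pattern = '\\d+'  # normal string: the backslash has to be doubled

# Both spellings produce the same three-character pattern
print(raw_pattern == escaped_pattern)

# In a normal string '\n' is one newline character; in a raw string it is two characters
print(len('\n'), len(r'\n'))

digits = re.findall(raw_pattern, 'call 555 or 1234')
print(digits)
```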

In [27]:
test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag'

test_patterns=[ r'\d+', # sequence of digits
                r'\D+', # sequence of non-digits
                r'\s+', # sequence of whitespace
                r'\S+', # sequence of non-whitespace
                r'\w+', # sequence of alphanumeric characters (letters, digits, underscore)
                r'\W+', # sequence of non-alphanumeric characters
                ]

multi_re_find(test_patterns,test_phrase)

Searching for the phrase: 
\d+
['1233']


Searching for the phrase: 
\D+
['This is a string with some numbers ', ' and a symbol #hashtag']


Searching for the phrase: 
\s+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


Searching for the phrase: 
\S+
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', '#hashtag']


Searching for the phrase: 
\w+
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', 'hashtag']


Searching for the phrase: 
\W+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #']
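
Escape codes can be combined with literal characters and with each other. Here is a minimal sketch on the same test phrase; the two patterns are illustrative choices, not from the original notebook.

```python
import re

test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag'

# a literal '#' followed by one or more alphanumeric characters
hashtags = re.findall(r'#\w+', test_phrase)

# a word, one whitespace character, then a run of digits
word_then_number = re.findall(r'\w+\s\d+', test_phrase)

print(hashtags)
print(word_then_number)
```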


