# Regular Expressions (regex)

## Regular Expression Cheat Sheet

### Identifiers:
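\d = any digit (0-9)
<br>\D = anything but a digit
<br>\s = any whitespace character (space, tab, new line, etc.)
<br>\S = anything but a whitespace character
<br>\w = any word character (letter, digit, or underscore)
<br>\W = anything but a word character
<br>. = any character, except for a new line
<br>\b = word boundary (the zero-width position at the edge of a whole word)
<br>\. = a literal period. The backslash is required because . normally means any character.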
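
To make the identifiers above concrete, here is a minimal sketch. The sample sentence is made up purely for illustration; only standard `re` functions are used.

```python
import re

# Hypothetical sample sentence, chosen only for illustration
sample = "Unit_7 moved on to Boston in 2013."

digits = re.findall(r"\d+", sample)        # runs of digits
words = re.findall(r"\w+", sample)         # runs of letters, digits, and underscores
any_on = re.findall(r"on", sample)         # "on" anywhere, including inside "Boston"
whole_on = re.findall(r"\bon\b", sample)   # "on" only as a whole word

print(digits)    # ['7', '2013']
print(words)     # ['Unit_7', 'moved', 'on', 'to', 'Boston', 'in', '2013']
print(any_on)    # ['on', 'on']
print(whole_on)  # ['on']
```

Note that `\w` treats the underscore and the digit in `Unit_7` as word characters, and that `\b` consumes no characters: it only asserts a position.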

### Modifiers:
{1,3} = match 1 to 3 repetitions of the preceding item (for example, \d{1,3} matches 1-3 digits)
<br>+ = match 1 or more repetitions
<br>? = match 0 or 1 repetitions
<br>* = match 0 or more repetitions
<br>$ = matches at the end of the string
<br>^ = matches at the start of the string
<br>| = matches either/or. Example: x|y will match either x or y
<br>[] = character set: matches any one of the characters listed inside
<br>{x} = match exactly x repetitions of the preceding item
<br>{x,y} = match between x and y repetitions of the preceding item
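
The modifiers are easier to compare side by side. The following is a small sketch on made-up input strings, not an exhaustive tour:

```python
import re

plus = re.findall(r"ab+", "a ab abb")               # "a" followed by 1 or more "b"
star = re.findall(r"ab*", "a ab abb")               # "a" followed by 0 or more "b"
optional = re.findall(r"colou?r", "color colour")   # the "u" may appear 0 or 1 times
counted = re.findall(r"\d{1,3}", "7 42 2013")       # at most 3 digits per match
either = re.findall(r"cat|dog", "cat, dog, bird")   # alternation

starts = bool(re.search(r"^The", "The end"))        # anchored at the start
ends = bool(re.search(r"end$", "The end"))          # anchored at the end
not_start = bool(re.search(r"^end", "The end"))     # "end" is not at the start

print(plus)      # ['ab', 'abb']
print(star)      # ['a', 'ab', 'abb']
print(optional)  # ['color', 'colour']
print(counted)   # ['7', '42', '201', '3']
print(either)    # ['cat', 'dog']
print(starts, ends, not_start)  # True True False
```

The `counted` result shows that `{1,3}` caps each match at three digits, so a four-digit number is split into two matches.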

### White Space Characters:
\n = new line
<br>\s = any whitespace character (space, tab, new line, etc.)
<br>\t = tab
<br>\e = escape (common in other regex flavors, but not recognized by Python's re module; use \x1b instead)
<br>\f = form feed
<br>\r = carriage return

### Characters to REMEMBER TO ESCAPE IF USED:
. + * ? [ ] $ ^ ( ) { } | \
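
When the text to find contains these characters, `re.escape()` can add the backslashes automatically. A minimal sketch, with a made-up price and receipt string:

```python
import re

price = "$4.99"            # hypothetical literal text we want to find
receipt = "Total: $4.99"   # hypothetical string to search in

# Unescaped, "$" is an end-of-string anchor, so nothing can follow it
unescaped = re.search(price, receipt)

# Unescaped, "." matches any character, so "4x99" also matches
loose = re.search(r"4.99", "4x99")

# re.escape() backslash-escapes the special characters
escaped = re.search(re.escape(price), receipt)

print(unescaped)        # None
print(loose.group())    # 4x99
print(escaped.group())  # $4.99
```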

### Brackets:
[] = matches any one character from the set. Example: quant[ia]tative will match either quantitative or quantatative.
<br>[a-z] = matches any lowercase letter a-z
<br>[1-5a-qA-Z] = matches any digit 1-5, lowercase letter a-q, or uppercase letter A-Z
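
The bracket examples above can be checked directly. This is a small sketch; the input strings, and the negated set `[^...]` in the last line, are additions for illustration.

```python
import re

spellings = re.findall(r"quant[ia]tative",
                       "quantitative quantatative quantotative")
in_ranges = re.findall(r"[1-5a-qA-Z]", "a7Zr3")   # "7" and "r" fall outside the ranges
consonants = re.findall(r"[^aeiou\s]", "regex fun")  # ^ inside [] negates the set

print(spellings)   # ['quantitative', 'quantatative']
print(in_ranges)   # ['a', 'Z', '3']
print(consonants)  # ['r', 'g', 'x', 'f', 'n']
```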


# Imports

In [1]:
import re

# Some Useful Functions and Explanations

## re.match()
This function looks for the regular expression at the beginning of the string only. It returns a match object on success and None otherwise.

In [3]:
text = 'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.'

# This match comes back positive because 'The' is at the beginning of the string
print(re.match('The ', text))
# This match returns None because 'the' is not at the beginning of the string
print(re.match('the ', text))

<_sre.SRE_Match object; span=(0, 4), match='The '>
None
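
Because a failed match returns None, it is usually safer to test the result before using it. The sketch below uses the same sentence; the `re.IGNORECASE` variant is an addition for illustration.

```python
import re

text = 'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.'

m = re.match(r'The ', text)
if m:                    # skip this block when the match failed (m is None)
    matched = m.group()  # the matched text
    span = m.span()      # (start, end) positions
    print(matched)       # The 
    print(span)          # (0, 4)

# With re.IGNORECASE, the lowercase pattern also matches at the start
m_ci = re.match(r'the ', text, re.IGNORECASE)
print(m_ci.group())
```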


## re.search()
This function looks for the regular expression anywhere in the string and returns the first match.

In [4]:
text = 'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.'

# all of these return a match object
print(re.search('The', text))
print(re.search('the', text))
print(re.search('on', text))

<_sre.SRE_Match object; span=(0, 3), match='The'>
<_sre.SRE_Match object; span=(50, 53), match='the'>
<_sre.SRE_Match object; span=(25, 27), match='on'>


Notice that the match object also reports the position (span) of the matched text. Also notice that "on" appears in the string twice, but re.search() only returns the first occurrence.
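
To get the position of every occurrence rather than just the first, one option is `re.finditer()`, which yields a match object per occurrence. A minimal sketch on the same sentence:

```python
import re

text = 'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.'

# Collect the (start, end) span of each "on" in the string
spans = [m.span() for m in re.finditer(r'on', text)]
print(spans)  # [(25, 27), (68, 70)]
```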

## re.findall()
This function returns a list of all of the matches to the regular expression in the string.

In [5]:
example_text = 'Jessica is 15 years old, and Daniel is 27 years old. Edward is 97 years old, and his grandfather, Oscar, is 102.'

# finding all of the ages:
# this expression finds all of the numbers that are 1, 2, or 3 digits long
ages = re.findall(r'\d{1,3}', example_text)

# finding all of the names:
# this expression finds all of the words beginning with a capital letter A-Z,
# followed by any number of lowercase letters a-z
names = re.findall(r'[A-Z][a-z]*', example_text)

print(ages)
print(names)

['15', '27', '97', '102']
['Jessica', 'Daniel', 'Edward', 'Oscar']
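
When a pattern contains more than one capture group, `re.findall()` returns a list of tuples, which makes it possible to pair each name with its age. The pattern below is an illustrative sketch that assumes every age follows the wording "Name is N" (with an optional comma after the name), as in this example text.

```python
import re

example_text = 'Jessica is 15 years old, and Daniel is 27 years old. Edward is 97 years old, and his grandfather, Oscar, is 102.'

# Group 1 captures the name, group 2 captures the age
pairs = re.findall(r'([A-Z][a-z]+),? is (\d{1,3})', example_text)
ages_by_name = {name: int(age) for name, age in pairs}

print(pairs)
# [('Jessica', '15'), ('Daniel', '27'), ('Edward', '97'), ('Oscar', '102')]
print(ages_by_name)
```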


# Other Useful Functions

re.compile(pattern, flags) -> Compile pattern into a reusable regular expression object, with optional flags

re.split(pattern, string) -> Split string by occurrences of pattern

re.sub(pattern, str2, string) -> Replace leftmost non-overlapping occurrences of pattern in string with str2

re.fullmatch(pattern, string) -> Match pattern if whole string matches regular expression

re.findall(pattern, string) -> Return all non-overlapping matches of pattern in string, as a list of strings

re.finditer(pattern, string) -> Return an iterator yielding match objects over non-overlapping matches of pattern in string

re.subn(pattern, str2, string) -> Replace leftmost non-overlapping occurrences of pattern in string with str2, but return a tuple of (new string, number of substitutions made)

re.purge() -> Clear the regular expression cache
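
A short sketch of several of these functions in use. The date pattern and the small input strings are assumptions made for illustration; the date pattern is deliberately loose and does not validate real calendar dates.

```python
import re

text = 'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.'

# re.compile: build the pattern once, then reuse it
date_re = re.compile(r'\d{1,2}/\d{1,2}/\d{2,4}')
dates = date_re.findall(text)

# re.split: split on a comma followed by optional whitespace
parts = re.split(r',\s*', 'a, b,c')

# re.sub and re.subn: replace every digit with "#"
masked = re.sub(r'\d', '#', 'born on 10/6/10')
masked_n = re.subn(r'\d', '#', 'born on 10/6/10')

# re.fullmatch: the whole string must match, not just a prefix
is_five_digits = re.fullmatch(r'\d{5}', '90210') is not None
has_extra = re.fullmatch(r'\d{5}', '90210-1234') is not None

print(dates)     # ['1/23/2013', '10/6/10']
print(parts)     # ['a', 'b', 'c']
print(masked)    # born on ##/#/##
print(masked_n)  # ('born on ##/#/##', 5)
print(is_five_digits, has_extra)  # True False
```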


In [10]:
# remember the Chipotle exercise where we wanted to convert dollars to numeric
dollars = '$4.99 '
# there is a dollar sign at the start and a space at the end. Remove them (many ways to accomplish this)
dollars = re.sub(r'\$', '', dollars)
dollars = re.sub(r'\s', '', dollars)
dollars

'4.99'

In [14]:
# what if the numbers are entered inconsistently?
dollars = ' $ 4,999,910.01 '
# keep only digits and the decimal point
number = re.sub(r'[^0-9.]', '', dollars)
number

'4999910.01'
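
The cleaned result is still a string. One way to finish the conversion is to wrap the substitution in a small helper and pass the result to `float()`. The name `to_number` is hypothetical, and the sketch assumes non-negative amounts with at most one decimal point: a minus sign would be stripped, and an input with no digits would raise `ValueError`.

```python
import re

def to_number(raw):
    """Strip everything except digits and the decimal point, then convert to float."""
    cleaned = re.sub(r'[^0-9.]', '', raw)
    return float(cleaned)

big = to_number(' $ 4,999,910.01 ')
small = to_number('$4.99 ')

print(big)    # 4999910.01
print(small)  # 4.99
```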

### A good website with a cheat sheet and some practice exercises
https://www.shortcutfoo.com/app/dojos/python-regex/cheatsheet

#### Other Sources
https://pythonprogramming.net/regular-expressions-regex-tutorial-python-3/

https://github.com/rexdwyer/Splitsville/blob/master/Splitsville.ipynb