Regular expressions (regexes) are, when you understand them, one of the most fun things you can work with in programming. They are a mini-language for matching text.

The first thing to know is that non-special characters match themselves in text.

In [2]:
import re
help(re.search)

Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found.



In [3]:
re.search(r"e", "hello")

<_sre.SRE_Match object; span=(1, 2), match='e'>

In [4]:
re.search(r"l", "hello")

<_sre.SRE_Match object; span=(2, 3), match='l'>

You can match more than one letter, of course.

In [73]:
sentence = ("A symmetry of a pattern is -- loosely speaking -- a way of transforming "
            "the pattern so that the pattern looks exactly the same after the "
            "transformation.")

In [6]:
re.search(r"pattern", sentence)

<_sre.SRE_Match object; span=(16, 23), match='pattern'>

`re.search` gives us a match object that has many methods, but only finds the first match.

`re.findall` gives us a list of all matches.

In [7]:
re.findall(r"pattern", sentence)

['pattern', 'pattern', 'pattern']

In [8]:
re.findall(r"at", sentence)

['at', 'at', 'at', 'at', 'at']
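
As a rough side-by-side of the two functions (the string `text` below is made up for illustration and is not the `sentence` used in this notebook), a minimal sketch:

```python
import re

text = "the pattern repeats: pattern, pattern"  # hypothetical sample string

first = re.search(r"pattern", text)    # match object for the first hit only
every = re.findall(r"pattern", text)   # plain list of every matched string

print(first.span())                    # (4, 11)
print(every)                           # ['pattern', 'pattern', 'pattern']

# When nothing matches, search returns None and findall returns an empty list.
print(re.search(r"zebra", text))       # None
print(re.findall(r"zebra", text))      # []
```

Because `re.search` can return `None`, it is usually worth checking the result before calling methods on it.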

## Matching anything

The `.` (period) character matches anything (except a newline). We can use this to find strings that match wildcards, like "a double-o followed by any character."

In [9]:
re.search(r"oo.", sentence)

<_sre.SRE_Match object; span=(29, 32), match='oos'>

See how the match is "oos".
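
The newline exception is easy to overlook. Here is a small sketch of it; the string `sample` is made up, and `re.DOTALL` is a standard `re` flag that this notebook does not otherwise use:

```python
import re

sample = "ab\ncd"  # made-up two-line string

default = re.findall(r".", sample)             # the newline is skipped
dotall = re.findall(r".", sample, re.DOTALL)   # the newline is matched too

print(default)  # ['a', 'b', 'c', 'd']
print(dotall)   # ['a', 'b', '\n', 'c', 'd']
```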

In [10]:
re.findall(r".at.", sentence)

['patt', 'patt', 'hat ', 'patt', 'mati']

In [11]:
re.search(r"\.", sentence)

<_sre.SRE_Match object; span=(147, 148), match='.'>

In [53]:
a = [ "a", "b", "c" ]
   # 0    1    2   3 
# Like slices, spans count the boundaries between items: 0:1 covers only the first item.
a[0:1]

['a']

In [None]:
a = " h e l l o "
#    0 1 2 3 4 5  

In [12]:
# Case-insensitive matching
print(re.findall(r"h", "Hello there! How may I help you?"))
print(re.findall(r"h", "Hello there! How may I help you?", re.IGNORECASE))

['h', 'h']
['H', 'h', 'H', 'h']


## What can I do with a match object?

In [13]:
match = re.search("pattern", sentence)
help(match)

Help on SRE_Match object:

class SRE_Match(builtins.object)
 |  The result of re.match() and re.search().
 |  Match objects always have a boolean value of True.
 |  
 |  Methods defined here:
 |  
 |  __copy__(self, /)
 |  
 |  __deepcopy__(self, /, memo)
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  end(self, group=0, /)
 |      Return index of the end of the substring matched by group.
 |  
 |  expand(self, /, template)
 |      Return the string obtained by doing backslash substitution on the string template, as done by the sub() method.
 |  
 |  group(...)
 |      group([group1, ...]) -> str or tuple.
 |      Return subgroup(s) of the match by indices or names.
 |      For 0 returns the entire match.
 |  
 |  groupdict(self, /, default=None)
 |      Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name.
 |      
 |      default
 |        Is used for groups that did not participate in the match.
 |  
 |  groups(self, /, defau
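
In practice, a handful of these methods do most of the work. The sketch below uses a shortened, made-up sentence (`short_sentence`) so the indices are easy to check by hand:

```python
import re

short_sentence = "A symmetry of a pattern is a way of transforming the pattern."
match = re.search(r"pattern", short_sentence)

if match:  # re.search returns None when nothing matches
    print(match.group())   # the matched text: 'pattern'
    print(match.start())   # index where the match begins: 16
    print(match.end())     # index just past the match: 23
    print(match.span())    # both at once: (16, 23)
    # The span works as a slice of the original string.
    print(short_sentence[match.start():match.end()])  # 'pattern'
```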

## Start and end matches

You often want to match something if and only if it is at the beginning or end of a string.

`^` matches the beginning of a string.

`$` matches the end of a string.

In [14]:
re.search(r"^A ", sentence)

<_sre.SRE_Match object; span=(0, 2), match='A '>

In [15]:
print(re.search(r"^pattern", sentence))

None


If I want to match the end of this string, I have to match a period.

In [16]:
re.search(r"n.$", sentence)

<_sre.SRE_Match object; span=(146, 148), match='n.'>

In [17]:
re.search(r"n.$", "I like singing")

<_sre.SRE_Match object; span=(12, 14), match='ng'>

In [54]:
re.search(r"n.$", "I like singin'")

<_sre.SRE_Match object; span=(12, 14), match="n'">

In [56]:
print(re.search(r"^like", "I like singin'"))

None


What happened with `n.$`? `.` matches anything, so it happily matched the `g` and the `'` too. I have to _escape_ it to match just a period.

In [18]:
re.search(r"n\.$", sentence)

<_sre.SRE_Match object; span=(146, 148), match='n.'>

In [57]:
print(re.search(r"n\.$", "I like singing"))

None
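
One way these anchors might be used is to filter a list of strings by how they start or end. The `lines` list here is invented for the example:

```python
import re

lines = ["error: disk full", "warning: error ignored", "all good."]  # made-up data

# ^ pins the match to the start of the string.
starts_with_error = [line for line in lines if re.search(r"^error", line)]

# \. is a literal period, and $ pins it to the end of the string.
ends_with_period = [line for line in lines if re.search(r"\.$", line)]

print(starts_with_error)  # ['error: disk full']
print(ends_with_period)   # ['all good.']
```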


## Matching multiples

Often, you want to match something repeated a certain number of times. Whether it's 0 or more, 1 or more, 0 or 1, or some other count, there's a quantifier for it.

* `*` matches 0 or more.
* `+` matches 1 or more.
* `?` matches 0 or 1.
* `{n}` matches `n` repetitions.
* `{m,n}` matches `m` to `n` repetitions. You can leave out `m` or `n` to match 0 to `n`, or `m` to infinity.
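
To compare the quantifiers side by side, here is a sketch that tests each one against strings of zero to three `a`s. It uses `re.fullmatch`, a standard `re` function not otherwise used in this notebook, so that a pattern only counts if it matches the whole string. The names `candidates`, `patterns`, and `results` are illustrative.

```python
import re

candidates = ["", "a", "aa", "aaa"]
patterns = [r"a*", r"a+", r"a?", r"a{2}", r"a{1,2}", r"a{2,}", r"a{,2}"]

# For each pattern, collect the candidates it matches from start to end.
results = {
    pattern: [s for s in candidates if re.fullmatch(pattern, s)]
    for pattern in patterns
}

for pattern, matched in results.items():
    print(pattern, matched)
```

For instance, `a?` accepts only `""` and `"a"`, while `a{2,}` accepts `"aa"` and `"aaa"`.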

In [20]:
re.findall(r"o+", sentence)

['o', 'oo', 'o', 'o', 'o', 'oo', 'o', 'o']

In [58]:
re.findall(r"ng? ", sentence)

['n ', 'ng ', 'n ', 'n ']

In [61]:
no_a = "b"
one_a = "ab"
lots_of_a = "aaaaaaaaaaaab"
mixed_a = "clintonaaaaaaaaabpython"
mixed_b = "clinton b python"

In [62]:
print(re.search(r"a*b", no_a))
print(re.search(r"a*b", one_a))
print(re.search(r"a*b", lots_of_a))
print(re.search(r"a*b", mixed_a))
print(re.search(r"a*b", mixed_b))

<_sre.SRE_Match object; span=(0, 1), match='b'>
<_sre.SRE_Match object; span=(0, 2), match='ab'>
<_sre.SRE_Match object; span=(0, 13), match='aaaaaaaaaaaab'>
<_sre.SRE_Match object; span=(7, 17), match='aaaaaaaaab'>
<_sre.SRE_Match object; span=(8, 9), match='b'>


In [24]:
print(re.search(r"a+b", no_a))
print(re.search(r"a+b", one_a))
print(re.search(r"a+b", lots_of_a))

None
<_sre.SRE_Match object; span=(0, 2), match='ab'>
<_sre.SRE_Match object; span=(0, 13), match='aaaaaaaaaaaab'>


In [25]:
print(re.search("a?b", no_a))
print(re.search("a?b", one_a))
print(re.search("a?b", lots_of_a))

<_sre.SRE_Match object; span=(0, 1), match='b'>
<_sre.SRE_Match object; span=(0, 2), match='ab'>
<_sre.SRE_Match object; span=(11, 13), match='ab'>


In [63]:
def find_zero_or_one_as_and_a_b(string):
    for idx, letter in enumerate(string):
        if letter == "b":
            if idx == 0:
                return letter
            else:
                prev_letter = string[idx - 1]
                if prev_letter == "a":
                    return prev_letter + letter
                else:
                    return letter

In [64]:
print(find_zero_or_one_as_and_a_b(no_a))
print(find_zero_or_one_as_and_a_b(one_a))
print(find_zero_or_one_as_and_a_b(lots_of_a))

b
ab
ab


In [26]:
print(re.search("a{2}b", no_a))
print(re.search("a{2}b", one_a))
print(re.search("a{2}b", lots_of_a))

None
None
<_sre.SRE_Match object; span=(10, 13), match='aab'>


In [27]:
print(re.search("a{1,2}b", no_a))
print(re.search("a{1,2}b", one_a))
print(re.search("a{1,2}b", lots_of_a))

None
<_sre.SRE_Match object; span=(0, 2), match='ab'>
<_sre.SRE_Match object; span=(10, 13), match='aab'>


In [28]:
print(re.search("a{1,}b", no_a))
print(re.search("a{1,}b", one_a))
print(re.search("a{1,}b", lots_of_a))

None
<_sre.SRE_Match object; span=(0, 2), match='ab'>
<_sre.SRE_Match object; span=(0, 13), match='aaaaaaaaaaaab'>


In [29]:
print(re.search("a{,2}b", no_a))
print(re.search("a{,2}b", one_a))
print(re.search("a{,2}b", lots_of_a))

<_sre.SRE_Match object; span=(0, 1), match='b'>
<_sre.SRE_Match object; span=(0, 2), match='ab'>
<_sre.SRE_Match object; span=(10, 13), match='aab'>


In [30]:
re.findall(r"a?b", "ababb")

['ab', 'ab', 'b']

In [69]:
# Find 3 consecutive runs of one or more a's followed by a b
re.search(r"(a+b){3}", "abaaaabaab")

<_sre.SRE_Match object; span=(0, 10), match='abaaaabaab'>

In [70]:
re.findall(r"a+b", "abaaaabaab")

['ab', 'aaaab', 'aab']

## Matching sets of things

All the above is good, but not that useful by itself. Being able to match any one of a set of characters is super-useful.

We use square brackets to do this.

* `[abz]` will match an a, b, or z.
* `[A-Z]` matches a range of letters from A to Z.
* `[^A-Z]` matches anything that _isn't_ A to Z.
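
A quick sketch of the three forms on one made-up string (`word` is just an illustrative name):

```python
import re

word = "Crazy Zebra-42"  # hypothetical sample string

listed = re.findall(r"[abz]", word)         # any one of a, b, z (lowercase only)
uppercase = re.findall(r"[A-Z]", word)      # any one uppercase letter
non_letters = re.findall(r"[^A-Za-z]", word)  # anything that is not a letter

print(listed)       # ['a', 'z', 'b', 'a']
print(uppercase)    # ['C', 'Z']
print(non_letters)  # [' ', '-', '4', '2']
```

Note that `[abz]` is case-sensitive, so the capital `Z` in "Zebra" is not matched.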

In [32]:
# Get words three to five letters long
re.findall(r" [A-Za-z]{3,5} ", sentence)

[' way ', ' the ', ' that ', ' looks ', ' the ', ' after ']

In [33]:
# Find the first number in a string
re.search(r"[0-9]+", "I ate 130 ghost peppers")

<_sre.SRE_Match object; span=(6, 9), match='130'>

In [74]:
# Find all punctuation
re.findall(r"[\.,;?!]", sentence)

['.']

In [75]:
# or
re.findall(r"[^A-Za-z0-9 ]", sentence)

['-', '-', '-', '-', '.']

In [36]:
# Find a phone number
re.search(r"[0-9]{3}-[0-9]{3}-[0-9]{4}", "My phone number is 919-555-1212.")

<_sre.SRE_Match object; span=(19, 31), match='919-555-1212'>

In [77]:
import uuid
a_uuid = uuid.uuid4()
a_uuid

UUID('3a5324f2-26fb-481a-9def-f53a3c02dab3')

In [78]:
re.search(r"[0-9a-f\-]+", str(a_uuid))

<_sre.SRE_Match object; span=(0, 36), match='3a5324f2-26fb-481a-9def-f53a3c02dab3'>

## Character classes

That last match was pretty wordy. Luckily, we have something called _character classes_ for commonly used groups of characters.

* `\d` matches digits.
* `\D` matches _non_-digits.
* `\w` matches "word characters": basically `[a-zA-Z0-9_]`, plus all other valid Unicode characters that can be in words.
* `\W` matches _non_-word-characters.
* `\s` matches space characters -- `[ \t\n\r\f\v]`.
* `\S` matches non-space characters.
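
As a rough illustration, here are a few of these classes applied to one made-up log line (`log_line` is an assumption for the example, not data from this notebook):

```python
import re

log_line = "user_42 logged in\tat 10:35"  # hypothetical sample with a tab in it

digits = re.findall(r"\d+", log_line)   # runs of digits
words = re.findall(r"\w+", log_line)    # runs of word characters (underscore included)
chunks = re.findall(r"\S+", log_line)   # runs of anything that is not whitespace

print(digits)  # ['42', '10', '35']
print(words)   # ['user_42', 'logged', 'in', 'at', '10', '35']
print(chunks)  # ['user_42', 'logged', 'in', 'at', '10:35']
```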

In [37]:
# Find a phone number
re.search(r"\d{3}-\d{3}-\d{4}", "My phone number is 919-555-1212.")

<_sre.SRE_Match object; span=(19, 31), match='919-555-1212'>

In [38]:
# Find all punctuation
re.findall(r"[^\w\s]", sentence)

[',', ',', '.']

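There are a few odder ones:

* `\A` matches the beginning of the string. This is a lot like `^`, but with the `re.MULTILINE` flag `^` also matches at the start of each line, while `\A` still matches only at the very start.
* `\Z` matches the end of the string. This is a lot like `$`, but `$` also matches just before a trailing newline (and, with `re.MULTILINE`, at the end of each line), while `\Z` matches only at the very end.
* `\b` matches a word boundary. This means it matches an empty string at the beginning or end of a word.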
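
A small sketch of how these behave, using made-up strings (`animals` and `notes`) and the standard `re.MULTILINE` flag:

```python
import re

animals = "cat concatenate bobcat cat."
whole_words = re.findall(r"\bcat\b", animals)  # only "cat" standing alone
anywhere = re.findall(r"cat", animals)         # also inside longer words
print(whole_words)    # ['cat', 'cat']
print(len(anywhere))  # 4

notes = "first line\nsecond line\n"
print(re.findall(r"^\w+", notes))                  # ['first']
print(re.findall(r"^\w+", notes, re.MULTILINE))    # ['first', 'second']
print(re.findall(r"\A\w+", notes, re.MULTILINE))   # ['first']

# $ tolerates one trailing newline; \Z does not.
dollar = re.search(r"line$", notes)
strict = re.search(r"line\Z", notes)
print(dollar.span())  # (18, 22)
print(strict)         # None
```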

In [83]:
# Get words three to five letters long
re.findall(r"\b\w{3,5}\b", sentence)

['way', 'the', 'that', 'the', 'looks', 'the', 'same', 'after', 'the']

In [40]:
# Pick out email addresses
possible_emails = ["clinton", "clinton@dreisbach.us", "beanguy@example.org", 
                   "Email help@example.org for more information",
                   "terry@example.org", "@carmen", "what@what", "hi@example.org"]
[possibility 
 for possibility in possible_emails 
 if re.search(r"\A\w+@\w+\.\w{2,3}\Z", possibility)]

['clinton@dreisbach.us',
 'beanguy@example.org',
 'terry@example.org',
 'hi@example.org']

Note that a regex for emails is more complex than this. It's not that hard, though:

```
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
```

## Making complex regexes easier to read

You can add flags to regexes to make them operate differently. One of the most useful is the verbose flag, `re.VERBOSE`, which ignores most whitespace in the pattern and lets you add `#` comments.

In [41]:
email_regex = re.compile(r"""\A\w+      # The first part of the address
                             @
                             \w+\.      # The domain without the TLD
                             \w{2,3}\Z  # The TLD
                             """, re.VERBOSE)
email_regex.match("clinton@dreisbach.us")

<_sre.SRE_Match object; span=(0, 20), match='clinton@dreisbach.us'>

## Capturing matches

We often want to capture part of a match for later use. You can use parentheses to mark part of your regex as something you will capture.

In [42]:
# city and state
possibilities = ["Decatur, GA", "Wilkesboro, NC", "Seattle", "Wichita Falls, TX", "DC"]
for possibility in possibilities:
    match = re.search(r"^([\w\s]+), ([A-Z]{2})", possibility)
    if match:
        city, state = match.groups()
        print("City:", city, "| State:", state)

City: Decatur | State: GA
City: Wilkesboro | State: NC
City: Wichita Falls | State: TX


In [43]:
# Re-format phone numbers for later
phone_nums = ["999-555-1212", "(703) 555-9999", "800.555.7341", "3145558286"]
cleaned = []
for num in phone_nums:
    match = re.search(r"\(?(\d{3})\)?[-.]?\s*(\d{3})[-.]?(\d{4})", num)
    cleaned.append("{}-{}-{}".format(*match.groups()))
print(cleaned)

['999-555-1212', '703-555-9999', '800-555-7341', '314-555-8286']
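
Groups can also be read one at a time, or given names with `(?P<name>...)`. A minimal sketch, reusing the city-and-state idea with names chosen here for illustration (`city`, `state`):

```python
import re

match = re.search(r"^(?P<city>[\w\s]+), (?P<state>[A-Z]{2})$", "Wichita Falls, TX")

print(match.group(0))        # the whole match: 'Wichita Falls, TX'
print(match.group(1))        # first group by number: 'Wichita Falls'
print(match.group("state"))  # a group by name: 'TX'
print(match.groupdict())     # {'city': 'Wichita Falls', 'state': 'TX'}
```

Named groups tend to pay off once a pattern has more than two or three groups, since the code reading the match no longer depends on their order.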


## Non-capturing group

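Use `(?:...)` to group part of a regex without capturing it. This is handy when you need the grouping (for example, to make several characters optional together) but don't want an extra entry in `match.groups()`.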

In [44]:
phone_num_with_possible_area_code = r"(?:\(?(\d{3})\)?[-.]?\s*)?(\d{3})[\-\.]?(\d{4})"
phone_nums = ["999-555-1212", "(703) 555-9999", "800.555.7341", "3145558286", "555-1212"]
cleaned = []
for num in phone_nums:
    match = re.search(phone_num_with_possible_area_code, num)
    cleaned.append("{}-{}-{}".format(*match.groups()))
print(cleaned)

['999-555-1212', '703-555-9999', '800-555-7341', '314-555-8286', 'None-555-1212']
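
To isolate the difference, here is a sketch that runs the same pattern with and without `?:` on a made-up string. Both versions match the same text, and only the captured groups differ:

```python
import re

capturing = re.search(r"(ab)+(c)", "ababc")
non_capturing = re.search(r"(?:ab)+(c)", "ababc")

print(capturing.group(0), non_capturing.group(0))  # ababc ababc
print(capturing.groups())      # ('ab', 'c')
print(non_capturing.groups())  # ('c',)
```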


## Scratching the surface

This is just the beginning with regular expressions. You can go really deep down this hole.

* [Python regex docs](https://docs.python.org/3/library/re.html)
* [Regexr](http://www.regexr.com/)
* [Regex One](http://regexone.com/)
* [Regular-Expressions.info](http://www.regular-expressions.info/)


In [46]:
# Pick out email addresses
possible_emails = ["clinton", "clinton@dreisbach.us", "beanguy@example.org", 
                   "Email help@example.org for more information",
                   "terry@example.org", "@carmen", "what@what", "hi@example.org"]
emails = []
for possibility in possible_emails:
    match = re.search(r"\w+@\w+\.\w{2,3}", possibility)
    if match:
        emails.append(match.group(0))
emails

['clinton@dreisbach.us',
 'beanguy@example.org',
 'help@example.org',
 'terry@example.org',
 'hi@example.org']

In [47]:
for match in re.finditer(r"a*b", "ccccabaabcccaaaaaababccb"):
    print(match)

<_sre.SRE_Match object; span=(4, 6), match='ab'>
<_sre.SRE_Match object; span=(6, 9), match='aab'>
<_sre.SRE_Match object; span=(12, 19), match='aaaaaab'>
<_sre.SRE_Match object; span=(19, 21), match='ab'>
<_sre.SRE_Match object; span=(23, 24), match='b'>


In [48]:
phone_nums = """456-111-4567
(919) 444-9721
(123) 456 7890
313.424.5353
1-800-987-2345
+1 (424) 979-3333
555-1212"""

phone_nums = phone_nums.split("\n")
phone_nums

['456-111-4567',
 '(919) 444-9721',
 '(123) 456 7890',
 '313.424.5353',
 '1-800-987-2345',
 '+1 (424) 979-3333',
 '555-1212']

In [49]:
phone_num_regex = r"(?:\(?(\d{3})\)?[\-\.]?\s*)?(\d{3})[\-\.]?\s*(\d{4})"

In [50]:
default_area_code = "919"
for num in phone_nums:
    match = re.search(phone_num_regex, num)
    if match:
        area_code, prefix, suffix = match.groups()
        if area_code is None:
            area_code = default_area_code
        print("{}\t{}-{}-{}".format(num, area_code, prefix, suffix))

456-111-4567	456-111-4567
(919) 444-9721	919-444-9721
(123) 456 7890	123-456-7890
313.424.5353	313-424-5353
1-800-987-2345	800-987-2345
+1 (424) 979-3333	424-979-3333
555-1212	919-555-1212


In [51]:
date_str = """9/4/1976
09/30/77
20111103
Nov 30, 2014
5 Oct 1995
1999-10-04"""

dates = date_str.split("\n")

In [52]:
def extract_date(date_str):
    date_regex = [r"(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4}|\d{2})",
                  r"(?P<year>\d{4})-?(?P<month>\d{2})-?(?P<day>\d{2})",
                  r"(?P<day>\d{1,2})\s*(?P<month>[A-Za-z]{3})\s*(?P<year>\d{4})",
                  r"(?P<month>[A-Za-z]{3})\s*(?P<day>\d{1,2})\s*,?\s*(?P<year>\d{4})"]
    
    for regex in date_regex:
        match = re.match(regex, date_str)
        if match:
            return match
        
def clean_date(year, month, day):
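    # Map three-letter month abbreviations to month numbers
    months = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
              "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}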

    try:
        month = int(month)
    except ValueError:
        month = months[month]
    day = int(day)
    year = int(year)
    if year < 15:
        year += 2000
    elif year < 100:
        year += 1900
    
    return {"year": year, "month": month, "day": day}
        

for date in dates:
    match = extract_date(date)
    if match:
        ddict = match.groupdict()
        ddict = clean_date(**ddict)
        ddict['orig'] = date
            
        print("{orig}\t{month:02d}/{day:02d}/{year:d}".format(**ddict))

9/4/1976	09/04/1976
09/30/77	09/30/1977
20111103	11/03/2011
Nov 30, 2014	11/30/2014
5 Oct 1995	10/05/1995
1999-10-04	10/04/1999
