In [2]:
import re

# Regex

A regular expression is a special sequence of characters that describes a pattern of text to be found, or matched, in a string or document. By matching text, we can identify how often and where certain pieces of text occur, and replace or update those pieces of text if needed.
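
The three tasks named above (counting, locating, and replacing) can be sketched with `re.findall`, `re.finditer`, and `re.sub` from Python's standard library. The sample sentence and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "bananas today, bananas tomorrow"

# How often: findall returns every non-overlapping match.
count = len(re.findall("bananas", sample))

# Where: finditer yields match objects that carry start/end offsets.
positions = [m.span() for m in re.finditer("bananas", sample)]

# Replace: sub returns a new string with each match substituted.
replaced = re.sub("bananas", "mangoes", sample)

print(count)      # 2
print(positions)  # [(0, 7), (15, 22)]
print(replaced)   # mangoes today, mangoes tomorrow
```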

## Literals

This is where our regular expression contains the exact text that we want to match. The regex `a`, for example, will match the text `a`, and the regex `bananas` will match the text `bananas`.

In [3]:
text = 'Monkeys love bananas'
regex = re.findall('bananas',text)
display(regex)

['bananas']

## Alternation

Alternation, performed in regular expressions with the pipe symbol `|`, allows us to match either the characters preceding the `|` OR the characters after the `|`.

In [68]:
words = "cat gorilla bananas monkey"
regex = re.findall('bananas|cat',words)
display(regex)

['cat', 'bananas']

## Character Sets

Character sets, denoted by a pair of brackets `[]`, let us match one character from a series of characters, allowing for matches with incorrect or different spellings.

The regex `con[sc]en[sc]us` will match `consensus`, the correct spelling of the word, but also match the following three incorrect spellings: `concensus`, `consencus`, and `concencus`. The letters inside the first brackets, `s` and `c`, are the different possibilities for the character that comes after `con` and before `en`. Similarly, for the second brackets, `s` and `c` are the different possibilities for the character that comes after `en` and before `us`.

We can make our character sets even more powerful with the help of the caret `^` symbol. Placed at the front of a character set, the `^` negates the set, matching any character that is not listed. These are called negated character sets. Thus the regex `[^cat]` will match any single character that is not `c`, `a`, or `t`, and would completely match each of the characters `d`, `o` and `g`.

In [6]:
#"concensus" is a misspelling of CONSENSUS; the character sets accept both spellings
word = "concensus"
regex = re.search('con[sc]en[sc]us',word)
display(regex)

<re.Match object; span=(0, 9), match='concensus'>
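
The negated character set described above has no example of its own, so here is a small sketch. The input strings are made up for illustration. Note that a negated set also matches characters such as spaces, and that the caret only negates when it comes first inside the brackets.

```python
import re

# [^cat] matches any single character other than 'c', 'a', or 't'.
# The space between the words is also "not c, a, or t", so it matches too.
negated = re.findall("[^cat]", "cat dog")

# When the caret is not the first character in the set, it is a literal '^'.
literal_caret = re.findall("[a^]", "a^b")

print(negated)        # [' ', 'd', 'o', 'g']
print(literal_caret)  # ['a', '^']
```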

## Wild for Wildcards

Sometimes we don’t care exactly WHAT characters are in a text, just that there are SOME characters. Enter the wildcard `.`! A wildcard matches any single character (letter, number, symbol or whitespace) in a piece of text; in Python's `re` module the one exception is the newline character, which `.` matches only when the `re.DOTALL` flag is set. **Wildcards are useful when we do not care about the specific value of a character, but only that a character exists!**

Let’s say we want to match any 9-character piece of text. The regex `.........` will completely match `orangutan` and `marsupial`! Similarly, the regex `I ate . bananas` will completely match both `I ate 3 bananas` and `I ate 8 bananas`!

In [12]:
words = "monkeys are amazing"
regex = re.search('.......', words)
display(regex)

<re.Match object; span=(0, 7), match='monkeys'>
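
To check the "completely match" claims above, this sketch uses `re.fullmatch`, which succeeds only when the whole string fits the pattern. The test strings beyond those in the text (such as `I ate 12 bananas`) are illustrative additions.

```python
import re

pattern = "I ate . bananas"

three = re.fullmatch(pattern, "I ate 3 bananas")    # one character: match
eight = re.fullmatch(pattern, "I ate 8 bananas")    # one character: match
twelve = re.fullmatch(pattern, "I ate 12 bananas")  # two characters: None

# Nine wildcards completely match any nine-character word.
nine = re.fullmatch(".........", "orangutan")

# By default '.' does not match a newline; re.DOTALL lifts that restriction.
newline_default = re.fullmatch(".", "\n")
newline_dotall = re.fullmatch(".", "\n", flags=re.DOTALL)

print(three is not None, eight is not None, twelve is None)  # True True True
print(nine.group())                                          # orangutan
print(newline_default is None, newline_dotall is not None)   # True True
```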

## Ranges

Ranges allow us to specify a range of characters in which we can make a match without having to type out each individual character. The regex `[abc]`, which matches any one of the characters `a`, `b`, or `c`, is equivalent to the regex range `[a-c]`. The `-` character allows us to specify that we are interested in matching a range of characters.

In [72]:
#match only the words cub,dog and elk
match_words = 'cub dog elk ape cow ewe'
regex = re.findall('[c-e][l-u][b-k]',match_words)
display(regex)

['cub', 'dog', 'elk']
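
The equivalence of `[abc]` and `[a-c]` stated above can be checked directly. The last line goes one step beyond the text and shows that several ranges can sit in the same set; the sample strings are illustrative.

```python
import re

text = "abcdef"

explicit = re.findall("[abc]", text)  # each character listed by hand
ranged = re.findall("[a-c]", text)    # the same set written as a range

# Several ranges can share one pair of brackets.
combined = re.findall("[a-cx-z0-2]", "a1 b9 zq")

print(explicit)            # ['a', 'b', 'c']
print(explicit == ranged)  # True
print(combined)            # ['a', '1', 'b', 'z']
```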

## Shorthand Character Classes

* **\w**: the “word character” class represents the regex range `[A-Za-z0-9_]`, and it matches a single uppercase character, lowercase character, digit or underscore


* **\d**: the “digit character” class represents the regex range `[0-9]`, and it matches a single digit character


* **\s**: the “whitespace character” class represents the regex range `[ \t\r\n\f\v]`, matching a single space, tab, carriage return, line break, form feed, or vertical tab


For example, the regex `\d\s\w\w\w\w\w\w\w` matches a **digit character**, followed by a **whitespace character**, followed by **7 word characters**. Thus the regex completely matches the text `3 monkeys`.

In [74]:
#Match these: 5 sloths ,8 llamas,7 hyenas 
#Don't match these: one bird , two owls

text = '5 sloths one bird 8 llamas two owls 7 hyenas'
# Raw string (r'...') keeps the backslashes intact for the regex engine
regex = re.findall(r'\d\s\w\w\w\w\w\w',text)
display(regex)

['5 sloths', '8 llamas', '7 hyenas']
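
As a sketch of the `3 monkeys` example above, the pattern is written as a raw string so that Python passes the backslashes through unchanged. One caveat worth knowing: for ordinary `str` patterns in Python 3, the shorthand classes are Unicode-aware by default, so the ASCII ranges listed above hold exactly only when the `re.ASCII` flag is set. The Arabic-Indic digit used here is just an illustrative non-ASCII digit.

```python
import re

pattern = r"\d\s\w\w\w\w\w\w\w"

# One digit, one whitespace character, seven word characters.
monkeys = re.fullmatch(pattern, "3 monkeys")

# "\u0663" is ARABIC-INDIC DIGIT THREE, a digit outside [0-9].
unicode_digit = re.fullmatch(r"\d", "\u0663")               # matches by default
ascii_digit = re.fullmatch(r"\d", "\u0663", flags=re.ASCII)  # None under re.ASCII

print(monkeys.group())            # 3 monkeys
print(unicode_digit is not None)  # True
print(ascii_digit is None)        # True
```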

## Grouping

Grouping, denoted with the open parenthesis `(` and the closing parenthesis `)`, lets us group parts of a regular expression together, and allows us to limit alternation to part of the regex.

The regex `I love (baboons|gorillas)` will match the text `I love` and then match either `baboons` or `gorillas`, as the grouping limits the reach of the `|` to the text within the parentheses.

In [13]:
#Match these: puppies are my favorite!, kitty cats are my favorite!
#Don't match these: deer are my favorite!,otters are my favorite!, hedgehogs are my favorite!

text = 'puppies are my favorite! deer are my favorite! kitty cats are my favorite! otters are my favorite! hedgehogs are my favorite!'
# With one capturing group, findall returns only the group's text, not the whole match
regex = re.findall('(puppies|kitty cats) are my favorite',text)
display(regex)

['puppies', 'kitty cats']
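
To see how grouping limits the reach of `|`, this sketch compares the grouped regex from the text with an ungrouped version. It also shows the non-capturing form `(?:...)`, which is not covered above but is a common way to make `re.findall` return whole matches instead of group contents. The sample sentence is made up.

```python
import re

sentence = "I love baboons and I love gorillas"

# Without parentheses, the alternatives are "I love baboons" and "gorillas".
ungrouped = re.findall("I love baboons|gorillas", sentence)

# With a capturing group, findall returns only what the group captured.
captured = re.findall("I love (baboons|gorillas)", sentence)

# A non-capturing group limits the alternation but returns the full match.
whole = re.findall("I love (?:baboons|gorillas)", sentence)

print(ungrouped)  # ['I love baboons', 'gorillas']
print(captured)   # ['baboons', 'gorillas']
print(whole)      # ['I love baboons', 'I love gorillas']
```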

## Quantifiers - Fixed

Fixed quantifiers, denoted with curly braces `{}`, let us indicate the exact quantity of a character we wish to match, or allow us to provide a quantity range to match on.

* `\w{3}` will match exactly **3 word characters**


* `\w{4,7}` will match at **minimum 4 word characters** and at **maximum 7 word characters**

The regex `roa{3}r` will match the characters `ro` followed by 3 `a`s, and then the character `r`, such as in the text `roaaar`. The regex `roa{3,7}r` will match the characters `ro` followed by at least 3 and at most 7 `a`s, followed by an `r`, matching the strings `roaaar`, `roaaaaar` and `roaaaaaaar`.

An important note is that quantifiers are greedy: **they match as many characters as they possibly can**. For example, the regex `mo{2,4}` will match the text `moooo` in the string `moooo`, and will not return a match of `moo` or `mooo`. **This is because the fixed quantifier wants to match the largest possible number of `o`s, which is 4 in the string `moooo`.**

In [87]:
#Match these: squeaaak,squeaaaak,squeaaaaak
#Don't match these: squeak,squeaak,squeaaaaaak

text = 'squeaaak squeak squeaaaak squeaak squeaaaaak squeaaaaaak'
regex = re.findall('squea{3,5}k',text)
display(regex)

['squeaaak', 'squeaaaak', 'squeaaaaak']
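
The `roa{3,7}r` range and the greedy behaviour described above can be verified with a short sketch. The test words are built programmatically so the number of `a`s is explicit. The lazy form `{2,4}?` at the end is an addition that is not covered in the text; it is shown only as a contrast to greedy matching.

```python
import re

# Try words with 2 to 8 'a's and keep the counts that fully match.
matching_counts = [
    n for n in range(2, 9)
    if re.fullmatch("roa{3,7}r", "ro" + "a" * n + "r")
]

# Greedy: the quantifier takes as many 'o's as it can, up to 4.
greedy = re.search("mo{2,4}", "moooo").group()
capped = re.search("mo{2,4}", "mooooo").group()  # 5 'o's available, 4 taken

# Lazy (note the trailing '?'): takes as few 'o's as allowed, here 2.
lazy = re.search("mo{2,4}?", "moooo").group()

print(matching_counts)  # [3, 4, 5, 6, 7]
print(greedy)           # moooo
print(capped)           # moooo
print(lazy)             # moo
```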

## Quantifiers - Optional

Optional quantifiers, indicated by the question mark `?`, allow us to indicate a character in a regex is optional, or can appear either 0 times or 1 time. For example, the regex `humou?r` matches the characters `humo`, then either 0 occurrences or 1 occurrence of the letter `u`, and finally the letter `r`. Note the `?` only applies to the character directly **before** it.

With all quantifiers, we can take advantage of grouping to make even more advanced regexes. The regex `The monkey ate a (rotten )?banana` will completely match both `The monkey ate a rotten banana` and `The monkey ate a banana`.

In [21]:
#Match these: 1 duck for adoption? ,5 ducks for adoption? ,7 ducks for adoption?

text = ' 1 duck for adoption? 5 ducks for adoption? 7 ducks for adoption?'
# Raw string keeps \d and the escaped literal \? intact for the regex engine
regex = re.findall(r'\d ducks? for adoption\?',text)
display(regex)

['1 duck for adoption?', '5 ducks for adoption?', '7 ducks for adoption?']
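
The `humou?r` and `(rotten )?` examples above can be sketched as follows. The misspelling `humouur` is an illustrative addition showing that `?` allows at most one occurrence.

```python
import re

# '?' makes only the preceding 'u' optional.
spellings = re.findall("humou?r", "humor humour humouur")

# Applied to a group, '?' makes the whole group optional.
pattern = "The monkey ate a (rotten )?banana"
with_group = re.fullmatch(pattern, "The monkey ate a rotten banana")
without_group = re.fullmatch(pattern, "The monkey ate a banana")

print(spellings)               # ['humor', 'humour']
print(with_group.group(1))     # 'rotten ' (with the trailing space)
print(without_group.group(1))  # None, because the group did not take part
```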

## Quantifiers - 0 or More, 1 or More

The Kleene star, denoted with the asterisk `*`, is also a quantifier, and matches the preceding character 0 or more times. This means that **the character doesn’t need to appear, can appear once, or can appear many many times**.

The regex `meo*w` will match the characters `me`, followed by 0 or more `o`s, followed by a `w`. Thus the regex will match `mew`, `meow`, `meooow`, and `meoooooooooooow`.

Another useful quantifier is the Kleene plus, denoted by the plus `+`, which **matches the preceding character 1 or more times.**

The regex `meo+w` will match the characters `me`, followed by 1 or more `o`s, followed by a `w`. Thus the regex will match `meow`, `meooow`, and `meoooooooooooow`, but not match `mew`.

In [23]:
#Match these: hoot hoooooot hooooooooooot
#Don't match these: hot hoat hoo
text = ' hoot hoooooot hooooooooooot hot hoat hoo'
regex = re.findall('hoo+t',text)
display(regex)

['hoot', 'hoooooot', 'hooooooooooot']
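
The difference between the Kleene star and the Kleene plus comes down to whether zero occurrences are allowed. A minimal sketch using the `meo*w` and `meo+w` examples from the text:

```python
import re

sounds = ["mew", "meow", "meooow", "meoooooooooooow"]

# '*' accepts zero or more 'o's, so "mew" is included.
star = [s for s in sounds if re.fullmatch("meo*w", s)]

# '+' requires at least one 'o', so "mew" is excluded.
plus = [s for s in sounds if re.fullmatch("meo+w", s)]

print(star)  # ['mew', 'meow', 'meooow', 'meoooooooooooow']
print(plus)  # ['meow', 'meooow', 'meoooooooooooow']
```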

## Anchors

The anchors hat `^` and dollar sign `$` are used to match text at the start and the end of a string, respectively.

The regex `^Monkeys: my mortal enemy$` will completely match the text `Monkeys: my mortal enemy` but will not match `Spider Monkeys: my mortal enemy in the wild` or `Squirrel Monkeys: my mortal enemy in the wild`. **The `^` ensures that the matched text begins with `Monkeys`, and the `$` ensures that the matched text ends with `enemy`.**

Without the anchors, the regex `Monkeys: my mortal enemy` will match the text `Monkeys: my mortal enemy` in both `Spider Monkeys: my mortal enemy in the wild` and `Squirrel Monkeys: my mortal enemy in the wild`.



In [24]:
#Match this: penguins are cooler than regular expressions
#Don't match these: king penguins are cooler than regular expressions, penguins are cooler than regular expressions!

text = "king penguins are cooler than regular expressions penguins are cooler than regular expressions penguins are cooler than regular expressions!"

regex = re.findall('^penguins are cooler than regular expressions$',text)
display(regex)


[]
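
The empty list above is expected: the text neither starts with `penguins` nor ends with `expressions`, so the anchored regex cannot match. The contrast between anchored and unanchored patterns can be sketched with the `Monkeys` example from the text:

```python
import re

anchored = "^Monkeys: my mortal enemy$"
unanchored = "Monkeys: my mortal enemy"

exact = "Monkeys: my mortal enemy"
longer = "Spider Monkeys: my mortal enemy in the wild"

# The anchored regex needs the pattern to span the whole string.
anchored_exact = re.search(anchored, exact)
anchored_longer = re.search(anchored, longer)

# The unanchored regex can match anywhere inside the string.
unanchored_longer = re.search(unanchored, longer)

print(anchored_exact is not None)  # True
print(anchored_longer is None)     # True
print(unanchored_longer.group())   # Monkeys: my mortal enemy
```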