# Intro to Regular Expressions

Regular expressions are a language for describing patterns in text.

Regular expressions are available in every major programming language.

The syntax differs slightly between languages, but the core features are largely the same.

In Python we use regular expressions through the `re` module, which is part of the standard library.

For this demo, we will focus on the `search` function, which looks for the pattern anywhere in a string. It returns a match object describing the location of the first match, or `None` if the pattern does not match.
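To make the return value concrete, here is a minimal sketch of what `search` gives back. The sample strings and variable names are illustrative only.

```python
from re import search

# search returns a match object for the first match, or None if there is none
match = search(r'cat', 'a cat went home')

matched_text = match.group()   # the text that matched
start, end = match.span()      # start (inclusive) and end (exclusive) offsets

no_match = search(r'cat', 'a dog went home')
```

The cells below only check whether the result is `None`, which is enough to tell whether the pattern matched.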

In [2]:
from re import search

In [3]:
# Basic letters and numbers are valid regular expressions

assert search(r'cat', 'a cat went home') != None

In [4]:
assert search(r'cat', 'a dog went home') == None

In [5]:
# You can include "optional" letters by following the letter with 
# a question mark: 

assert search(r'cats?', 'a cat went home') != None
assert search(r'cats?', 'cats went home') != None

In [6]:
# You can include a search for "one or more" with the +
# For example, let's assume we want only to match an 
# exclamation of "cat" that ends with one or more
# exclamation points: 

assert search(r'cats?!+', 'a cat went home') == None
assert search(r'cats?!+', 'a cat!') != None
assert search(r'cats?!+', 'a cat!!!') != None
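The assertions above only show that a match exists. As a small sketch, calling `.group()` on the match object shows which text the `?` and `+` quantifiers actually consumed. The sample strings are made up for illustration.

```python
from re import search

# ? makes the preceding "s" optional; + requires at least one "!"
singular = search(r'cats?!+', 'a cat!!!').group()
plural = search(r'cats?!+', 'two cats! here').group()
```

Here `singular` is `'cat!!!'` and `plural` is `'cats!'`: the `+` takes as many exclamation points as it can find.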

In [7]:
# Note that our example isn't only matching the
# word "cat":

assert search(r'cat', 'a category') != None

In [8]:
# We can build up a pattern of characters and spaces.
# The \s expression matches any whitespace character
# (space, tab, newline). Backslashes in regular expressions
# introduce special sequences such as \s:

assert search(r'\scat\s', 'a category') == None
assert search(r'\scat\s', 'a cat went home') != None

In [9]:
# But now our expression doesn't match the following: 

assert search(r'\scat\s', 'a cat') == None
assert search(r'\scat\s', 'a cat.') == None
assert search(r'\scat\s', 'cat') == None

# which seems problematic

In [10]:
# There is another special character, \b, 
# which stands for "word boundary".
# It is very powerful for this common scenario: 

assert search(r'\bcat\b', 'a cat') != None
assert search(r'\bcat\b', 'a cat.') != None
assert search(r'\bcat\b', 'cat') != None
assert search(r'\bcat\b', 'a cat went home') != None
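One way to see why `\b` behaves differently from `\s` is to compare the matched text. This is a rough sketch with illustrative variable names: `\s` consumes a character, while `\b` only marks a position between a word character and a non-word character (or the edge of the string).

```python
from re import search

# \s consumes the surrounding whitespace, so it is part of the match
with_spaces = search(r'\scat\s', 'a cat went home').group()

# \b matches a position, not a character, so only "cat" is returned
with_boundaries = search(r'\bcat\b', 'a cat went home').group()

# No boundary sits between "cat" and "egory", so this finds nothing
inside_word = search(r'\bcat\b', 'a category')
```

`with_spaces` is `' cat '` (spaces included), `with_boundaries` is `'cat'`, and `inside_word` is `None`.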

In [11]:
# Another useful special character is the \w
# character. It matches any "word character" which 
# refers to, basically, letters, numbers and underscores
# This can be used, for example, to find hashtags: 

assert search(r'#\w+', 'a #cat') != None
assert search(r'#\w+', 'a #@home') == None
assert search(r'#\w+', 'a #') == None

In [12]:
# We can also negate a set of characters by placing ^ first inside
# square brackets []. For example, [^\s] matches anything that's
# NOT a whitespace character.
# Note: ^ only negates inside brackets; outside them it has a
# different meaning (it anchors the match to the start of the string).

assert search(r'#[^\s]+', 'a #cat') != None
assert search(r'#[^\s]+', 'a #c@t') != None
assert search(r'#[^\s]+', 'a #@home') != None
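Because `^` means something different outside square brackets, a short side-by-side sketch may help. The strings and variable names are illustrative assumptions.

```python
from re import search

# Inside brackets, a leading ^ negates the character class
hashtag = search(r'#[^\s]+', 'a #c@t went home').group()

# Outside brackets, ^ anchors the pattern to the start of the string
starts_with_cat = search(r'^cat', 'cat went home')
not_at_start = search(r'^cat', 'a cat went home')
```

`hashtag` is `'#c@t'`, `starts_with_cat` is a match object, and `not_at_start` is `None`.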

In [13]:
# You can use a logical or with "|"

assert search('cat|dog', 'a dog went home') != None
assert search('cat|dog', 'a cat went home') != None

In [14]:
# You can search for digits with \d.
# Use a raw string (r'...') so Python does not treat
# the backslash as a string escape:

assert search(r'\d', 'foo1bar') != None
assert search(r'\d', 'foobar') == None
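A single `\d` matches exactly one digit. Combining it with the `+` quantifier from earlier picks up a whole number, as this small sketch with made-up input shows.

```python
from re import search

# \d matches a single digit, so only the first digit is returned
first_digit = search(r'\d', 'room 42 is free').group()

# \d+ matches a run of one or more digits
whole_number = search(r'\d+', 'room 42 is free').group()
```

`first_digit` is `'4'` and `whole_number` is `'42'`.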

## The `sub` function

The `search` function is especially useful for showing how regular expressions work. The `sub` function, however, is one of the most useful for text preprocessing.

"Sub" stands for **substitution**. It is used to substitute some types of characters for others, or to remove some types of characters altogether!

The function is used as such: 

```python
sub(pattern, replacement, input_string)
```

Here `pattern` is a regular expression, `replacement` is the substitution, and `input_string` is the string in which you'd like to replace all occurrences of the pattern with the substitution. The function returns a new string and leaves `input_string` unchanged.

Let's look at some examples.

In [17]:
from re import sub

In [23]:
assert sub('cat', '', 'my cat likes cats') == 'my  likes s'
assert sub('cats*', '', 'my cat likes cats') == 'my  likes '
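The second assertion uses `*`, which has not been introduced yet: it means "zero or more" of the preceding character. A quick sketch comparing it with `?` (zero or one), using a made-up input string:

```python
from re import sub

text = 'cat cats catsss'

# * removes "cat" followed by any number of "s" characters
star_result = sub(r'cats*', '', text)

# ? removes "cat" followed by at most one "s", leaving the extra "ss"
question_result = sub(r'cats?', '', text)
```

`star_result` is two spaces, while `question_result` is two spaces followed by `'ss'`.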

In [30]:
# We can collapse runs of whitespace (spaces, tabs, newlines)
# into a single space with the \s+ pattern:

assert sub(r'\s+', ' ', 'this   is\n an  annoying\tstring') == 'this is an annoying string'

In [34]:
# We can remove all non-word characters (anything other than
# letters, digits and underscores) from a word:

assert sub(r'[^\w]', '', "cat's") == 'cats'

In [40]:
sub(r'[^\w\s]', '', "the cat's meow. 100!! Cool!")

'the cats meow 100 Cool'

In [41]:
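# We can remove every character that is neither a word character
# nor whitespace to get rid of punctuation in a sentence.
# Note: inside square brackets, | is a literal pipe rather than
# "or", so the classes are simply listed side by side:

assert sub(r'[^\w\s]', '', "the cat's meow. 100!! Cool!") == 'the cats meow 100 Cool'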
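Putting the pieces together, the `sub` calls above can be chained into a small preprocessing step. The sketch below assumes a hypothetical helper named `normalize`; the name, the lowercasing step and the order of operations are illustrative choices rather than anything prescribed by `re`.

```python
from re import sub

def normalize(text):
    """Hypothetical helper: lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    # Drop anything that is neither a word character nor whitespace
    text = sub(r'[^\w\s]', '', text)
    # Collapse runs of spaces, tabs and newlines into one space
    text = sub(r'\s+', ' ', text)
    return text.strip()

cleaned = normalize("The cat's   meow.\n100!! Cool!")
```

Here `cleaned` is `'the cats meow 100 cool'`. Removing punctuation before collapsing whitespace avoids leaving double spaces where a punctuation mark stood between two spaces.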