# Regular Expressions

## Regular Expressions and Grep

In theoretical computer science and formal language theory, a [regular expression](https://en.wikipedia.org/wiki/Regular_expression) (abbreviated regex or regexp and sometimes called a rational expression) is a sequence of characters that defines a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. The concept arose in the 1950s, when the American mathematician Stephen Kleene formalized the description of a regular language, and came into common use with the Unix text processing utilities ed, an editor, and [grep](https://en.wikipedia.org/wiki/Grep) (global regular expression print), a filter.

grep is a command-line utility for searching plain-text data sets for lines matching a regular expression. Grep was originally developed for the Unix operating system, but is available today for all Unix-like systems, and comparable regular-expression matching is built into languages like Python and Perl.

# Regular Expressions Examples

Basic regex syntax

```
.	Normally matches any character except a newline.

( )	Groups a series of pattern elements into a single element.
When you match a pattern within parentheses, you can use any of $1, $2, ... later to refer to the previously matched pattern (Perl syntax; Python writes \1, \2, ... instead).

+	Matches the preceding pattern element one or more times.
?	Matches the preceding pattern element zero or one times.
*	Matches the preceding pattern element zero or more times.
|	Separates alternate possibilities.

\w	Matches an alphanumeric character, including "_";
same as [A-Za-z0-9_] in ASCII, and
[\p{Alphabetic}\p{GC=Mark}\p{GC=Decimal_Number}\p{GC=Connector_Punctuation}] in Unicode.

\W	Matches a non-alphanumeric character, excluding "_";
same as [^A-Za-z0-9_] in ASCII, and
[^\p{Alphabetic}\p{GC=Mark}\p{GC=Decimal_Number}\p{GC=Connector_Punctuation}] in Unicode.

\s	Matches a whitespace character,
which in ASCII are tab, line feed, form feed, carriage return, and space.

\S	Matches anything BUT a whitespace character.

\d	Matches a digit;
same as [0-9] in ASCII.

\D	Matches a non-digit;
same as [^0-9] in ASCII.

^	Matches the beginning of a line or string.

$	Matches the end of a line or string.
```
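The metacharacters above can be tried out directly in Python. The following is a minimal sketch; the sample strings and variable names are made up for illustration, and note that Python writes a backreference as `\1` where the table shows Perl's `$1`.

```python
import re

# '.' matches any character except a newline, so "c\nt" is skipped
dot_hits = re.findall(r"c.t", "cat cot c\nt cut")

# '+', '?' and '*' control how often the preceding element may repeat
plus_hits = re.findall(r"ab+", "a ab abb")              # one or more 'b'
optional_hits = re.findall(r"colou?r", "color colour")  # zero or one 'u'
star_hits = re.findall(r"ab*", "a ab abb")              # zero or more 'b'

# '|' separates alternatives
alt_hits = re.findall(r"cat|dog", "cat, dog, bird")

# '^' and '$' anchor the pattern to the start and end of the string
anchored = re.search(r"^the.*dog$", "the lazy dog") is not None

# Parentheses capture a group; \1 refers back to what group 1 matched,
# so this finds doubled letters (findall returns the captured letter)
doubled = re.findall(r"(\w)\1", "bookkeeper")

print(dot_hits, plus_hits, optional_hits, star_hits, alt_hits, anchored, doubled)
```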

Some simple regex examples 

```
{^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$}  # Floating-point number

{^[A-Za-z]+$}  # Only letters

{^[[:alpha:]]+$}  # Only letters, the Unicode way

{(.)\1{3}} $string {\1} result  # Back references

([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})  # IP numbers
```
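The examples above are written with brace quoting rather than as Python strings. As a rough sketch, the same patterns can be tested in Python as follows. The test strings are invented for illustration. Python's `re` module has no POSIX bracket classes such as `[[:alpha:]]`, so `[^\W\d_]` (word characters that are neither digits nor underscore) is swapped in here as a stand-in for "Unicode letters".

```python
import re

float_re = re.compile(r"^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$")
letters_re = re.compile(r"^[A-Za-z]+$")
unicode_letters_re = re.compile(r"^[^\W\d_]+$")  # stand-in for [[:alpha:]]
repeat_re = re.compile(r"(.)\1{3}")              # same character four times in a row
ip_re = re.compile(r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})")

# Which candidate strings look like floating-point numbers?
float_ok = [s for s in ["3.14", "-1e10", ".5", "abc", "1."] if float_re.match(s)]

ascii_only = bool(letters_re.match("naïve"))             # 'ï' is outside A-Za-z
unicode_ok = bool(unicode_letters_re.match("naïve"))

run = repeat_re.search("baaaad").group(0)

octets = ip_re.search("host 192.168.0.1 is up").groups()
# The pattern checks the shape only, not that each octet is at most 255
out_of_range = bool(ip_re.search("999.999.999.999"))

print(float_ok, ascii_only, unicode_ok, run, octets, out_of_range)
```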


Some useful RegEx:

| Character | Description                 | Character | Description                     |
|-----------|-----------------------------|-----------|---------------------------------|
| ``"\d"``  | Match any digit             | ``"\D"``  | Match any non-digit             |
| ``"\s"``  | Match any whitespace        | ``"\S"``  | Match any non-whitespace        |
| ``"\w"``  | Match any alphanumeric char | ``"\W"``  | Match any non-alphanumeric char |

See Python's [regular expression syntax documentation](https://docs.python.org/3/library/re.html#re-syntax).
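To make the shorthand classes concrete, here is a small sketch that applies each one to a sample string. The string `sample` is an invented example.

```python
import re

sample = "Room 42, floor 3\tWing_B"

digits = re.findall(r"\d+", sample)      # runs of digits
words = re.findall(r"\w+", sample)       # runs of letters, digits and '_'
non_words = re.findall(r"\W+", sample)   # everything between the words
pieces = re.split(r"\s+", sample)        # split on any run of whitespace
tokens = re.findall(r"\S+", sample)      # runs of non-whitespace

print(digits)
print(words)
print(non_words)
print(pieces)
```

Splitting on `\s+` and collecting `\S+` give the same tokens here, because the sample has no leading or trailing whitespace.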

In [None]:
%matplotlib inline 
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import warnings
import random
from datetime import datetime
random.seed(datetime.now().timestamp())  # seed() rejects datetime objects in recent Python versions
warnings.filterwarnings('ignore')

# Make plots larger
plt.rcParams['figure.figsize'] = (10, 6)

In [None]:
import re
s='the quick brown fox jumped over the lazy dog'
regex = re.compile(r'\s+')  # raw string, so the backslash reaches the regex engine unchanged
w=regex.split(s)
w

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

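Compiling a pattern once and then calling a method on the compiled object:

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to calling the module-level function directly:

result = re.match(pattern, string)

Saving the compiled object is more efficient when the same expression is used several times in one program.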
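A quick check of that equivalence, as a minimal sketch. The values of `pattern` and `string` are hypothetical and chosen only for illustration.

```python
import re

pattern = r"\d{4}"            # hypothetical pattern: four digits
string = "2017 was the year"  # hypothetical input

# Compile once, then call the method on the compiled object
prog = re.compile(pattern)
compiled_result = prog.match(string)

# Or pass the pattern to the module-level function each time
direct_result = re.match(pattern, string)

print(compiled_result.group(), direct_result.group())
print(compiled_result.span() == direct_result.span())
```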

In [None]:
pattern='o'
for t in w:
    print ("Testing ", t)
    if re.search(pattern,t):
        print(repr(t), "matches")
    else:
        print(repr(t), "does not match")

Testing  the
'the' does not match
Testing  quick
'quick' does not match
Testing  brown
'brown' matches
Testing  fox
'fox' matches
Testing  jumped
'jumped' does not match
Testing  over
'over' matches
Testing  the
'the' does not match
Testing  lazy
'lazy' does not match
Testing  dog
'dog' matches


### Matching Versus Searching

Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).


| Method/Attribute | Purpose                                                                      |
|------------------|------------------------------------------------------------------------------|
| match()          | Determine if the RE matches at the beginning of the string.                  |
| search()         | Scan through a string, looking for any location where this RE matches.       |
| findall()        | Find all substrings where the RE matches, and return them as a list.         |
| finditer()       | Find all substrings where the RE matches, and return them as an iterator(*). |

(*) an iterator works very much like a list in that for instance you can loop over it, but items are computed on the fly as they are needed, so it is more memory-efficient.
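The four methods in the table can be compared side by side on the same sentence. This is a minimal sketch; the pattern `o\w` is an arbitrary choice for illustration.

```python
import re

text = "the quick brown fox jumped over the lazy dog"
pattern = re.compile(r"o\w")   # an 'o' followed by one word character

at_start = pattern.match(text)    # None: the text does not start with 'o'
first = pattern.search(text)      # first match anywhere in the text
all_hits = pattern.findall(text)  # list of matched substrings

# finditer yields match objects lazily, so positions are available too
positions = [(m.start(), m.group()) for m in pattern.finditer(text)]

print(at_start)
print(first.span(), first.group())
print(all_hits)
print(positions)
```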


In [None]:
pattern='o'
for t in w:
    print ("Testing ", t)
    if re.match(pattern,t):
        print(repr(t), "matches")
    else:
        print(repr(t), "does not match")

Testing  the
'the' does not match
Testing  quick
'quick' does not match
Testing  brown
'brown' does not match
Testing  fox
'fox' does not match
Testing  jumped
'jumped' does not match
Testing  over
'over' matches
Testing  the
'the' does not match
Testing  lazy
'lazy' does not match
Testing  dog
'dog' does not match


In [None]:
f = re.compile('fox')
s2=f.sub('BEAR', s)
s2

'the quick brown BEAR jumped over the lazy dog'

## e-mail - \w+@\w+\.[a-z]{3}

\w+@\w+\.[a-z]{3}


`\w+` matches one or more word characters (`\w` is equal to `[a-zA-Z0-9_]` in ASCII). The `+` quantifier matches between one and unlimited times, as many times as possible, giving back as needed (greedy).

`@` matches the character @ literally.

`\w+` again matches one or more word characters, greedily.

`\.` matches the character . literally.

`[a-z]{3}` matches exactly three lowercase letters: `[a-z]` matches a single character in the range between a (index 97) and z (index 122), case sensitive, and the `{3}` quantifier repeats it exactly 3 times.


In [None]:
email = re.compile(r'\w+@\w+\.[a-z]{3}')  # raw string keeps \w and \. intact
e='Professor Bear is awesome! E-mail him nikbearbrown@gmail.com or nik@ucla.edu if you have questions'
email.findall(e)

['nikbearbrown@gmail.com', 'nik@ucla.edu']
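The pattern works for the two addresses above, but it is deliberately simple. The sketch below shows where it falls short: dots in the name, domains with more than one dot, and endings that are not exactly three letters. The addresses and the relaxed pattern `loose` are hypothetical illustrations, not a complete e-mail validator.

```python
import re

strict = re.compile(r"\w+@\w+\.[a-z]{3}")
# Hypothetical relaxed variant: allow dots on both sides of '@'
# and any ending of two or more lowercase letters
loose = re.compile(r"[\w.]+@[\w.]+\.[a-z]{2,}")

samples = "first.last@example.org, team@site.info, me@mail.example.co"

strict_hits = strict.findall(samples)
loose_hits = loose.findall(samples)

print(strict_hits)  # truncated or partial addresses
print(loose_hits)   # the full addresses
```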

In [None]:
re.findall("[Hh][ea]llo", "Hallo Bear, hello Nik!")

['Hallo', 'hello']

#### Parentheses indicate *groups* to extract

If one wants to extract components rather than the full match, then one uses parentheses to *group* the results.

In [None]:
email2 = re.compile(r'([\w.]+)@(\w+)\.([a-z]{3})')
email2.findall(e)

[('nikbearbrown', 'gmail', 'com'), ('nik', 'ucla', 'edu')]
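Groups can also be read from a single match object instead of through findall. The following sketch reuses the grouped e-mail pattern; the named-group variant and its group names (`user`, `host`, `tld`) are assumptions added for illustration.

```python
import re

sentence = "E-mail nik@ucla.edu with questions"

email2 = re.compile(r"([\w.]+)@(\w+)\.([a-z]{3})")
m = email2.search(sentence)

whole = m.group(0)   # the entire match
user = m.group(1)    # first parenthesized group
parts = m.groups()   # all groups as a tuple

# (?P<name>...) gives a group a name, retrievable with groupdict()
named = re.compile(r"(?P<user>[\w.]+)@(?P<host>\w+)\.(?P<tld>[a-z]{3})")
fields = named.search(sentence).groupdict()

print(whole, user, parts)
print(fields)
```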

In [None]:
seuss = ["You have brains in your head.",
           "You have feet in your shoes.", 
           "You can steer yourself any direction you choose.",            
           "You're on your own.", 
           "And you know what you know.", 
           "And YOU are the one who'll decide where to go...",            
           "- Dr. Seuss"]
seuss

['You have brains in your head.',
 'You have feet in your shoes.',
 'You can steer yourself any direction you choose.',
 "You're on your own.",
 'And you know what you know.',
 "And YOU are the one who'll decide where to go...",
 '- Dr. Seuss']

In [None]:
re.findall("you",seuss[0])

['you']

In [None]:
re.findall("you",seuss[0],re.IGNORECASE)

['You', 'you']

In [None]:
print(seuss[5])
vowel_pattern = re.compile(r"a|e|o|u|i")
no_vowels = vowel_pattern.sub('', seuss[5])
print(no_vowels)

And YOU are the one who'll decide where to go...
And YOU r th n wh'll dcd whr t g...


In [None]:
vowel_pattern_cap = re.compile(r"a|A|e|E|o|O|u|U|i|I")
no_vowels = vowel_pattern_cap.sub('', seuss[5])
print(no_vowels)

nd Y r th n wh'll dcd whr t g...
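Listing every vowel in both cases works, but a character class combined with the `re.IGNORECASE` flag expresses the same idea more compactly. A minimal sketch, with the variable names chosen here for illustration:

```python
import re

line = "And YOU are the one who'll decide where to go..."

# [aeiou] matches any single vowel; IGNORECASE covers the capitals too
vowels = re.compile(r"[aeiou]", re.IGNORECASE)
no_vowels = vowels.sub("", line)

print(no_vowels)
```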


### Search and Replace
One of the most important re methods is sub, which performs search and replace.

Syntax

re.sub(pattern, repl, string, count=0, flags=0)

This method replaces occurrences of the RE pattern in string with repl, substituting all occurrences unless count is provided, in which case at most count replacements are made. It returns the modified string.
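A short sketch of these options on the fox sentence used earlier: limiting the number of replacements, referring to captured groups in the replacement, and passing a function as the replacement. The specific patterns are illustrative choices.

```python
import re

s = "the quick brown fox jumped over the lazy dog"

# Replace every occurrence (the default)
all_replaced = re.sub(r"the", "a", s)

# Replace only the first occurrence
first_only = re.sub(r"the", "a", s, count=1)

# \1 and \2 in the replacement refer to the captured groups:
# swap the last two words of the sentence
swapped = re.sub(r"(\w+) (\w+)$", r"\2 \1", s)

# The replacement may also be a function that receives the match object:
# upper-case every word of five or more characters
shouted = re.sub(r"\b\w{5,}\b", lambda m: m.group().upper(), s)

print(all_replaced)
print(first_only)
print(swapped)
print(shouted)
```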

In [None]:
no_wspace='You have brains in your head.'
wspace='     You have brains in your head.      '

In [None]:
print (len(wspace))
print (wspace)
st=re.sub('[ ]+$','',wspace)
print (len(st))
print (st)
st=re.sub("^[ ]+",'',st)
print (len(st))
print (st)

40
     You have brains in your head.      
34
     You have brains in your head.
29
You have brains in your head.
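The two substitutions above can be combined into one by joining the anchored patterns with `|`. This is a minimal sketch; it uses `\s` rather than `[ ]`, so it also removes tabs and newlines at the edges, which matches what the built-in `str.strip()` does by default.

```python
import re

wspace = "     You have brains in your head.      "

# Leading whitespace OR trailing whitespace, removed in a single pass
trimmed = re.sub(r"^\s+|\s+$", "", wspace)

print(len(wspace), len(trimmed))
print(trimmed == wspace.strip())
```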


In [None]:
pattern = re.compile(r"x{3,5}")
print(pattern.match(""))
print(pattern.match("x"))
print(pattern.match("xx"))
print(pattern.match("xxx"))
print(pattern.match("xxxx"))
print(pattern.match("xxxxx"))
print(pattern.match("xxxxxx"))
print(pattern.match("xxxxxxxx"))

None
None
None
<_sre.SRE_Match object; span=(0, 3), match='xxx'>
<_sre.SRE_Match object; span=(0, 4), match='xxxx'>
<_sre.SRE_Match object; span=(0, 5), match='xxxxx'>
<_sre.SRE_Match object; span=(0, 5), match='xxxxx'>
<_sre.SRE_Match object; span=(0, 5), match='xxxxx'>


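The pattern x{3,5} matches the letter 'x' when it occurs 3, 4 or 5 times in a row. Because match only anchors the start of the string, longer runs such as "xxxxxx" still match, but only their first five characters.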
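If the whole string must consist of three to five x's, `fullmatch` can be used instead of `match`. The sketch below compares the two on runs of zero to eight x's; the list names are chosen here for illustration.

```python
import re

pattern = re.compile(r"x{3,5}")
lengths = range(0, 9)

# match: succeeds whenever the string STARTS with 3 to 5 x's
prefix_ok = [n for n in lengths if pattern.match("x" * n)]

# fullmatch: succeeds only if the ENTIRE string is 3 to 5 x's
whole_ok = [n for n in lengths if pattern.fullmatch("x" * n)]

print(prefix_ok)
print(whole_ok)
```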

Last update October 3, 2017

The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT).