<img src="https://kasunkodagoda.gallerycdn.vsassets.io/extensions/kasunkodagoda/regex-match-replace/2.1.5/1567104415777/Microsoft.VisualStudio.Services.Icons.Default" style="float: left; margin: 20px; height: 55px">

# Regular Expressions

_Author: Alfred Zou_

---

* Regular expressions are a powerful way to search within strings
* They have wide applications in bash's grep, SQL, Python and text editors
* The downside of regex is that it can be overly complicated, and downright intimidating for those who don't understand it
* Further information on searching with grep can be found [here](http://opensourceforu.com/2012/06/beginners-guide-gnu-grep-basics/)
* A good way to learn regex is through [RegExr](http://regexr.com/). This website lets you visualise matches

### Regex Syntax

* The standard regex syntax is `re.method(search_pattern, string, flags)`
* Regex searches the string from left to right using the search_pattern
* There are multiple useful regex methods such as `findall()`, `search()`, `sub()`, etc.
* Optional flags can be used for extra search customisation

``` python
import re
re.method(search_pattern, string, flags)
```
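
As a minimal sketch of that call shape, the snippet below runs three real `re` functions on a made-up sample string (`text` and its contents are illustrative assumptions). The flag is passed by keyword, which is the safer habit because some functions take other positional arguments before `flags`.

```python
import re

text = "Cat hat CAT mat"  # illustrative sample string

# findall returns every non-overlapping match as a list of strings
all_matches = re.findall(r"cat", text, flags=re.I)

# search returns a match object for the first match, or None
first = re.search(r"[hm]at", text)

# sub returns a new string with every match replaced
replaced = re.sub(r"cat", "dog", text, flags=re.I)

print(all_matches)    # ['Cat', 'CAT']
print(first.group())  # hat
print(replaced)       # dog hat dog mat
```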

### Search Pattern Syntax

#### Characters
* Characters are the fundamental building blocks of search patterns. They are single characters that can be modified by quantifiers, logic, groups and lookarounds.
* `a` - Literals. Exact character match.
* `[tT]` or `[a-z]` - Character classes. Matches one character from those listed within the brackets.
* `[^a]` - Character class negation. Negates the whole character class
* `\$` - Special characters need to be escaped using `\`
* `.` - any character except a line break (`\n`)
* `\w` - word characters, including digits and underscores
* `\W` - opposite of `\w`
* `\d` - all digit characters
* `\D` - opposite of `\d`
* `\s` - all white space
* `\S` - opposite of `\s`
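
The character types above can be tried out with a short sketch like the one below. The `sample` string is an assumption made up for illustration; the functions are standard `re` calls.

```python
import re

sample = "Room 42_b costs $3.50\nnext line"  # illustrative sample string

digits = re.findall(r"\d", sample)         # every single digit
words = re.findall(r"\w+", sample)         # runs of letters, digits and underscores
symbols = re.findall(r"[^\w\s]", sample)   # anything that is neither word nor whitespace
price = re.search(r"\$\d", sample).group() # escaped $ followed by a digit
first_line = re.search(r".+", sample).group()  # . stops at the line break

print(digits)      # ['4', '2', '3', '5', '0']
print(words)       # ['Room', '42_b', 'costs', '3', '50', 'next', 'line']
print(symbols)     # ['$', '.']
print(price)       # $3
print(first_line)  # Room 42_b costs $3.50
```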

#### Quantifiers
* Quantifiers can be placed after characters or groups to modify them
* `?` - 0 or 1
* `*` - 0 or more
* `+` - 1 or more
* `{m}` - exactly m times
* `{m,}` - m or more times
* `{m,n}` - between m and n times, inclusive
* Quantifiers are inherently greedy, which means they will match the maximum characters they can
* `?` - added to the end of a quantifier makes it lazy, which means it will match the minimum characters it can
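
The difference between greedy and lazy matching is easiest to see side by side. This is a small sketch with an invented `html` string; it is only meant to illustrate quantifier behaviour, not to suggest regex as an HTML parser.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # illustrative sample string

# Greedy: .+ grabs as much as possible, so one match spans the whole string
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first closing bracket, so each tag matches separately
lazy = re.findall(r"<.+?>", html)

# ? makes the preceding character optional
optional = re.findall(r"colou?r", "color colour")

# {2,3} matches two or three digits, preferring three
counted = re.findall(r"\d{2,3}", "1 22 333 4444")

print(greedy)    # ['<b>bold</b> and <i>italic</i>']
print(lazy)      # ['<b>', '</b>', '<i>', '</i>']
print(optional)  # ['color', 'colour']
print(counted)   # ['22', '333', '444']
```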

#### Anchors and Boundaries
* Anchors and boundaries are included at the beginning or end of a search string to denote positional matches
* `^` - beginning of string
* `$` - end of string
* `\b` - Word boundary. Beginning or end of word
* `\B` - opposite of `\b`
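
A short sketch of how anchors and boundaries restrict where a match may start. The sample `text` is an assumption chosen so that `cat` appears both as a whole word and inside another word; the start positions are collected with `re.finditer()`.

```python
import re

text = "cat concat cat"  # illustrative sample string

# \b...\b: "cat" as a whole word only
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \B: "cat" that does NOT start at a word boundary (the one inside "concat")
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]

# ^ and $: only at the very beginning or very end of the string
at_start = [m.start() for m in re.finditer(r"^cat", text)]
at_end = [m.start() for m in re.finditer(r"cat$", text)]

print(whole_words)  # [0, 11]
print(inside_word)  # [7]
print(at_start)     # [0]
print(at_end)       # [11]
```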

#### Lookarounds
* A lookaround checks whether a pattern appears immediately ahead of (lookahead) or behind (lookbehind) the current position, without consuming any characters
    * A positive lookaround allows a match only if the pattern is present.
    * A negative lookaround allows a match only if the pattern is absent.
* The lookaround text is never returned with the match
* `(?=string)` - Positive lookahead. 
* `(?<=string)` - Positive lookbehind.
* `(?!string)` - Negative lookahead.
* `(?<!string)` - Negative lookbehind.
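
The four lookarounds can be compared on one string. This is a sketch under the assumption of a made-up `text` containing prices and a quantity; note that none of the results include the `$` or the word `units`, since lookarounds are not part of the match.

```python
import re

text = "price: $30, cost: $45, 60 units"  # illustrative sample string

# Positive lookbehind: digits that come right after a dollar sign
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Positive lookahead: digits that are followed by " units"
before_units = re.findall(r"\d+(?= units)", text)

# Negative lookbehind: whole numbers NOT preceded by a dollar sign
no_dollar = re.findall(r"(?<!\$)\b\d+", text)

# Negative lookahead: whole numbers NOT followed by " units"
not_units = re.findall(r"\b\d+\b(?! units)", text)

print(after_dollar)  # ['30', '45']
print(before_units)  # ['60']
print(no_dollar)     # ['60']
print(not_units)     # ['30', '45']
```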

#### Logic, Grouping and Capture
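* `[ab]` or `(cat|dog)` - Alternation. A character class matches any one of its characters, while `|` matches the whole pattern to its left or to its right. Inside a character class `|` is a literal character, so `[a|b]` would also match `|`
* `((abc)def)` - Grouping. Parentheses capture the matched text, which is returned by regex methods
* Captured contents can be recalled and are numbered by the position of their opening parenthesis, left to right, so an outer group comes before the groups nested inside it
* Using raw strings, `r'search_pattern'`, is recommended, otherwise recalling captured contents won't work because Python reads `'\1'` as an escape sequence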
    * `\1` - Group 1
    * `\2` - Group 2
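
The group numbering and the raw-string advice can be checked with a sketch like this one. The date string, the repeated-word sentence and the `alice@example` address are invented for illustration.

```python
import re

# Groups are numbered by the position of their opening parenthesis
m = re.search(r"((\d{4})-(\d{2}))-(\d{2})", "Released 2019-08-29")
whole = m.group(0)  # the full match
outer = m.group(1)  # outer group opens first
year = m.group(2)
month = m.group(3)
day = m.group(4)
print(whole, outer, year, month, day)  # 2019-08-29 2019-08 2019 08 29

# \1 inside the pattern recalls whatever group 1 matched (here: a repeated word)
repeated = re.findall(r"\b(\w+) \1\b", "it is is what it it is")
print(repeated)  # ['is', 'it']

# \2 and \1 inside the replacement swap the two captured parts
swapped = re.sub(r"(\w+)@(\w+)", r"\2 at \1", "alice@example")
print(swapped)  # example at alice

# Without the r prefix, Python turns '\1' into the control character chr(1)
print("\1" == "\x01")  # True
```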

### Flags
* Flags add additional search customisation
* `re.I` - case insensitive
* `re.M` - Multiline. Allows `^` & `$` to match at the beginning and end of every line (lines are separated by `\n`), as opposed to only at the beginning and end of the string
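
A small sketch of both flags on an invented three-line string. Flags can be combined with `|`.

```python
import re

text = "Apple\napple pie\nAPPLE"  # illustrative sample string

# No flags: case-sensitive
plain = re.findall(r"apple", text)

# re.I: ignore case
ignore_case = re.findall(r"apple", text, flags=re.I)

# Without re.M, ^ and $ refer to the whole string, so nothing matches
whole_string = re.findall(r"^apple$", text, flags=re.I)

# With re.M, ^ and $ apply to each line
per_line = re.findall(r"^apple$", text, flags=re.I | re.M)

print(plain)         # ['apple']
print(ignore_case)   # ['Apple', 'apple', 'APPLE']
print(whole_string)  # []
print(per_line)      # ['Apple', 'APPLE']
```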

### Regex Methods

##### Findall
* Returns a list of all matched strings, or
* If the pattern contains groups, a list of tuples of the groups, following the captured contents order. In this case the full matched string is not included in the final output

In [1]:
# The output in this case is a list of tuples, one per match. Each tuple contains the captured groups.
import re
my_string = 'bob bob_ ralph_ bobbobbobbybobbob bab_ ralph_ bob'
re.findall(r'((b.b_) (ralph))_ bob', my_string)

[('bob_ ralph', 'bob_', 'ralph'), ('bab_ ralph', 'bab_', 'ralph')]

##### Search
* Finds the first match only. Use `findall()` to get every match
* Returns a match object (or `None` if nothing matches) whose contents are read with `group()`
    * `group()` returns the whole match
    * `group(1)` returns group 1
    * `group(2)` returns group 2

In [2]:
# First match only
import re
my_string = 'star cat hat mat'
results = re.search(r'.at', my_string)
print(results.group())

cat


In [3]:
# Demonstration of the group outputs
import re
my_string = 'bob bob_ ralph_ bobbobbobbybobbob'
results = re.search(r'((bob) (bob_)) ralph_', my_string)
print(results.group())
print(results.group(1))
print(results.group(2))
print(results.group(3))

bob bob_ ralph_
bob bob_
bob
bob_


##### Match
* Match is similar to search, but it only matches at the beginning of the string

In [4]:
# Matches at the start of the string only
import re
my_string = 'star cat hat mat'
results = re.match(r'.{3}', my_string)
print(results.group())

sta
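
To make the difference from `search()` concrete, here is a short sketch on the same string. `re.fullmatch()` is included for comparison as an extra standard `re` function.

```python
import re

my_string = 'star cat hat mat'

# match only looks at the start of the string, so 'cat' is not found
match_cat = re.match(r'cat', my_string)

# search scans the whole string
search_cat = re.search(r'cat', my_string)

# match succeeds when the pattern fits at position 0
match_star = re.match(r'star', my_string)

# fullmatch requires the pattern to cover the entire string
full_star = re.fullmatch(r'star', my_string)

print(match_cat)           # None
print(search_cat.group())  # cat
print(match_star.group())  # star
print(full_star)           # None
```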


##### Sub
* Regex searches the string left to right using the search_pattern and replaces every match with the replacement
* Use raw strings to call group contents with `\1` & `\2`
* The format is:

``` python
import re
# Pass flags by keyword: the fourth positional argument of re.sub is count
re.sub(search_pattern, replacement, string, flags=flags)
```

In [5]:
my_string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
re.sub(r'(\w+)@[\w.]+', r'\1@hotmail.com', my_string)

'purple alice@hotmail.com, blah monkey bob@hotmail.com blah dishwasher'

### Pandas Integration
* Regex capabilities can be used on a pandas Series containing a collection of strings
* The general format is:

``` python
import pandas as pd
my_series = pd.Series([string1,string2,string3])
my_series.str.method(search_pattern)
```

##### str.split()
* Splits strings in a similar way to the `split()` method for strings, but a regex can be used as the separator

In [6]:
import pandas as pd
my_series = pd.Series(['tim@hotmail.com','trevor@outlook.com','ashely@google.com','tiffany@msn.com'])
my_series.str.split('[@.]')

0       [tim, hotmail, com]
1    [trevor, outlook, com]
2     [ashely, google, com]
3       [tiffany, msn, com]
dtype: object

##### str.replace()

In [7]:
import pandas as pd
my_series = pd.Series(['tim@hotmail.com','trevor@outlook.com','ashely@google.com','tiffany@msn.com'])
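# regex=True is needed in pandas 2.0+, where the pattern is treated as a literal string by default
my_series.str.replace(r'(?<=@).*', 'yahoo.com', regex=True)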

0        tim@yahoo.com
1     trevor@yahoo.com
2     ashely@yahoo.com
3    tiffany@yahoo.com
dtype: object
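
The effect of the `regex` argument of `str.replace()` can be seen in a two-line comparison. This is a sketch on a shortened, illustrative version of the series above.

```python
import pandas as pd

emails = pd.Series(['tim@hotmail.com', 'trevor@outlook.com'])  # illustrative data

# regex=False: '.' is treated as a literal full stop
literal = emails.str.replace('.', '_', regex=False)

# regex=True: the lookbehind pattern replaces everything after '@'
pattern = emails.str.replace(r'(?<=@).*', 'yahoo.com', regex=True)

print(literal.tolist())  # ['tim@hotmail_com', 'trevor@outlook_com']
print(pattern.tolist())  # ['tim@yahoo.com', 'trevor@yahoo.com']
```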

##### str.contains()

In [8]:
import pandas as pd
my_series = pd.Series(['tim@hotmail.com','trevor@outlook.com','ashely@google.com','tiffany@msn.com','algoexpert.io','generalassemb.ly'])
my_series[my_series.str.contains('@')]

0       tim@hotmail.com
1    trevor@outlook.com
2     ashely@google.com
3       tiffany@msn.com
dtype: object

##### str.extract()
* Extracts capture groups into the columns of a DataFrame. Rows that don't match are filled with NaN

In [9]:
import pandas as pd
my_series = pd.Series(['tim@hotmail.com','trevor@outlook.com','ashely@google.com','tiffany@msn.com','algoexpert.io','generalassemb.ly'])
my_series.str.extract(r'(.*)@(.*)\.com')

| | 0 | 1 |
|---|---|---|
| 0 | tim | hotmail |
| 1 | trevor | outlook |
| 2 | ashely | google |
| 3 | tiffany | msn |
| 4 | NaN | NaN |
| 5 | NaN | NaN |
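
Named groups, written `(?P<name>...)`, give the extracted columns readable names instead of `0` and `1`. A minimal sketch, where the group names `user` and `domain` and the shortened series are illustrative assumptions:

```python
import pandas as pd

emails = pd.Series(['tim@hotmail.com', 'algoexpert.io'])  # illustrative data

# Each named group becomes a column with that name
parts = emails.str.extract(r'(?P<user>.*)@(?P<domain>.*)\.com')

print(list(parts.columns))        # ['user', 'domain']
print(parts.loc[0, 'user'])       # tim
print(parts.loc[0, 'domain'])     # hotmail
print(parts.loc[1].isna().all())  # True, the second string has no match
```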


##### df.filter()
* Filters the columns of a df by column name

In [10]:
import pandas as pd
import numpy as np

# Fix the seed, so we can always generate the same np array
# Generate np array with shape (3,32), one column per column name
np.random.seed(44)
matrix = np.random.rand(3, 32)

# Column names we're filtering through
column_names = ['id','malignant',
                'nucleus_mean','nucleus_se','nucleus_worst',
                'texture_mean','texture_se','texture_worst',
                'perimeter_mean','perimeter_se','perimeter_worst',
                'area_mean','area_se','area_worst',
                'smoothness_mean','smoothness_se','smoothness_worst',
                'compactness_mean','compactness_se','compactness_worst',
                'concavity_mean','concavity_se','concavity_worst',
                'concave_pts_mean','concave_pts_se','concave_pts_worst',
                'symmetry_mean','symmetry_se','symmetry_worst',
                'fractal_dim_mean','fractal_dim_se','fractal_dim_worst']

# Create and print df
df = pd.DataFrame(matrix,columns=column_names)
df.head(3)

| | id | malignant | nucleus_mean | nucleus_se | nucleus_worst | texture_mean | texture_se | texture_worst | perimeter_mean | perimeter_se | ... | concavity_worst | concave_pts_mean | concave_pts_se | concave_pts_worst | symmetry_mean | symmetry_se | symmetry_worst | fractal_dim_mean | fractal_dim_se | fractal_dim_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.834842 | 0.104796 | 0.74464 | 0.360501 | 0.359311 | 0.609238 | 0.39378 | 0.409073 | 0.509902 | 0.710148 | ... | 0.458704 | 0.873863 | 0.25845 | 0.664851 | 0.862674 | 0.148848 | 0.56295 | 0.159155 | 0.172895 | 0.104023 |
| 1 | 0.202938 | 0.455189 | 0.794575 | 0.990823 | 0.805017 | 0.377415 | 0.515737 | 0.058899 | 0.711096 | 0.072508 | ... | 0.984284 | 0.27156 | 0.897152 | 0.164235 | 0.132358 | 0.317353 | 0.307423 | 0.422055 | 0.330996 | 0.568075 |
| 2 | 0.095285 | 0.797977 | 0.268656 | 0.911415 | 0.901744 | 0.015858 | 0.853682 | 0.675606 | 0.03768 | 0.308948 | ... | 0.11268 | 0.853298 | 0.048454 | 0.499429 | 0.50309 | 0.764727 | 0.347479 | 0.931419 | 0.299206 | 0.551129 |


In [11]:
# Keep id plus all the columns whose name ends in _mean
df.filter(regex='^id$|_mean$')

| | id | nucleus_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave_pts_mean | symmetry_mean | fractal_dim_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.834842 | 0.74464 | 0.609238 | 0.509902 | 0.456621 | 0.217899 | 0.881824 | 0.636832 | 0.873863 | 0.862674 | 0.159155 |
| 1 | 0.202938 | 0.794575 | 0.377415 | 0.711096 | 0.726058 | 0.69743 | 0.094972 | 0.108974 | 0.27156 | 0.132358 | 0.422055 |
| 2 | 0.095285 | 0.268656 | 0.015858 | 0.03768 | 0.594078 | 0.938702 | 0.108886 | 0.018255 | 0.853298 | 0.50309 | 0.931419 |


##### Row filtering by column values

In [14]:
# Resetting the df
import seaborn as sns
df = sns.load_dataset('iris')
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [15]:
# Keep only the rows whose species contains the pattern 'set'
df[df['species'].str.contains(r'set')].head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
