# **Regular Expressions Demystified**

This notebook aims to demystify some of the patterns we encounter when working with regular-expression syntax and methods from different libraries.
<br><br>
We begin by importing some modules...

In [1]:
import re
import certifi
import json
import pandas as pd

import urllib3
from urllib3 import request

# Handle Certification Validation
http = urllib3.PoolManager(
    cert_reqs = 'CERT_REQUIRED',
    ca_certs = certifi.where())

# Get data from API
url = 'https://data.nasa.gov/resource/y77d-th95.json'
r = http.request('GET', url)
assert r.status == 200

data = json.loads(r.data.decode('utf-8'))

# **Character Classes**
These patterns match characters based on their type (e.g. alphanumeric characters, digits, whitespace).

Some common patterns used are:


*   **\w** - Matches word characters: letters, digits and the underscore
*   **\d** - Matches digits, 0 to 9
*   **\s** - Matches whitespace characters, including \t, \n, \r, \f, \v and space characters

and their inversions in UPPERCASE:
*   **\W** - Matches non-word characters
*   **\D** - Matches any non-digits
*   **\S** - Matches non-whitespace characters

Finally, to match special characters literally, such as hyphens, parentheses or curly braces, we use the **\** backslash character to escape them (e.g. '\-', '\{', '\)', etc.)
<br><br>
Here are some examples:

In [2]:
# Using a tabular dataset by NASA on Earth Meteorite Landings
meteorites = pd.json_normalize(data)

# We can write a simple function to keep only the rows whose column matches a regex pattern
def find_meteorites(df, column, pattern):
  return df[df[column].str.contains(pattern)]

In [3]:
# Finding meteorites whose names contain 'ac' with a word character on either side...
find_meteorites(meteorites, 'name', r'\wac\w').head()

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation.type,geolocation.coordinates,:@computed_region_cbhk_fwbd,:@computed_region_nnqa_25f4
0,Aachen,1,Valid,L5,21,Fell,1880-01-01T00:00:00.000,50.775,6.08333,Point,"[6.08333, 50.775]",,
65,Bachmut,4917,Valid,L6,18000,Fell,1814-01-01T00:00:00.000,48.6,38.0,Point,"[38, 48.6]",,
85,Bath Furnace,4975,Valid,L6,86000,Fell,1902-01-01T00:00:00.000,38.25,-83.75,Point,"[-83.75, 38.25]",36.0,1921.0
122,Black Moshannan Park,5065,Valid,L5,705,Fell,1941-01-01T00:00:00.000,40.91667,-78.08333,Point,"[-78.08333, 40.91667]",48.0,2495.0
123,Blackwell,5068,Valid,L5,2381,Fell,1906-01-01T00:00:00.000,36.83333,-97.33333,Point,"[-97.33333, 36.83333]",20.0,2164.0


In [4]:
# Finding meteorites with digits in their names...
find_meteorites(meteorites, 'name', r'\d').head()

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation.type,geolocation.coordinates,:@computed_region_cbhk_fwbd,:@computed_region_nnqa_25f4
37,Northwest Africa 5815,50693,Valid,L5,256.8,Found,,0.0,0.0,Point,"[0, 0]",,
137,Boumdeid (2003),57168,Valid,L6,190.0,Fell,2003-01-01T00:00:00.000,17.71067,-11.3715,Point,"[-11.3715, 17.71067]",,
138,Boumdeid (2011),57167,Valid,L6,3599.0,Fell,2011-01-01T00:00:00.000,17.17493,-11.34133,Point,"[-11.34133, 17.17493]",,
480,Kijima (1906),12305,Valid,Stone-uncl,331.0,Fell,1906-01-01T00:00:00.000,36.85,138.38333,Point,"[138.38333, 36.85]",,
520,Cumulus Hills 04075,32531,Valid,Pallasite,9.6,Found,2003-01-01T00:00:00.000,,,,,,


In [5]:
# Finding meteorites whose names contain whitespace, i.e. consist of 2 words or more...
find_meteorites(meteorites, 'name', r'\s').head()

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation.type,geolocation.coordinates,:@computed_region_cbhk_fwbd,:@computed_region_nnqa_25f4
5,Adhi Kot,379,Valid,EH4,4239,Fell,1919-01-01T00:00:00.000,32.1,71.8,Point,"[71.8, 32.1]",,
6,Adzhi-Bogdo (stone),390,Valid,LL3-6,910,Fell,1949-01-01T00:00:00.000,44.83333,95.16667,Point,"[95.16667, 44.83333]",,
9,Aguila Blanca,417,Valid,L,1440,Fell,1920-01-01T00:00:00.000,-30.86667,-64.55,Point,"[-64.55, -30.86667]",,
10,Aioun el Atrouss,423,Valid,Diogenite-pm,1000,Fell,1974-01-01T00:00:00.000,16.39806,-9.57028,Point,"[-9.57028, 16.39806]",,
17,Al Rais,446,Valid,CR2-an,160,Fell,1957-01-01T00:00:00.000,24.41667,39.51667,Point,"[39.51667, 24.41667]",,


In [6]:
# Finding meteorites with a single word character enclosed in parentheses in their names...
find_meteorites(meteorites, 'name', r'\(\w\)').head()

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,geolocation.type,geolocation.coordinates,:@computed_region_cbhk_fwbd,:@computed_region_nnqa_25f4
94,Benares (a),5011,Valid,LL4,3700.0,Fell,1798-01-01T00:00:00.000,25.36667,82.91667,Point,"[82.91667, 25.36667]",,
316,Galim (a),10848,Valid,LL6,36.1,Fell,1952-01-01T00:00:00.000,7.05,12.43333,Point,"[12.43333, 7.05]",,
317,Galim (b),10849,Valid,EH3/4-an,28.0,Fell,1952-01-01T00:00:00.000,7.05,12.43333,Point,"[12.43333, 7.05]",,
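
The character classes above can also be tried without downloading the dataset. The following is a minimal sketch using only the `re` module; the `names` list is a hypothetical sample loosely modelled on the meteorite names shown above, not data taken from the API.

```python
import re

# Hypothetical sample names, loosely modelled on the meteorite names above
names = ["Aachen", "Adhi Kot", "Boumdeid (2003)", "Galim (a)"]

# \d: names containing at least one digit
with_digits = [n for n in names if re.search(r"\d", n)]

# \s: names containing whitespace (2 words or more)
with_space = [n for n in names if re.search(r"\s", n)]

# \(\w\): a single word character enclosed in literal parentheses
single_in_parens = [n for n in names if re.search(r"\(\w\)", n)]

# \w also matches the underscore, while \W matches everything else
word_chars = re.findall(r"\w", "a_1 -!")
non_word_chars = re.findall(r"\W", "a_1 -!")

print(with_digits)
print(with_space)
print(single_in_parens)
print(word_chars, non_word_chars)
```

Note that 'Boumdeid (2003)' is not matched by `\(\w\)`, because that pattern allows exactly one character between the parentheses.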


# **Special Characters**
This category of patterns uses special characters to match expressions by position (anchors) or a given number of times (quantifiers). Quantifiers are greedy by default: they match as many repetitions as possible.

These patterns include:


*   **^** - Matches the start of a string
*   **$** - Matches end of a string
*   **+** - Matches 1 or more times
*   **\*** - Matches 0 or more times
*   **?** - Matches 0 or 1 times

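and the following patterns dictate the number of times an expression is to be matched (written without a space after the comma, otherwise Python's re module treats the braces as literal text):
*   **\{m\}** - Matches exactly m times
*   **\{m,n\}** - Matches from m to n times, as many as possible (greedy)
*   **\{m,n\}?** - Matches from m to n times, as few as possible (non-greedy)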

Lastly, we can combine two patterns in an either/or scenario:
* **A\|B** - Matches expression A or B. The alternatives are tried from left to right, so if A matches, B is not tried.
<br><br>
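
Before moving to the lyrics, here is a minimal sketch of the anchors, quantifiers and alternation listed above. The short strings are made up for illustration only; the variable names are assumptions and not part of the notebook.

```python
import re

text = "goo gooo g"

# '+' needs at least one 'o', '*' allows none, '?' allows at most one
plus = re.findall(r"go+", text)
star = re.findall(r"go*", text)
optional = re.findall(r"go?", text)

# {m,n} is greedy (takes up to n), {m,n}? is non-greedy (stops at m)
greedy = re.findall(r"o{2,3}", "ooooo")
lazy = re.findall(r"o{2,3}?", "ooooo")

# '^' and '$' anchor the match to the start and end of the string
starts = re.findall(r"^I am", "I am the walrus")
ends = re.findall(r"walrus$", "I am the walrus")

# A|B tries the alternatives from left to right
either = re.findall(r"egg man|egg men", "I am the egg man They are the egg men")

print(plus, star, optional)
print(greedy, lazy)
print(starts, ends, either)
```

The greedy and non-greedy results differ only in how the five 'o' characters are split up: the greedy form takes three and then two, while the non-greedy form takes two at a time.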

Walking through some examples:

In [9]:
import os
import re
import certifi
import json

import urllib3
from urllib3 import request

# Handle Certification Validation
http = urllib3.PoolManager(
    cert_reqs = 'CERT_REQUIRED',
    ca_certs = certifi.where())

# Get data from API
url = 'https://raw.githubusercontent.com/dylanfan-wj/colab-notebooks/main/iamthewalrus.json'
r = http.request('GET', url)
assert r.status == 200

song = json.loads(r.data.decode('utf-8'))

In [10]:
# Here we utilize the 're.sub' method to substitute all newline characters '\n' with ' '...
lyrics = re.sub('\n', ' ', song['Lyrics'])
lyrics

"I am he as you are he as you are me And we are all together See how they run like pigs from a gun See how they fly I'm crying Sitting on a corn flake Waiting for the van to come Corporation T-shirt, stupid bloody Tuesday Man you've been a naughty boy You let your face grow long I am the egg man They are the egg men I am the walrus Goo goo g'joob Mister City policeman sitting Pretty little policemen in a row See how they fly like Lucy in the sky, see how they run I'm crying, I'm crying I'm crying, I'm crying Yellow matter custard Dripping from a dead dog's eye Crabalocker fishwife, p*********** priestess Boy, you've been a naughty girl, you let your knickers down I am the egg man They are the egg men I am the walrus Goo goo g'joob Sitting in an English garden Waiting for the sun If the sun don't come you get a tan From standing in the English rain I am the egg man (now good sir) They are the egg men (a poor man, made tame to fortune's blows) I am the walrus Goo goo g'joob, goo goo goo 

In [11]:
# To recap what we covered in 'Character Classes'...
# Decrypt: 'H\w+\s\w+' - A word beginning with the letter 'H' (H\w+), followed
#                        by a single whitespace character (\s) and another
#                        word (\w+).
re.findall(r'H\w+\s\w+', lyrics)

['Hare Krishna']

In [13]:
# Finding all words that have 2 or more consecutive 'o' after at least one leading character... and any number of characters thereafter
oo_words = re.findall(r'\w+o{2,}\w*', lyrics)
print("Total words with 'oo': " + str(len(oo_words)))
print("Unique list of 'oo' words:")
set(oo_words)

Total words with 'oo': 44
Unique list of 'oo' words:


{'Goo', 'Joob', 'Jooba', 'bloody', 'goo', 'good', 'joob', 'jooba', 'poor'}

In [12]:
# Finding all words that have 2 or more consecutive 'o'... and a character thereafter 0 or 1 times (so 'bloody' is cut short to 'blood')
oo_words = re.findall(r'\w+o{2,}\w?', lyrics)
print("Total words with 'oo': " + str(len(oo_words)))
print("Unique list of 'oo' words:")
set(oo_words)

Total words with 'oo': 44
Unique list of 'oo' words:


{'Goo', 'Joob', 'blood', 'goo', 'good', 'joob', 'poor'}
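
The difference between `\w*` and `\w?` in the two cells above can be reproduced on a small string. This is a sketch on a hypothetical sample, not on the downloaded lyrics; the third pattern adds `\b` word boundaries, which is one possible way to keep only whole words instead of truncated ones.

```python
import re

# Hypothetical sample built from words that appear in the results above
sample = "bloody good goo jooba"

# \w* takes every remaining word character after the 'oo'
any_tail = re.findall(r"\w+o{2,}\w*", sample)

# \w? takes at most one more character, so longer words are cut short
short_tail = re.findall(r"\w+o{2,}\w?", sample)

# With \b on both sides, words that would be cut short no longer match
whole_words = re.findall(r"\b\w+o{2,}\w?\b", sample)

print(any_tail)
print(short_tail)
print(whole_words)
```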

---
But what if we want to match the same word appearing consecutively? What about words that do not have a consecutive 'oo' in them? We can use the concepts established in 'Sets' and 'Groups'.


# **Sets** 

To match a single character out of a set of candidates, we place the candidate characters between '[' and ']'.

Some frequently used patterns are:


*   **[a-zA-Z]** - Matches any single character from a to z or A to Z
*   **[a-z0-9]** - Matches any single character from a to z or 0 to 9
*   **[a-zA-Z0-9]** - Close to **\w**, except that **\w** also matches the underscore (and, for Unicode strings, non-ASCII letters and digits)
*   **[^ab5]** - With **^** as the first character in **[** **]**, it matches any single character that is not listed (Negation)
<br>
<br>
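
A minimal sketch of the sets listed above, run on a short made-up string. The sample text and variable names are assumptions chosen for illustration.

```python
import re

sample = "Boy, girl_99!"

# [a-zA-Z]+ : runs of ASCII letters only
letters = re.findall(r"[a-zA-Z]+", sample)

# [a-z0-9]+ : runs of lowercase letters and digits (the capital 'B' is left out)
lower_or_digit = re.findall(r"[a-z0-9]+", sample)

# [^a-z\s] : any single character that is neither a lowercase letter nor whitespace
negated = re.findall(r"[^a-z\s]", sample)

# \w+ keeps the underscore, [a-zA-Z0-9]+ splits on it
with_w = re.findall(r"\w+", "girl_99")
with_set = re.findall(r"[a-zA-Z0-9]+", "girl_99")

print(letters)
print(lower_or_digit)
print(negated)
print(with_w, with_set)
```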

# **Groups**

The most difficult to grasp amongst the categories of patterns. Characters placed inside **(** and **)** are grouped together for matching. The parentheses behave differently depending on how the pattern is written.

Some commonly used patterns are:

*    **(ab)** - Matches the exact sequence 'ab' and captures it as a group. This differs from **[ab]+**, which matches any run of 'a' and 'b' characters
*    **(?aiLmsux)** - Inline flags, placed at the start of the pattern. The characters a, i, L, m, s, u, x each set a flag:
    * a - Matches ASCII only
    * i - Ignore case
    * L - Locale dependent
    * m - Multi-line
    * s - Dot matches all characters, including newlines
    * u - Matches unicode
    * x - Verbose
*    **(?:A)** - Non-capturing group. Matches expression A without storing it as a numbered group
*    **A(?=B)** - Positive lookahead. Matches A only if followed by B.
*    **A(?!B)** - Negative lookahead. Matches A only if not followed by B.
*    **(?<=B)A** - Positive lookbehind. Matches A only if B is immediately to its left.
*    **(?<!B)A** - Negative lookbehind. Matches A only if B is not immediately to its left.
*    **(?P=name)** - Matches the same text that was matched by the earlier group named "name"
*    **(...)\1** - Backreference. The number 1 refers to the first group in the pattern. Groups numbered 1 to 99 can be referenced this way to match the same text again, instead of re-writing the whole expression.
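
The group constructs above can be compared side by side on one short line. This is a sketch on a hypothetical string put together from phrases in the lyrics; the variable names are illustrative assumptions.

```python
import re

line = "goo goo g'joob, I am the egg man, they are the egg men"

# Capturing group: findall returns only the captured part
captured = re.findall(r"egg (m[ae]n)", line)

# Non-capturing group: findall returns the whole match
whole = re.findall(r"egg (?:man|men)", line)

# Positive lookahead: a word followed by ' man' (the ' man' is not consumed)
followed_by_man = re.findall(r"\w+(?= man)", line)

# Negative lookahead: 'egg ' followed by a word that is not 'man'
not_man = re.findall(r"egg (?!man)\w+", line)

# Positive lookbehind: a word with 'egg ' immediately to its left
after_egg = re.findall(r"(?<=egg )\w+", line)

# Numbered and named backreferences: the same word twice in a row
repeated = re.search(r"\b(\w+) \1\b", line).group(0)
named = re.search(r"(?P<word>\w+) (?P=word)", line).group("word")

# Inline flag: (?i) at the start of the pattern ignores case
ignore_case = re.findall(r"(?i)i am", "I am he, i AM me")

print(captured, whole)
print(followed_by_man, not_man, after_egg)
print(repeated, named, ignore_case)
```

The first two results show why the choice between `(...)` and `(?:...)` matters with `re.findall`: once a pattern contains a capturing group, only the captured text is returned.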



In [14]:
# Find runs of 2 to 3 consecutive words that each end with 2 or more 'o's...
consecutive_oo_words = re.findall(r'(?:\s\w+o{2,}){2,3}', lyrics)
print("No. of Occurrences: " + str(len(consecutive_oo_words)))
consecutive_oo_words

# Decrypting: (?:\s\w+o{2,}){2,3}
# Where: (?:\s\w+o{2,}) - Non-capturing group that matches a whitespace character,
#                     then a series of word characters ending with 'o' appearing
#                     2 or more times.
#        {2,3} - Match and return results where the group repeats
#                at least 2 times but not more than 3.

No. of Occurrences: 9


[' Goo goo',
 ' Goo goo',
 ' Goo goo',
 ' goo goo goo',
 ' Goo goo',
 ' goo goo goo',
 ' Goo goo',
 ' goo goo goo',
 ' goo Joo']

In [15]:
# Finding words that contain an 'o' not immediately followed by another 'o'...
non_consecutive_oo = list(set(re.findall(r'[^o\(\)\s]+(?=o(?!o))\w+', lyrics)))
print("No. of words with non-consecutive oo: " + str(len(non_consecutive_oo)))
print("\n1st 5 words in the list: ")
non_consecutive_oo[:5]

# Decrypting: [^o\(\)\s]+ (?=o(?!o)) \w+
#     Where: [^o\(\)\s]+ - Match 1 or more characters that are not 'o', '(', ')'
#                          or whitespace
#            (?=o(?!o)) - Lookahead: the next character must be an 'o' that is
#                         not immediately followed by another 'o'
#            \w+ - All word characters thereafter...

No. of words with non-consecutive oo: 48

1st 5 words in the list: 


['policeman', 'long', 'together', 'duteous', 'Thou']

In [16]:
# Finding words that begin with 'co', ignoring case (without a leading \b, a 'co' inside a word would match too)...
re.findall(r'(?i)co\w+', lyrics)

['corn', 'come', 'Corporation', 'come']
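
The inline `(?i)` flag is equivalent to passing `re.IGNORECASE` as an argument. The sketch below, on a made-up sample string, also shows how a leading `\b` restricts the match to the start of a word; 'bacon' is included purely as a hypothetical counter-example.

```python
import re

sample = "corn Corporation bacon come"

# Without a word boundary, the 'con' inside 'bacon' is matched too
anywhere = re.findall(r"(?i)co\w+", sample)

# With \b and the flag passed as an argument, only word starts match
word_start = re.findall(r"\bco\w+", sample, flags=re.IGNORECASE)

print(anywhere)
print(word_start)
```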

In [17]:
# As re.search only returns the first match, we write a simple function that
# parses the lyrics and prints every matching result for the regex pattern
def searchLyrics(pattern, lyrics):
    i = 0
    while i < len(lyrics):
        # Drop the text consumed so far, then search the remainder
        lyrics = lyrics[i:]
        match = re.search(pattern, lyrics)
        if match is not None:
            print(match.group(0))
            i = match.span()[1]
        else:
            break

# In this example, we use groups referenced by their position number in the
# regex pattern. Here we try to find words that appear three times in a row
# in the lyrics:
searchLyrics(r'\b(\w+)\b(\s+\1)\b\2\b', lyrics)

# Decrypting: \b (\w+) \b (\s+\1) \b \2 \b
#     Where: \b - Matches a word boundary
#            (\w+) - Group 1: matches any series of word characters
#            (\s+\1) - Group 2: \1 matches the same text that group 1 matched.
#                      The \s+ in front matches 1 or more whitespace
#                      characters that appear before the word.
#            \2 - Matches the same text that group 2 matched, i.e. the
#                 whitespace and the word once more

goo goo goo
ho ho ho
hee hee hee
hah hah hah
goo goo goo
goo goo goo
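
As an alternative to slicing the string by hand, `re.finditer` yields every non-overlapping match in turn. The helper below is a sketch of that idea; `find_all_matches` and the sample string are hypothetical names introduced here, not part of the notebook. `re.findall` is less convenient for this pattern because, with capturing groups present, it returns the groups instead of the whole match.

```python
import re

def find_all_matches(pattern, text):
    """Return the full text of every non-overlapping match of pattern in text."""
    return [match.group(0) for match in re.finditer(pattern, text)]

# Hypothetical sample built from phrases in the lyrics
sample = "goo goo goo ho ho ho I am the walrus"

# Same pattern as above: a word that appears three times in a row
repeats = find_all_matches(r"\b(\w+)\b(\s+\1)\b\2\b", sample)
print(repeats)
```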


In [18]:
# We can get the same outcome with named groups:
searchLyrics(r'\b(?P<gib>[a-z]+)\b(\s+(?P=gib))\b(\s+(?P=gib))\b', lyrics)

# Decrypting: \b (?P<gib>[a-z]+) \b (\s+(?P=gib)) \b (\s+(?P=gib)) \b
#     Where: \b - Matches a word boundary
#            (?P<gib>[a-z]+) - Matches [a-z]+, which is any series of lowercase
#                              letters, and assigns the group the name 'gib'
#                              with the syntax:
#                              (?P<gib>...)
#            (?P=gib) - Matches the same text that the group named 'gib' matched
#            (\s+(?P=gib)) - Groups the named backreference with \s+ in front
#                            to match the same word with whitespace
#                            in front.

goo goo goo
ho ho ho
hee hee hee
hah hah hah
goo goo goo
goo goo goo
