# "Regular Expressions"

- toc: true 
- badges: true
- comments: true
- categories: [jupyter]
- author: Abhinav Verma


# Introduction


Regex, short for regular expressions, is in many ways a language within a language. It has its own syntax and is used extensively in pattern matching. Many programmers make one of two mistakes with __regex__: they either reach for it on every problem or steer clear of it completely. To be honest, I'm not a fan of regexes. But as a machine learning engineer, you spend a lot of time analyzing and cleaning data to find patterns in it, and sometimes you need to scrape data from websites. These are places where knowledge of regular expressions comes in handy. Hopefully, these notes from my own learning will help a novice navigate the complex world of regex. Even if your job doesn't involve working with data every day, you will come across regex code written by someone else, so a working knowledge of regex is vital.

There are also a ton of libraries that use regex to extract patterns such as URLs from text. I've used some and found them handy, so I'll cover one of them here as well.

# Basic regex using inbuilt functions

Python has a built-in library, __re__, for handling regex operations. But even before reaching for it, there are many ways to match simple patterns using the basic functionality of strings.

In [1]:
s = 'foo123bar'

Suppose you need to find out whether the substring foo exists within s. Strings in Python are iterable over their individual characters, and you can check for a contiguous substring using the __in__ operator.

In [2]:
"foo" in s

True

In [3]:
s.find("foo") #This gives the starting index of the pattern

0

In [4]:
s.find("123")

3
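
One detail worth knowing about `find`: when the substring is absent it returns -1 rather than False. Here is a minimal sketch of that pitfall, reusing the same `s` (the substring "baz" is just an illustrative value I picked):

```python
s = 'foo123bar'

missing_index = s.find("baz")   # -1, because "baz" does not occur in s
present = "baz" in s            # False

# -1 is truthy, so compare the index explicitly instead of writing `if s.find(...)`
found_with_find = s.find("baz") != -1

print(missing_index, present, found_with_find)
```

For a plain yes/no check the `in` operator is usually the safer choice.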

Let's also check for phone numbers now.

Suppose you have several strings, only some of which are phone numbers:

In [10]:
phone1 = "022-64519300"

phone2 = "+919820481234"

not_phone1 = "101 Howard"

In [7]:
import string

In [8]:
string.digits

'0123456789'

In [11]:
def check_phone(inp):
    valid_chars = string.digits + ' -()+'
    for char in inp:
        if char not in valid_chars:
            return False
    return True

In [12]:
check_phone(phone1)

True

In [13]:
check_phone(phone2)

True

In [14]:
check_phone(not_phone1)

False

Above we defined a custom function that iterates over every character in the string and checks that each one is among the characters usually present in phone numbers.

# Enter the Regex

Many simple use cases like the one above can be solved without regex. However, as the problem becomes more complex, simple string matching starts to fail.
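
One way the character check falls short: it accepts any mix of the allowed characters, even a string with no digits at all. The sketch below contrasts it with a regular expression. The pattern and the helper name `check_phone_regex` are my own assumptions for illustration, not a complete phone number validator.

```python
import re
import string

def check_phone(inp):
    # Same idea as the function above: every character must be in the allowed set
    valid_chars = string.digits + ' -()+'
    return all(char in valid_chars for char in inp)

# Hypothetical stricter rule: an optional +, then 8 to 15 digits,
# each digit optionally preceded by a single space or hyphen
PHONE_RE = re.compile(r"\+?\d(?:[ -]?\d){7,14}")

def check_phone_regex(inp):
    return PHONE_RE.fullmatch(inp) is not None

loose_result = check_phone("()-+ ")             # True, although there is no digit at all
strict_result = check_phone_regex("()-+ ")      # False
phone1_ok = check_phone_regex("022-64519300")   # True
phone2_ok = check_phone_regex("+919820481234")  # True
not_phone_ok = check_phone_regex("101 Howard")  # False

print(loose_result, strict_result, phone1_ok, phone2_ok, not_phone_ok)
```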

A really useful resource for getting started with regex is https://regexone.com/

In [15]:
import re #import the re module for handling regular expressions in python

In [16]:
s1,s2,s3,s4,s5,s6 = "can","fan","man","dan","ran","pan"

In [18]:
s1

'can'

Now we need to match the first 3 strings without matching the last 3. So if we look at the strings we can see a pattern. The string 'an' is common across all strings but the letters c,f,m are only present in the first 3 strings.
There is a method for matching specific characters using regular expressions, by defining them inside square brackets. For example, the pattern [abc] will only match a single a, b, or c letter and nothing else.

Let's use the search function in the __re__ module to search for the particular patterns we want

In Python you can write a pattern as an ordinary string or as a raw string using the __r""__ prefix. The prefix tells Python not to treat backslashes as escape sequences, so sequences such as \d reach the regex engine unchanged. It is a good habit to write every pattern as a raw string.
If the search finds the pattern, it returns a Match object, which is truthy; if nothing is found, it returns None.
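
To see what the r prefix actually changes, here is a small sketch. It uses \b, the word-boundary sequence, which is not covered elsewhere in these notes; I picked it only because Python and regex interpret it differently.

```python
import re

# In a normal string "\b" is the single backspace character;
# in a raw string it stays as backslash + b, which regex reads as a word boundary
plain_len = len("\b")    # 1
raw_len = len(r"\b")     # 2

plain_result = re.search("\bfoo\b", "a foo b")   # None: looks for literal backspace characters
raw_result = re.search(r"\bfoo\b", "a foo b")    # Match object for 'foo'

print(plain_len, raw_len, plain_result, raw_result)
```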

In [20]:
re.search(r"[cfm]",s1)  # checks whether the string contains c, f or m; no quotes are needed inside the brackets

<re.Match object; span=(0, 1), match='c'>

In [21]:
re.search(r"[cfm]",s2)

<re.Match object; span=(0, 1), match='f'>

In [23]:
re.search(r"[cfm]",s6) # No match found, so None is returned and nothing is displayed

In [24]:
re.search(r"[cfm]an",s1) # You can also search for c, f or m followed by an

<re.Match object; span=(0, 3), match='can'>
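
To check the character class against all six strings at once, a list comprehension works well. This is just a sketch; the names `words` and `matched` are mine.

```python
import re

words = ["can", "fan", "man", "dan", "ran", "pan"]

# Keep only the words in which c, f or m is followed by "an"
matched = [w for w in words if re.search(r"[cfm]an", w)]

print(matched)  # ['can', 'fan', 'man']
```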

__Excluding specific characters__

In some cases, we might know that there are specific characters that we don't want to match.
To represent this, we use a similar expression that excludes specific characters using the square brackets and the ^ (hat). For example, the pattern [^abc] will match any single character except for the letters a, b, or c.

Looking at the previous problem again, we can reframe it as finding strings in which 'an' is not preceded by d, r or p.

In [27]:
re.search(r"[^drp]an",s2) # the character before 'an' in s2 is none of d, r or p

<re.Match object; span=(0, 3), match='fan'>
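
A quick sketch of the negated class on all six strings. Note that a negated class still has to consume one character, so a bare "an" with nothing in front of it does not match. The names `words` and `kept` are mine.

```python
import re

words = ["can", "fan", "man", "dan", "ran", "pan"]

# Keep words where the character before "an" is anything except d, r or p
kept = [w for w in words if re.search(r"[^drp]an", w)]

# [^drp] must match exactly one character, so "an" on its own fails
bare = re.search(r"[^drp]an", "an")

print(kept)  # ['can', 'fan', 'man']
print(bare)  # None
```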

__Multiple matches__

Suppose we want to match a character that repeats. It can be any letter, but it has to occur 3 to 5 times in a row. For example, you want to find places where people write in casual slang like wasssuppp. One way to check for repeated occurrences is the square-bracket method we used earlier: just repeat the class as many times as you need, e.g.

In [38]:
re.search(r"[z][z][z]","Wazzzup") # search for three consecutive occurrences of z

<re.Match object; span=(2, 5), match='zzz'>

But honestly, how many times can you keep repeating this? Regex provides a way out: add curly braces after the character or class and specify a range for how many repetitions you are looking for.

In [31]:
re.search(r"[z]{2,5}","Wazzzup") # we are searching for between 2 and 5 consecutive occurrences of z

<re.Match object; span=(2, 5), match='zzz'>
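
A few variations on the curly-brace quantifier, as a sketch. The last line uses a backreference (\1), which is not covered in these notes; I include it as one possible way to express "any letter repeated 3 to 5 times", not as the only way.

```python
import re

exact = re.search(r"z{3}", "Wazzzup").group()            # exactly three z
greedy = re.search(r"z{2,5}", "Wazzzzzzzup").group()     # takes as many as allowed, here five
too_few = re.search(r"z{4,}", "Wazzzup")                 # four or more: no match

# ([a-z]) captures one letter and \1{2,4} asks for 2 to 4 more copies of it,
# i.e. the same letter 3 to 5 times in a row
any_letter = re.search(r"([a-z])\1{2,4}", "wasssuppp").group()

print(exact, greedy, too_few, any_letter)
```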

# Regex metacharacters

Regex has a ton of metacharacters. What are metacharacters? In layman's terms, they are characters that have a special meaning inside a pattern instead of matching themselves literally. They look like ordinary characters in Python, but their meaning changes completely. This isn't the most technical definition you'll find; other blogs define it better, and I link them in the references below. But more than definitions, let's look at some examples.

In [32]:
match_1 = "aaaabcc" 
match_2 = "aabbbbc"
skip = "a"

To match the match_{} variables and skip the skip variable, you need to match one or more occurrences of a, followed by one or more b and one or more c.

In [33]:
re.match(r"a+b+c+",match_1)

<re.Match object; span=(0, 7), match='aaaabcc'>

In [34]:
re.match(r"a+b+c+",match_2)

<re.Match object; span=(0, 7), match='aabbbbc'>

In [36]:
re.match(r"a+b+c+",skip) #Nothing

In [37]:
# You can even wrap this in bool
bool(re.match(r"a+b+c+",skip)) 

False

re.search and re.match can be wrapped in bool to give boolean values for matches.

Let's revisit the earlier regex. + matches one or more occurrences of the preceding character. a+ matches one or more a, b+ does the same for b and c+ does the same for c. So the match variables get matched and the skip variable gets skipped.

What if you had to match zero or more b's?

In [40]:
re.match(r"a+b*c+",match_1) # * matches zero or more occurrences

<re.Match object; span=(0, 7), match='aaaabcc'>
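
The difference between + and * only shows up on a string that has no b at all. A minimal sketch, with `no_b` as an illustrative value of my own:

```python
import re

no_b = "aacc"

with_plus = bool(re.match(r"a+b+c+", no_b))   # False: b+ needs at least one b
with_star = bool(re.match(r"a+b*c+", no_b))   # True: b* is satisfied by zero b's

print(with_plus, with_star)
```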

__Optional Matches__

This example is taken from https://regexone.com/lesson/optional_characters? . I personally found this site to be really good practice for regex concepts.

In [44]:
match_1,match_2,skip = "1 file found?","24 files found?", "No files found."

The ? (question mark) metacharacter denotes optionality. It allows you to match either zero or one of the preceding character or group. For example, the pattern ab?c will match either of the strings "abc" or "ac" because the b is considered optional. (Again taken from the site, it's really good.)

In [45]:
match_1

'1 file found?'

So what do we need to match? First, the numbers. A \d matches a single digit, and \d+ matches one or more digits.

But let's go further. To match file we add the literal file. The s is optional, so add a ? after s to indicate zero or one occurrence. A * (zero or more) would also work here, although it would additionally accept repeated s characters. At the end of the string we have a literal ?. Since ? is a metacharacter used to indicate optionality (see what I did there), to match a literal ? you need to escape it with a backslash (\).

In [42]:
re.match(r"\d+files?found\?",match_1) # No match, because the pattern doesn't account for the spaces

In [47]:
re.match(r"\d+\sfiles?\sfound\?",match_1)

<re.Match object; span=(0, 13), match='1 file found?'>

In [49]:
re.match(r"\d+\sfiles*\sfound\?",skip) # no match, as intended: skip does not start with digits
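
To make the difference between ? and * concrete, here is a sketch that runs both patterns over the examples plus one extra string, "3 filesss found?", which I made up for illustration.

```python
import re

pattern_q = r"\d+\sfiles?\sfound\?"     # s is optional: zero or one
pattern_star = r"\d+\sfiles*\sfound\?"  # s may repeat: zero or more

texts = ["1 file found?", "24 files found?", "3 filesss found?", "No files found."]

# For each text, record whether each pattern matches from the start of the string
results = {t: (bool(re.match(pattern_q, t)), bool(re.match(pattern_star, t))) for t in texts}

for text, (q_ok, star_ok) in results.items():
    print(f"{text!r}: ?={q_ok}, *={star_ok}")
```

Only the made-up string separates the two: ? rejects the repeated s while * accepts it.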

__The dot (.)__

The . metacharacter matches any single character except a newline:

In [50]:
re.search('foo.bar', 'fooxbar')

<re.Match object; span=(0, 7), match='fooxbar'>

In [53]:
re.search(r'foo\.bar', 'foo.bar')

<re.Match object; span=(0, 7), match='foo.bar'>
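
The "except a newline" part is easy to forget. A short sketch: by default the dot stops at \n, and the re.DOTALL flag lifts that restriction. The escaped dot, on the other hand, only matches a literal dot.

```python
import re

default_dot = re.search(r"foo.bar", "foo\nbar")             # None: . does not match \n
dotall_dot = re.search(r"foo.bar", "foo\nbar", re.DOTALL)   # Match: . now matches \n too
escaped_dot = re.search(r"foo\.bar", "fooxbar")             # None: \. needs a real dot

print(default_dot, dotall_dot, escaped_dot)
```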

Imagine for example we wanted to match the word "success" in a log file. We certainly don't want that pattern to match a line that says "Error: unsuccessful operation"! That is why it is often best practice to write as specific regular expressions as possible to ensure that we don't get false positives when matching against real world text.

In [57]:
match = "Mission: successful"
skip = "Next Mission: successful upon capture of target"

Here we want to match a string that starts with Mission and ends with successful, with nothing else around it.

In [59]:
re.search(r"^Mission:\s*successful$",match) # ^ anchors the match at the start of the string and
# $ anchors it at the end of the string

<re.Match object; span=(0, 19), match='Mission: successful'>

In [60]:
re.search(r"^Mission:\s*successful$",skip)
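
As a sketch of why the anchors matter: without them the same pattern happily matches inside the longer string. re.fullmatch is another way to require that the whole string matches; this is an alternative I am suggesting, not what the cells above use.

```python
import re

match = "Mission: successful"
skip = "Next Mission: successful upon capture of target"
pattern = r"Mission:\s*successful"   # same pattern without ^ and $

unanchored_skip = bool(re.search(pattern, skip))       # True: found in the middle
fullmatch_match = bool(re.fullmatch(pattern, match))   # True: the whole string matches
fullmatch_skip = bool(re.fullmatch(pattern, skip))     # False: extra text around it

print(unanchored_skip, fullmatch_match, fullmatch_skip)
```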

https://realpython.com/regex-python/

Now that we've started down the path of regex, the link above is a really good resource covering all the metacharacters; we've only covered a few here. The rest will come up below and with practice. Having covered searching, matching and finding patterns, we can move on to the next section.

## Difference between re.search and re.match

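For re.match to work, the pattern must match starting at the beginning of the string, whereas re.search scans the whole string for the first position where the pattern matches.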

In [55]:
re.match(r'foo.bar','afooxbar')

In [56]:
re.search(r'foo.bar','afooxbar')

<re.Match object; span=(1, 8), match='fooxbar'>
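
One more subtlety, sketched below with a two-line string of my own: even with the re.MULTILINE flag, re.match only tries the very start of the string, while re.search with ^ can match at the start of any line.

```python
import re

text = "bar\nfoo"

# re.match is tied to position 0 of the string, regardless of MULTILINE
match_result = re.match(r"foo", text, re.MULTILINE)

# With MULTILINE, ^ also matches right after each newline
search_result = re.search(r"^foo", text, re.MULTILINE)

print(match_result)          # None
print(search_result.span())  # (4, 7)
```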

# Search and Group using Regex

Suppose you have a large text and you want to extract specific parts from it. Say you want to extract the URLs from a text, or only the date. This is where regex groups come in handy. Let's look at a simple example.

Referenced from - https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial

In [61]:
statement = 'Please contact us at: support@datacamp.com' # We need to extract the email address from here

In [62]:
match = re.search(r'([\w\.-]+)@([\w\.-]+)', statement)

The parentheses above capture the matched text in groups, which can then be accessed by index.

In [68]:
match.group() # With no argument this returns the entire match (group 0)

'support@datacamp.com'

In [70]:
match.groups() # These are the sub-groups captured by each pair of parentheses

('support', 'datacamp.com')
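
Individual groups can also be fetched by number, as this short sketch shows. Group 0 is the whole match, and groups 1 and 2 follow the order of the opening parentheses.

```python
import re

statement = 'Please contact us at: support@datacamp.com'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', statement)

whole = match.group(0)   # same as match.group()
user = match.group(1)    # first pair of parentheses
host = match.group(2)    # second pair of parentheses

# start() and end() give the position of a group inside the original string
host_slice = statement[match.start(2):match.end(2)]

print(whole, user, host, host_slice)
```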

The entire match is always available as the main group (group 0). Each pair of parentheses in the pattern additionally defines a subgroup. The groups() function shown above returns all subgroups in a tuple, which can be accessed by index. And just as tuples have a named version (namedtuple), groups can be named too.

Named groups will make your code more readable. The syntax for creating a named group is: (?P<name>...). Replace the name part with the name you want to give to your group.

In [74]:
statement = 'Please contact us at: support@averma12.com'
match = re.search(r'(?P<email>(?P<username>[\w\.-]+)@(?P<host>[\w\.-]+))', statement)
if match:  # check the Match object, not the input string; re.search returns None on failure
    print("Email address:", match.group('email'))
    print("Username:", match.group('username'))
    print("Host:", match.group('host'))

Email address: support@averma12.com
Username: support
Host: averma12.com
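
If you want all the named groups at once, the Match object also has a groupdict() method that returns them as a dictionary. A minimal sketch using the same pattern:

```python
import re

statement = 'Please contact us at: support@averma12.com'
match = re.search(r'(?P<email>(?P<username>[\w\.-]+)@(?P<host>[\w\.-]+))', statement)

# groupdict() maps each group name to the text it captured
parts = match.groupdict() if match else {}

print(parts)
```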


# Find all and substitute Regex

Now imagine a problem where you have a ton of emails, or a text from which you want to extract all the URLs. You want all the matches in an iterable such as a list or an iterator, so that you can store or process them later.

In [76]:
statement = """
1.8GHz dual-core Intel Core i5, Turbo Boost up to 2.9GHz, with 3MB shared L3 cache, Configurable to 2.2GHz dual-core Intel Core i7, Turbo Boost up to 3.2GHz, with 4MB shared L3 cache
"""

In [77]:
statement

'\n1.8GHz dual-core Intel Core i5, Turbo Boost up to 2.9GHz, with 3MB shared L3 cache, Configurable to 2.2GHz dual-core Intel Core i7, Turbo Boost up to 3.2GHz, with 4MB shared L3 cache\n'

We want to extract the numbers, together with their units, from this.

Step 1. Remove the leading and trailing newline \n

In [78]:
statement = statement.strip()

In [79]:
statement

'1.8GHz dual-core Intel Core i5, Turbo Boost up to 2.9GHz, with 3MB shared L3 cache, Configurable to 2.2GHz dual-core Intel Core i7, Turbo Boost up to 3.2GHz, with 4MB shared L3 cache'

In [81]:
re.findall(r"(\d\.?\w+)",statement,re.I)

['1.8GHz', '2.9GHz', '3MB', '2.2GHz', '3.2GHz', '4MB']

The re.findall call above returns all non-overlapping matches of the pattern in the string as a list. re.I is passed as the optional flags argument, which the functions in re accept; re.I means ignore case.
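
Two related behaviours are worth a quick sketch. When the pattern has more than one group, re.findall returns a list of tuples, and re.finditer yields Match objects so you can also get positions. The shortened `statement` and the two-group pattern below are my own illustrative choices.

```python
import re

statement = "1.8GHz dual-core Intel Core i5, Turbo Boost up to 2.9GHz, with 3MB shared L3 cache"

# Two groups: the number and the unit, so findall returns (number, unit) tuples
pairs = re.findall(r"(\d+(?:\.\d+)?)(GHz|MB)", statement)

# finditer gives Match objects, which carry the position of each match
found = [(m.group(), m.start()) for m in re.finditer(r"\d\.?\w+", statement)]

print(pairs)
print(found)
```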

## Find all urls in a string

Let's try to find all the URLs in a string and replace each one with the token '<-URL->'. This uses re.compile to build the pattern and re.sub to substitute every match. This kind of tokenization is often done when preparing text for training text classification models, where URLs are frequently replaced by a placeholder.
Let's write a function to do this and test it on some text.

In [83]:
def replace_urls(in_string):
    """Replace URLs in strings. See also: ``bit.ly/PyURLre``

    Args:
        in_string (str): string to filter

    Returns:
        str
    """
    replacement = '<-URL->'
    # Raw string so that \w reaches the regex engine as written
    pattern = re.compile(r'(https?://)?(\w*[.]\w+)+([/?=&]+\w+)*')
    return re.sub(pattern, replacement, in_string)

Let's look at the pattern

(https?://)? - Looks for http or https followed by the :// that you see in a URL. The s? makes the s optional, and the ? at the end marks the whole group as optional. Putting the pattern in () groups it.
(\w*[.]\w+)+ - The second group matches zero or more word characters (letters, digits or underscore), a literal ., and one or more word characters, repeated one or more times.
([/?=&]+\w+)* - The final group matches one or more of the characters / ? = & (path separators and query parameters) followed by word characters, repeated zero or more times.
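
The breakdown above can also live inside the pattern itself. With the re.VERBOSE flag, whitespace in the pattern is ignored and # starts a comment. This is a sketch of the same pattern written that way; the name `URL_PATTERN` and the sample strings are mine.

```python
import re

URL_PATTERN = re.compile(r"""
    (https?://)?      # optional scheme: http or https followed by ://
    (\w*[.]\w+)+      # dotted parts such as auth.abc.org
    ([/?=&]+\w+)*     # path segments and query parameters
""", re.VERBOSE)

cleaned = URL_PATTERN.sub('<-URL->', "see twitter.com and https://auth.abc.org/user")

# Caveat: anything of the form word.word matches, so a version number is replaced too
false_positive = URL_PATTERN.sub('<-URL->', "version 3.5")

print(cleaned)         # see <-URL-> and <-URL->
print(false_positive)  # version <-URL->
```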

Let's take some example strings

In [84]:
string = "My Profile https://auth.abc.org/user/averma/articles in the portal of http://www.abc.org/. Also you can find me on twitter.com and www.facebook.com ."

In [85]:
replace_urls(string)

'My Profile <-URL-> in the portal of <-URL->/. Also you can find me on <-URL-> and <-URL-> .'

There are some patterns that the above regex doesn't cover, and adding all cases would result in a huge regex.
There are libraries that take care of this, so you can get away with not writing the regex yourself. We'll be using the library urlextract.

In [86]:
#! pip install urlextract

In [87]:
from urlextract import URLExtract

In [90]:
extractor = URLExtract()

In [91]:
extractor.find_urls(string)

['https://auth.abc.org/user/averma/articles',
 'http://www.abc.org/.',
 'twitter.com',
 'www.facebook.com']

In [96]:
def replace_urls(in_string):
    replacement = '<-URL->'
    extractor = URLExtract()
    urls = extractor.find_urls(in_string)
    # Replace every URL reported by urlextract with the placeholder token
    for url in urls:
        in_string = in_string.replace(url, replacement)

    return in_string

In [97]:
replace_urls(string)

'My Profile <-URL-> in the portal of <-URL-> Also you can find me on <-URL-> and <-URL-> .'

# References
There are some amazing sources for learning regex:

https://github.com/fastai/course-nlp/blob/master/4-regex.ipynb - The fastai NLP course has an amazing video on regex

https://realpython.com/regex-python/ - Real Python is just an amazing resource for anything Python

https://regexone.com/ - Amazing resource for beginners to practice regex


This tutorial is intended to be a public resource. As such, if you see any glaring inaccuracies or if a critical topic is missing, please feel free to point it out or (preferably) submit a pull request to improve the tutorial. We are also always looking to improve the scope of this article. For anything else, feel free to email us at colearninglounge@gmail.com

The author is __Abhinav Verma__; you can find him on [LinkedIn](https://linekdin.com/vermaonline)