# "Regular Expressions"

- toc: true 
- badges: true
- comments: true
- categories: [jupyter]
- author: Abhinav Verma


# Introduction


Regex, or regular expressions, is in many ways a language within a language. It has its own syntax and is used extensively for pattern matching. Many programmers make one of two mistakes regarding __regex__: either they overuse it on every problem or they steer clear of it completely. To be honest, I'm not a fan of regexes. But as a machine learning engineer, you spend a lot of time analyzing and cleaning data to find patterns in it, and sometimes you even need to scrape data from sites. These are places where knowledge of regular expressions comes in handy. So hopefully this blog of my notes on learning regex can help a novice navigate the complex world of regex. Even if your job doesn't involve working with data every day, you will come across regex code written by someone else, so a working knowledge of regex is vital.

There are also a ton of libraries that use regex to extract patterns such as URLs from text. I've used some and found them handy, so I'll cover a couple of them here as well.

# Basic regex using inbuilt functions

Python has an inbuilt library, __re__, for handling regex operations. But even before reaching for it, you can match a lot of simple patterns using the basic functionality of strings.

In [1]:
s = 'foo123bar'

Now you need to find whether the string foo exists within s. Strings in Python are iterable over their individual characters, and you can check for a contiguous substring using the __in__ operator.

In [2]:
"foo" in s

True

In [3]:
s.find("foo")  # returns the starting index of the substring, or -1 if it is not found

0

In [4]:
s.find("123")

3
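As a small sketch of how these two checks behave when the substring is missing (the variable names here are only for illustration), `in` gives back a boolean while `find` gives back -1 instead of raising an error:

```python
s = "foo123bar"

# `in` answers a yes/no question about a contiguous substring
has_foo = "foo" in s      # True
has_baz = "baz" in s      # False

# str.find returns the lowest index where the substring starts,
# or -1 when the substring is absent
idx_bar = s.find("bar")   # 6
idx_baz = s.find("baz")   # -1
```

Because -1 is truthy and 0 is falsy, it is safer to compare the result of `find` against -1 than to use it directly in an `if`.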

Let's also check for phone numbers now.

Suppose you have a few strings, some of which are phone numbers:

In [10]:
phone1 = "022-64519300"

phone2 = "+919820481234"

not_phone1 = "101 Howard"

In [7]:
import string

In [8]:
string.digits

'0123456789'

In [11]:
def check_phone(inp):
    valid_chars = string.digits + ' -()+'
    for char in inp:
        if char not in valid_chars:
            return False
    return True

In [12]:
check_phone(phone1)

True

In [13]:
check_phone(phone2)

True

In [14]:
check_phone(not_phone1)

False

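Above we defined a custom function that iterates over every character in the string and checks that it is one of the characters usually present in phone numbers. Note that it only validates the character set: a string such as "+-()" or even an empty string would also pass.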
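One possible way to tighten this without regex is to also require a minimum number of digits. This is only a sketch: `check_phone_strict` and the `min_digits=7` threshold are my own assumptions for illustration, not a real phone number standard.

```python
import string

# Same character set as check_phone above
VALID_CHARS = set(string.digits + " -()+")


def check_phone_strict(inp, min_digits=7):
    """Hypothetical stricter check: valid characters only AND enough digits."""
    # Reject the string as soon as any character falls outside the allowed set
    if not all(char in VALID_CHARS for char in inp):
        return False
    # Count how many of the characters are actual digits
    digit_count = sum(char in string.digits for char in inp)
    return digit_count >= min_digits


results = {
    "022-64519300": check_phone_strict("022-64519300"),
    "+919820481234": check_phone_strict("+919820481234"),
    "101 Howard": check_phone_strict("101 Howard"),
    "+-()": check_phone_strict("+-()"),
    "": check_phone_strict(""),
}
```

Even this version says nothing about where the digits and separators appear, which is exactly the kind of structure regex is good at describing.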

# Enter the Regex

There are a lot of simple use cases like the one above that can be solved without regex. However, as the problem becomes more complex, simple string matching starts to fail.

One really useful resource for getting started with regex is this link:
https://regexone.com/

In [15]:
import re #import the re module for handling regular expressions in python

In [16]:
s1,s2,s3,s4,s5,s6 = "can","fan","man","dan","ran","pan"

In [18]:
s1

'can'

Now we need to match the first 3 strings without matching the last 3. Looking at the strings, we can see a pattern: the substring 'an' is common to all of them, but the letters c, f and m appear only in the first 3.
Regular expressions let you match specific characters by listing them inside square brackets. For example, the pattern [abc] will match a single a, b, or c and nothing else.

Let's use the search function in the __re__ module to search for the particular patterns we want.

In Python, a pattern can be written as an ordinary string or as a raw string, __r""__. The r prefix does not trigger a search by itself; it tells Python to leave backslashes in the string untouched, which matters because regex syntax uses backslashes heavily.
If the search finds the pattern, it returns a Match object, which is truthy. If nothing matches, it returns None.
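Here is a minimal sketch of both points, using the string from earlier. The variable names are just for illustration.

```python
import re

# In a normal string "\n" is a single newline character; in a raw string
# it stays as two characters, a backslash followed by "n".
normal_len = len("\n")   # 1
raw_len = len(r"\n")     # 2

# re.search returns a Match object when the pattern is found somewhere in the string
match = re.search(r"foo", "foo123bar")
if match:
    matched_text = match.group()   # the text that matched
    matched_span = match.span()    # (start, end) indexes of the match

# When nothing matches, re.search returns None, which is falsy
missing = re.search(r"xyz", "foo123bar")
```

Since the result is either a Match object or None, a plain `if match:` is the usual way to branch on whether the pattern was found.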

In [20]:
re.search(r"[cfm]",s1)  # checks whether the string contains c, f or m; quotes inside the brackets would be matched literally, so they are left out

<re.Match object; span=(0, 1), match='c'>

In [21]:
re.search(r"[cfm]",s2)

<re.Match object; span=(0, 1), match='f'>

In [23]:
re.search(r"[cfm]",s6) # no match found, so None is returned and nothing is displayed

In [24]:
re.search(r"[cfm]an",s1) # you can also search for c, f or m followed by "an"

<re.Match object; span=(0, 3), match='can'>
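To see the character class working on all six words at once, here is a small sketch that filters a list (the list and variable names are my own). It also shows that a quote placed inside the brackets is treated as just one more character to match.

```python
import re

words = ["can", "fan", "man", "dan", "ran", "pan"]

# [cfm] matches exactly one character: c, f or m. It must be followed by "an".
pattern = r"[cfm]an"
matched = [w for w in words if re.search(pattern, w)]

# Everything inside [...] is a candidate character, including quotes,
# so this class matches a lone apostrophe as well.
quote_hit = re.search(r"['cfm]", "'")
```

Here `matched` ends up holding only the first three words, and `quote_hit` is a Match object for the apostrophe.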

__Excluding specific characters__

In some cases, we might know that there are specific characters that we don't want to match.
To represent this, we use a similar expression that excludes specific characters using the square brackets and the ^ (hat). For example, the pattern [^abc] will match any single character except for the letters a, b, or c.

Going back to the previous problem, we can reframe it as finding strings in which the character before "an" is not d, r or p.

In [27]:
re.search(r"[^drp]an",s2) # the character before "an" in s2 is not d, r or p

<re.Match object; span=(0, 3), match='fan'>
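The same filtering idea can be sketched with the negated class (again, the names are only illustrative). One thing worth noticing is that the negated class still has to match one character, so a bare "an" with nothing in front of it does not match.

```python
import re

words = ["can", "fan", "man", "dan", "ran", "pan"]

# [^drp] matches any single character except d, r or p
pattern = r"[^drp]an"
kept = [w for w in words if re.search(pattern, w)]

# The negated class must still consume a character before "an",
# so a string that is only "an" gives no match.
bare = re.search(pattern, "an")
```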

__Multiple matches__

Suppose we want to match a repeated character. It can be any letter, but it has to occur between 3 and 5 times in a row in a text. For example, you want to find patterns where people write in casual slang like wasssuppp. One way to check for multiple occurrences is the square brackets method we used earlier: just repeat it as many times as you need. For example:

In [28]:
re.search(r"[z][z][z]","Wazzzup") # search for three consecutive occurrences of lowercase z

<re.Match object; span=(2, 5), match='zzz'>

But honestly, how many times can you keep repeating this? Regex provides a way out: after the character, add curly brackets containing a range that specifies how many repetitions you are looking for.

In [31]:
re.search(r"[z]{2,5}","Wazzzup") # search for z repeated between 2 and 5 times in a row

<re.Match object; span=(2, 5), match='zzz'>
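A few more cases help show how the range behaves. This is a sketch with made-up test strings: the quantifier grabs as many repetitions as it can up to the upper bound, and fails when there are fewer than the lower bound.

```python
import re

# Seven z's in a row: {2,5} is greedy, so it takes five of them and stops
long_text = "Wa" + "z" * 7 + "up"
greedy = re.search(r"z{2,5}", long_text)
greedy_text = greedy.group()
greedy_span = greedy.span()

# Only one z: fewer than the minimum of 2, so there is no match
too_few = re.search(r"z{2,5}", "Wazup")

# Quantifiers can be combined in one pattern, e.g. for the slang example
slang = re.search(r"was{3,5}up{3,5}", "wasssuppp")
slang_text = slang.group()
```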

# Regex metacharacters

Regex has a ton of metacharacters. What are metacharacters? Well, how do I put it in layman's terms: they are characters that have a special meaning inside a pattern instead of matching themselves literally. The square brackets, ^ and curly brackets used above are examples. They look like ordinary characters in a Python string, but their meaning changes completely inside a regex. This isn't the most technical definition you'll find; other blogs have defined it better, and I'll link them below. But more than definitions, let's just start looking at some examples.
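As a first small sketch, here is what happens with the square brackets we have already met when the text itself contains brackets. The sample string is made up for illustration. To match a metacharacter literally you escape it with a backslash, or let `re.escape` do it for you.

```python
import re

text = "price[usd]"

# Unescaped, [usd] is a character class: it matches ONE character, u, s or d
class_match = re.search(r"[usd]", text)
class_text = class_match.group()

# Escaped with backslashes, the brackets are matched as literal characters
literal_match = re.search(r"\[usd\]", text)
literal_text = literal_match.group()

# re.escape builds the escaped pattern for you from a plain string
escaped_match = re.search(re.escape("[usd]"), text)
escaped_text = escaped_match.group()
```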