# Regular Expressions

Regular expressions, also called REs, regexes, or regex patterns, describe a pattern in text so that matching strings can be found in it or extracted from it.
<br> Example 1:
<br> Ctrl + F in MS Word lets a person search for a pattern in a Word document.
<br> Example 2:
<br> Splitting an email id to get the user name and the domain separately is also an application of regex.
<br> When we split the email id at '@', we are already using a very simple regex without realising it.

### How do you know a problem can be solved using regex?
Does the text you want to extract from a large body of text follow a consistent pattern?
<br> If the answer is yes, then a regular expression can help you solve the problem.
<br> The only skill you then need is the ability to construct a proper regex.
<br> That is what you will learn in this chapter.

Steps to solve a problem involving regular expressions:
<br> Identify a pattern.
<br> Convert the pattern into a regular expression.
<br> Test the correctness of the regex.
<br> Perform the necessary operation, e.g. search, match, or extract indices.
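
The steps above might look like this for a small, hypothetical task: pulling a date in YYYY-MM-DD form out of a line of text. The sample text and the variable names are assumptions made only for illustration.

```python
import re

# Step 1: identify the pattern.
# Here: four digits, a hyphen, two digits, a hyphen, two digits.
text = "Backup finished on 2019-03-25 without errors"  # hypothetical sample text

# Step 2: convert the pattern into a regular expression.
date_pattern = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"

# Step 3: test the regex on one string that should match and one that should not.
assert re.search(date_pattern, "2019-03-25") is not None
assert re.search(date_pattern, "25 March 2019") is None

# Step 4: perform the operation you need, here extracting the text and its indices.
found = re.search(date_pattern, text)
print(found.group())  # the matched text
print(found.span())   # (start, end) indices of the match
```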
 

## Approach to writing regular expressions:

There can be multiple ways to write a regular expression.

The regular expression language is relatively small and restricted, so not all possible string processing tasks can be done using regular expressions. There are also tasks that can be done with regular expressions, but the expressions turn out to be very complicated. In these cases, you may be better off writing Python code to do the processing; while Python code will be slower than an elaborate regular expression, it will also probably be more understandable.

We will be using the "re" module in Python.

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.

Read more about how regular expressions are handled in Python at: https://docs.python.org/3.6/library/re.html


## How to write a regular expression?

        "."      Matches any character except a newline.
        "^"      Matches the start of the string.
        "$"      Matches the end of the string or just before the newline at
                 the end of the string.
        "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
                 Greedy means that it will match as many repetitions as possible.
        "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
        "?"      Matches 0 or 1 (greedy) of the preceding RE.
        "\\"     Either escapes special characters or signals a special sequence.
        []       Indicates a set of characters.
                 A "^" as the first character indicates a complementing set.
        "|"      A|B, creates an RE that will match either A or B.

#### Now we are going to import the re module.

In [1]:
import re
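
With the module imported, the special characters listed above can be tried out one at a time. The following is a minimal sketch; the patterns and input strings are made up for illustration and are not part of the original notebook.

```python
import re

# Each entry pairs a special character with the result of a small findall call.
results = {
    ".": re.findall(r"c.t", "cat cot c\nt"),       # any character except a newline
    "^": re.findall(r"^ab", "abab"),               # only at the start of the string
    "$": re.findall(r"ab$", "abab"),               # only at the end of the string
    "*": re.findall(r"ab*", "a ab abbb"),          # zero or more "b"
    "+": re.findall(r"ab+", "a ab abbb"),          # one or more "b"
    "?": re.findall(r"ab?", "a ab abbb"),          # zero or one "b"
    "\\": re.findall(r"1\+1", "1+1 11"),           # backslash makes "+" literal
    "[]": re.findall(r"[aeiou]", "regex"),         # a set of characters
    "[^]": re.findall(r"[^aeiou]", "regex"),       # a complemented set
    "|": re.findall(r"cat|dog", "cat, dog, cow"),  # either alternative
}

for symbol, matches in results.items():
    print(symbol, matches)
```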

### Problem statement:
Let us form a regular expression to validate a mobile number.

### Construction of regex: 
A valid mobile number has 10 characters, and every one of them has to be a decimal digit.
<br> We have to check that each and every character is a digit.
<br> The regex that matches a single digit is [0-9].
<br> This has to be repeated 10 times, once for each character, so the corresponding regex is [0-9]{10}.


In [2]:
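def verifyMobileNo(mobile_no):
    # fullmatch requires the whole string to be exactly 10 digits;
    # re.match would also accept longer strings that start with 10 digits.
    if re.fullmatch(r"[0-9]{10}", mobile_no):
        return 1
    else:
        return 0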

In [3]:
if verifyMobileNo("81273823f8") == 1:
    print("Valid mobile no")
else:
    print("Invalid mobile no")

Invalid mobile no


In [4]:
if verifyMobileNo("8127382398") == 1:
    print("Valid mobile no")
else:
    print("Invalid mobile no")

Valid mobile no
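
One point to keep in mind when validating input: re.match only requires the pattern to match at the start of the string, while re.fullmatch requires the whole string to match. The sketch below shows the difference for the pattern [0-9]{10}; the 14-digit input is a made-up example.

```python
import re

pattern = r"[0-9]{10}"
too_long = "81273823981234"  # 14 digits, hypothetical input

# re.match is satisfied as soon as the first 10 characters are digits.
match_accepts = bool(re.match(pattern, too_long))

# re.fullmatch rejects the string because 4 digits are left over.
fullmatch_accepts = bool(re.fullmatch(pattern, too_long))

# A string of exactly 10 digits passes both checks.
exact_accepts = bool(re.fullmatch(pattern, "8127382398"))

print(match_accepts, fullmatch_accepts, exact_accepts)
```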


### Problem statement:
Let us form a regular expression to validate a mobile number in the format +91-number.

### Construction of regex: 
In this format, mobile numbers in India start with +91, then a hyphen, and then the number.
<br> First we will write +91.
<br> The problem here is that + is a special character, as you can see above.
<br> To show that + is a part of the pattern we need to match, we have to add a backslash (\) before it.
<br> Then comes the hyphen. A hyphen is special only inside a character set such as [0-9], so escaping it here is optional but harmless.
<br> The number itself is one digit between 1 and 9 followed by 9 digits ranging from 0 to 9.
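
To see why the backslash matters, here is a small sketch. An unescaped + at the start of a pattern has nothing to repeat, so the pattern does not even compile. re.escape is a standard helper that adds the backslashes for you; the variable names are illustrative only.

```python
import re

# An unescaped "+" at the start of a pattern is an error: nothing to repeat.
try:
    re.compile("+91")
    unescaped_is_valid = True
except re.error:
    unescaped_is_valid = False
print(unescaped_is_valid)

# Escaped by hand, in a raw string so Python leaves the backslash alone.
# The hyphen is left unescaped here; escaping it outside [] is optional.
hand_escaped = bool(re.match(r"\+91-", "+91-8127382398"))
print(hand_escaped)

# re.escape inserts the backslashes needed to match the text literally.
prefix = re.escape("+91-")
with_helper = bool(re.fullmatch(prefix + r"[1-9][0-9]{9}", "+91-8127382398"))
print(with_helper)
```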

In [5]:
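def verifyMobileNoWithplus91(mobile_no):
    # Raw string so the backslashes reach the regex engine unchanged;
    # fullmatch rejects any extra characters after the number.
    if re.fullmatch(r"\+91\-[1-9][0-9]{9}", mobile_no):
        return 1
    else:
        return 0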

Correct mobile no

In [6]:
if verifyMobileNoWithplus91("+91-8127382398") == 1:
    print("Valid mobile no")
else:
    print("Invalid mobile no")

Valid mobile no


Incorrect mobile no - Mobile number is incomplete (9 digits only)

In [7]:
if verifyMobileNoWithplus91("+91-812738239") == 1:
    print("Valid mobile no")
else:
    print("Invalid mobile no")

Invalid mobile no


Incorrect mobile no - Mobile number does not have +91 before it (the prefix is +9)

In [8]:
if verifyMobileNoWithplus91("+9-8127382398") == 1:
    print("Valid mobile no")
else:
    print("Invalid mobile no")

Invalid mobile no


### Problem statement:
Complex password rules:
<br> The password must contain characters from all four of the following categories:
<br> At least one upper case character should be present.
<br> At least one lower case character should be present.
<br> At least one digit should be present.
<br> At least one special character should be present.
<br> The length of the password should be greater than or equal to 6.

### Construction of regex: 


[A-Z] matches an upper case character. (\W is not a substitute: it matches any character that is not a letter, digit or underscore.)
<br>[a-z] matches a lower case character. (\w is broader: it matches letters of either case, digits and the underscore.)
<br>[^A-Za-z0-9] matches a special character, i.e. any character that is not alphanumeric.
<br>[a-zA-Z0-9]* means any combination of alphanumeric characters (upper and lower case letters and digits), including none at all.
<br>[A-Za-z\d$@$!%*?&#] matches a single alphanumeric character or one of the listed special characters.
    
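You can test your regular expressions at the following website, which also explains each part of the regex below.
<br> https://regex101.com/r/Q7ytgY/1

^(?=[A-Z](?=.+[a-z]))(?=.+\d)(?=.+[$@$!%*?&#])[A-Za-z\d$@$!%*?&#]{6,}

Note that this variant checks [A-Z] at the very first position, so it only accepts passwords that begin with an upper case character. The function below uses .* in every lookahead and therefore accepts the required characters anywhere in the password.

https://regex101.com/r/Q7ytgY/2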

In [9]:
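import re
def check_password(password):
    # Each lookahead checks one rule; fullmatch then requires the whole
    # password to consist of 6 or more characters from the allowed set.
    if re.fullmatch(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[$@$!%*?&#])[A-Za-z\d$@$!%*?&#]{6,}", password):
        return 1
    else:
        return 0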

^ - matches at the start of the string
<br>() - groups part of a pattern
<br>?= - Positive lookahead (https://www.regular-expressions.info/lookaround.html). It works like an if condition: it checks that the text ahead matches without consuming any characters.
<br>. - Dot matches any character except a newline (https://www.regular-expressions.info/dot.html)
<br>* - 0 or more repetitions of the preceding element (greedy)
<br>.* - Matches any sequence of characters, including an empty one
<br>[a-z] - A lower case character

Together, (?=.*[a-z]) means that at least one lower case character must appear somewhere ahead in the string.
The same construction is used for upper case characters, digits and special characters.

In the end, [A-Za-z\d$@$!%*?&#] is the set of all allowed characters: upper case, lower case, digits and the listed special characters.
{6,} means that the overall length has to be at least 6.
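
To see what each lookahead contributes, the rules can be checked one at a time. The sketch below is not part of the original solution: RULES and failed_rules are hypothetical names introduced here, and each rule reuses one piece of the password regex.

```python
import re

# Hypothetical breakdown: one pattern per rule, checked separately.
RULES = {
    "lower case letter": r"(?=.*[a-z])",
    "upper case letter": r"(?=.*[A-Z])",
    "digit": r"(?=.*\d)",
    "special character": r"(?=.*[$@$!%*?&#])",
    "6 or more allowed characters": r"[A-Za-z\d$@$!%*?&#]{6,}$",
}

def failed_rules(password):
    """Return the names of the rules that the password does not satisfy."""
    return [name for name, rule in RULES.items()
            if not re.match(rule, password)]

print(failed_rules("Pass12#"))
print(failed_rules("PPPP1#"))
print(failed_rules("apass12"))
```

A helper like this can be handy while building a regex, because a single combined pattern only tells you that the password failed, not which rule it broke.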

Correct password - All rules followed

In [10]:
if check_password("Pass12#"):
    print("Valid password")
else:
    print("Invalid password")

Valid password


Incorrect password - No lower case character present

In [11]:
if check_password("PPPP1#"):
    print("Valid password")
else:
    print("Invalid password")

Invalid password


Correct password - Multiple special characters (abides by the rules)

In [12]:
if check_password("Pass1%2#"):
    print("Valid password")
else:
    print("Invalid password")

Valid password


Incorrect password - No lower case character present

In [13]:
if check_password("P##$1%2#"):
    print("Valid password")
else:
    print("Invalid password")

Invalid password


Incorrect password - No upper case character present

In [14]:
if check_password("$1%2#ksad"):
    print("Valid password")
else:
    print("Invalid password")

Invalid password


Correct password - All rules followed

In [15]:
if check_password("aPass12#"):
    print("Valid password")
else:
    print("Invalid password")

Valid password


Correct password - All rules followed

In [16]:
if check_password("aPass12#d"):
    print("Valid password")
else:
    print("Invalid password")

Valid password


Incorrect password - Fewer than 6 characters present

In [17]:
if check_password("Pas$1"):
    print("Valid password")
else:
    print("Invalid password")

Invalid password


Incorrect password - No upper case and special characters present

In [18]:
if check_password("apass12"):
    print("Valid password")
else:
    print("Invalid password")

Invalid password


## Various operations on strings using the re module.
1. Find matching pattern.
2. Search for pattern in a string
3. Finding all matches of a pattern
4. Splitting a string, etc

### Find matching pattern

re.match(pattern, string, flags=0)

The match() method checks whether a pattern matches at the beginning of a string.
<br> It does not look for the pattern further along in the string.

In [20]:
students = [{"Name" : "Amit", "Class" : "2019-1", "Roll no" : 2},
           {"Name" : "Monik", "Class" : "2018-2", "Roll no" : 12},
           {"Name" : "Aditya", "Class" : "2019-3", "Roll no" : 25},
           {"Name" : "Zara", "Class" : "2017-3", "Roll no" : 20},
           {"Name" : "Karan", "Class" : "2016-4", "Roll no" : 21}] 

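Find the students whose class ends with 3.
<br> \* means 0 or more times:
<br>
"\*"  Matches 0 or more (greedy) repetitions of the preceding RE.
Greedy means that it will match as many repetitions as possible.
<br> In the pattern below, .* consumes everything before the final 3, and $ makes sure that the 3 is the last character.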

In [21]:
for i in students:
    if re.match(".*3$", i['Class']) : 
        print(i)

{'Name': 'Aditya', 'Class': '2019-3', 'Roll no': 25}
{'Name': 'Zara', 'Class': '2017-3', 'Roll no': 20}


Find the names of students who are in class starting with 2019

In [22]:
for i in students:
    if re.match("^2019", i['Class']) : 
        print(i)

{'Name': 'Amit', 'Class': '2019-1', 'Roll no': 2}
{'Name': 'Aditya', 'Class': '2019-3', 'Roll no': 25}


Find the names of students who are in class starting with 2019 or 2016

In [23]:
for i in students:
    if re.match(r"^2019|^2016", i['Class']) : 
        print(i)

{'Name': 'Amit', 'Class': '2019-1', 'Roll no': 2}
{'Name': 'Aditya', 'Class': '2019-3', 'Roll no': 25}
{'Name': 'Karan', 'Class': '2016-4', 'Roll no': 21}


### Searching for a pattern in a string

re.search(pattern, string, flags=0)

The search() method is useful for searching for a pattern in a string.
<br>It finds the pattern anywhere in the string, unlike match(), which matches the pattern only at the beginning of the string.
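
A short sketch of the difference, using a value in the same format as the "Class" field from the students list (the variable names are illustrative):

```python
import re

class_name = "2019-3"

# match() fails because the string does not start with "3".
with_match = re.match("3", class_name)

# search() succeeds because "3" occurs later in the string.
with_search = re.search("3", class_name)

# Adding "^" anchors search() to the start, so it behaves like match() here.
anchored_search = re.search("^3", class_name)

print(with_match)
print(with_search.span())
print(anchored_search)
```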

In [24]:
poem = """Dearest creature in Creation,
Studying English pronunciation,

I will teach you in my verse
Sounds like corpse, corps, horse and worse.

It will keep you, Susy, busy,
Make your head with heat grow dizzy;

Tear in eye your dress you'll tear.
So shall I! Oh, hear my prayer,"""

In [25]:
re.search("tion", poem)

<_sre.SRE_Match object; span=(24, 28), match='tion'>

The match 'tion' starts at index 24 and ends just before index 28.
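
The match object returned by re.search carries this information in its methods. A minimal sketch on the first line of the poem:

```python
import re

line = "Dearest creature in Creation,"  # first line of the poem
m = re.search("tion", line)

print(m.group())   # the matched text
print(m.start())   # index of the first matched character
print(m.end())     # index just past the last matched character
print(m.span())    # (start, end) as a tuple

# Slicing with start and end gives back the matched text.
print(line[m.start():m.end()])
```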

### Finding all matches of a pattern in a string

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
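
The effect of groups on the return value is easiest to see side by side. In this sketch the email addresses are made up for illustration:

```python
import re

text = "alice@example.com, bob@example.org"  # made-up addresses

# No groups: a list of whole matches.
whole = re.findall(r"\w+@\w+\.\w+", text)

# One group: a list of the strings captured by that group.
users = re.findall(r"(\w+)@\w+\.\w+", text)

# Two groups: a list of tuples, one tuple per match.
pairs = re.findall(r"(\w+)@(\w+\.\w+)", text)

# Repeated matches are all returned, not collapsed into one.
repeated = re.findall("a", "banana")

print(whole)
print(users)
print(pairs)
print(repeated)
```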

In [26]:
re.findall(".*tion",poem)

['Dearest creature in Creation', 'Studying English pronunciation']

In [27]:
re.findall(".*se" ,poem)

['I will teach you in my verse', 'Sounds like corpse, corps, horse and worse']

In [28]:
re.findall(".*y" ,poem)

['Study',
 'I will teach you in my',
 'It will keep you, Susy, busy',
 'Make your head with heat grow dizzy',
 'Tear in eye your dress y',
 'So shall I! Oh, hear my pray']

#### Match with a flag

Flags modify how a pattern is matched.

re.IGNORECASE:
Perform case-insensitive matching; expressions like [A-Z] will also match lowercase letters.

re.MULTILINE:
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.



In [29]:
re.search("teach",poem,re.IGNORECASE)

<_sre.SRE_Match object; span=(70, 75), match='teach'>
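
The search above would find "teach" without the flag as well, because the poem already spells it in lower case. The sketch below uses two lines of the poem where the flags do change the result; the variable names are illustrative.

```python
import re

verse = "Tear in eye your dress you'll tear.\nSo shall I! Oh, hear my prayer,"

# Without the flag only the lower case "tear" is found.
case_sensitive = re.findall("tear", verse)
# With re.IGNORECASE the capitalised "Tear" is found too.
case_insensitive = re.findall("tear", verse, re.IGNORECASE)

# By default "^" matches only at the start of the whole string.
first_word_default = re.findall(r"^\w+", verse)
# With re.MULTILINE "^" also matches at the start of each line.
first_word_multiline = re.findall(r"^\w+", verse, re.MULTILINE)

print(case_sensitive, case_insensitive)
print(first_word_default, first_word_multiline)
```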

## re.search vs re.findall

### re.search

search() returns only the first match found, as a match object, or None if there is no match.

### re.findall

findall() returns a list of all the matches found.
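
A side-by-side sketch on one line of the poem. The pattern \w+se\b (a word ending in "se") is an example chosen for illustration. re.finditer, also part of the re module, is included because it yields a match object for every match, which is useful when you need positions as well as text.

```python
import re

line = "Sounds like corpse, corps, horse and worse."
pattern = r"\w+se\b"  # a word that ends in "se"

# search: only the first match, as a match object.
first = re.search(pattern, line)
print(first.group())

# findall: every match, as a list of strings.
all_words = re.findall(pattern, line)
print(all_words)

# finditer: every match, as match objects with position information.
for m in re.finditer(pattern, line):
    print(m.group(), m.span())
```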

### Splitting a string

re.split(pattern, string, maxsplit=0, flags=0)

Split the source string by the occurrences of the pattern, returning a list containing the resulting substrings.
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

In [30]:
re.split('@',"myname@domain.com")

['myname', 'domain.com']
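
The maxsplit argument and capturing parentheses described above can be sketched as follows. The record string is a hypothetical example with mixed separators.

```python
import re

record = "2019-3;Aditya, 25"  # hypothetical record with mixed separators

# Split on ";" or ",", swallowing any spaces after the separator.
fields = re.split(r"[;,]\s*", record)

# maxsplit=1 stops after the first split; the rest stays in one piece.
first_and_rest = re.split(r"[;,]\s*", record, maxsplit=1)

# With capturing parentheses the separator itself is kept in the result.
with_separator = re.split(r"(@)", "myname@domain.com")

print(fields)
print(first_and_rest)
print(with_separator)
```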

To read and explore more, you can check the built-in help for the module, shown below.

In [31]:
help(re)

Help on module re:

NAME
    re - Support for regular expressions (RE).

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.
    
    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last matches the string 'last'.
    
    The special characters are:
        "."      Matches any character except a newline.
        "^"      Matches the start of the string.
        "$"      Matches the end of the string or just before the newline at
                 the end of the string.
        "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
                 Greedy 

As you can see in help(re), there are various special characters in re.
<br> We have gone over a few of them.

To test and experiment with more regular expressions visit regex101.com

# Summary

You learnt:
1. How to form regular expressions
2. How to use regular expressions in re methods
3. The difference between re.search and re.findall

Now that you have learnt about regular expressions,
<br>let's get some hands-on practice by solving the assignment and applying regular expressions to FIFA World Cup data.

**All the best!!**