# Understanding Regex

As a software developer, you have probably encountered regular expressions many times and been confused by a daunting set of characters grouped together like this:

<img src=".\Images\12.png">

And you may have wondered what this is all about.

Regular expressions (RegEx or RegExp) are very useful for stepping up your algorithm game and will make you a better problem solver. The structure of a RegEx can be intimidating at first, but it is very rewarding once you learn the patterns and apply them properly in your work.


## What is RegEx and why is it important?

A RegEx, or regular expression, is a sequence of characters that defines a search pattern. It helps you extract information from string data by searching through text to find what you need. Whether it's punctuation, numbers, letters, or even white space, RegEx lets you check and match any combination of characters in strings.

For example, suppose you need to match the format of email addresses or security numbers. You can use RegEx to check for the pattern inside text strings and to replace the matched text with another substring.

For instance, a RegEx could tell a program to search for specific text in a string and then print the result. Expressions can include text matching, repetition, branching, and pattern composition.



### RegEx Syntax

    import re

- The *re* library in Python is used for string searching and manipulation.
- It is also used frequently for web scraping.

#### Example for \w+ and ^ Expressions

- *^:* This expression matches the start of a string.
- *\w+:* This expression matches one or more alphanumeric characters (letters, digits, or the underscore).

Here is one example of how you can use the "\w+" and "^" expressions in code. re.findall is covered in a later part, so just focus on the "\w+" and "^" expressions.

Take the string "Shivam23, Data Science Teaching Consultant": matching the pattern "^\w+" against it gives "Shivam23" as the result.
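A minimal sketch of the example just described, assuming the sample string is stored in a variable called `text` (a name chosen here for illustration):

```python
import re

# Sample string from the example above; the variable name is illustrative.
text = "Shivam23, Data Science Teaching Consultant"

# ^ anchors the match at the start of the string, and \w+ consumes
# letters, digits and underscores until the first non-word character.
first_word = re.findall(r"^\w+", text)
print(first_word)
```

Without the `^`, the same call would return every word in the string rather than only the first one.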

## Use RegEx methods

The "re" package provides several methods to perform queries on an input string. We will look at the following methods:

    re.match()
    re.search()
    re.findall()
    
**Note:** Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string.
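The difference can be seen by running the same pattern through both functions. This is a small sketch; the sentence and variable names are illustrative:

```python
import re

sentence = "We will learn Regular Expression today."

# re.match only tries the pattern at position 0 of the string.
at_start = re.match(r"learn", sentence)

# re.search scans forward until the pattern matches somewhere.
anywhere = re.search(r"learn", sentence)

print(at_start)            # nothing matched at the start
print(anywhere.group())    # the text that re.search found
```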


### Character sets
| Pattern  | Matches                                            |
|----------|----------------------------------------------------|
| [abc]    | Either an a, b, or c character                     |
| [abcABC] | Either an a, A, b, B, c, or C character            |
| [a-z]    | Any single lowercase letter from a to z, inclusive |
| [A-Z]    | Any single uppercase letter from A to Z, inclusive |
| [a-zA-Z] | Any single letter from a to z, in either case      |
| [0-9]    | Any single digit from 0 to 9                       |
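As a quick illustration of the table above, the sketch below applies three of the character sets to a made-up string (the string and variable names are assumptions for the example):

```python
import re

sample = "Order A7 shipped to Zone b3"

# Each set matches exactly one character per match.
lowercase = re.findall(r"[a-z]", sample)
uppercase = re.findall(r"[A-Z]", sample)
digits = re.findall(r"[0-9]", sample)

print(uppercase)
print(digits)
print(len(lowercase))
```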

### Meta sequences

| Pattern  | Equivalent to    |
|----------|------------------|
| \s       | [ \t\n\r\f\v]    |
| \S       | [^ \t\n\r\f\v]   |
| \d       | [0-9]            |
| \D       | [^0-9]           |
| \w       | [a-zA-Z0-9_]     |
| \W       | [^a-zA-Z0-9_]    |
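One caveat worth knowing: in Python 3, when the pattern is a `str`, sequences such as `\w`, `\d` and `\s` also match Unicode characters, so the equivalences in the table hold exactly only for ASCII text or when the `re.ASCII` flag is passed. A small sketch, using an illustrative string:

```python
import re

# "caf\u00e9_24" is "café_24"; the accented letter is outside [a-zA-Z0-9_].
word = "caf\u00e9_24"

# Default (Unicode) behaviour: the accented letter counts as a word character.
unicode_match = re.findall(r"\w+", word)

# With re.ASCII, \w behaves like [a-zA-Z0-9_] and the match is split.
ascii_match = re.findall(r"\w+", word, flags=re.ASCII)

print(unicode_match)
print(ascii_match)
```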

#### 1. re.match(pattern, string)
*The re.match function returns a match object on success and None on failure.*

In [1]:
import re

In [38]:
# match a word at the beginning of a string

result = re.match('We', 'We will learn Regular Expression today.')
print(result)

<re.Match object; span=(0, 2), match='We'>


In [33]:
print(result.group())

We


In [37]:
result_2 = re.match('largest','We will learn Regular Expression today.')
print(result_2)

None


#### 2. re.search(pattern, string)
*Matches the first occurrence of a pattern in the entire string (and not just at the beginning).*

In [40]:
# search for the pattern "founded" in a given string

result = re.search('founded',r'Andrew NG founded Coursera. He also founded deeplearning.ai')
print(result)

<re.Match object; span=(10, 17), match='founded'>


In [41]:
print(result.group())

founded


#### 3. re.findall(pattern, string)
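*It returns all non-overlapping occurrences of the pattern in the string as a list. It is convenient when you want every match at once, but note that it returns plain strings rather than match objects, so re.search() or re.match() is still the better choice when you need positions or groups.*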

In [42]:
result = re.findall('founded',r'Andrew NG founded Coursera. He also founded deeplearning.ai')
print(result)

['founded', 'founded']
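When the positions of all matches are needed as well, `re.finditer` yields a match object per occurrence. A brief sketch on the same sentence (the `spans` name is illustrative):

```python
import re

text = "Andrew NG founded Coursera. He also founded deeplearning.ai"

# Collect the matched text together with its start and end offsets.
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"founded", text)]

for word, start, end in spans:
    print(word, start, end)
```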


#### 4. re.sub(pattern, repl, string)
*This function replaces all occurrences of the RegEx pattern with the replacement string.*

In [80]:
input_str = "Hey, Are you excited??, After a 7 days, we 56 will be in Shimla!!!"

In [81]:
re.sub(r'\W+', ' ', input_str)

'Hey Are you excited After a 7 days we 56 will be in Shimla '
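Two small variations on the call above, sketched under the assumption that a trailing space is unwanted: `.strip()` removes it, and the optional `count` argument of `re.sub` limits how many replacements are made.

```python
import re

input_str = "Hey, Are you excited??, After a 7 days, we 56 will be in Shimla!!!"

# Replace every run of non-word characters, then trim the ends.
cleaned = re.sub(r"\W+", " ", input_str).strip()

# count=1 replaces only the first run (the ", " after "Hey").
first_only = re.sub(r"\W+", " ", input_str, count=1)

print(cleaned)
print(first_only)
```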

### Special Sequences in Regular Expressions
#### 1. \b
*\b matches at a word boundary, that is, where the specified pattern is at the beginning or at the end of a word.*

In [43]:
st = 'LetsUpgrad is the largest and smartest Ed-tech platform in India'

In [46]:
x = re.findall(r'\blar', st)
print(x)

['lar']


In [45]:
x = re.findall(r'\b\w+est\b', st)
print(x)

['largest', 'smartest']
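To see what `\b` changes, the sketch below runs the same literal with and without boundaries on a made-up string:

```python
import re

text = "cat concatenate category cat."

# Without boundaries, "cat" is also found inside longer words.
anywhere = re.findall(r"cat", text)

# With \b on both sides, only the standalone word matches.
whole_words = re.findall(r"\bcat\b", text)

print(len(anywhere), len(whole_words))
```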


#### 2. \d
\d matches a single digit character (0-9).

In [48]:
s1 = '1 million monthly visit in Jan 24 and the code is 6789'

In [49]:
x = re.findall(r"\d", s1)
print(x)

['1', '2', '4', '6', '7', '8', '9']


In [50]:
x = re.findall(r"\d+", s1)
print(x)

['1', '24', '6789']


#### 3. \D
\D matches any character that is not a digit. It is the opposite of \d.

In [51]:
s1 = '1 million monthly visit in Jan 24 and the code is 6789'

In [52]:
x = re.findall(r"\D+", s1)
print(x)

[' million monthly visit in Jan ', ' and the code is ']


#### 4. \w
\w matches alphanumeric characters only (letters from a to z and A to Z, digits from 0-9, and the underscore _ character).

In [53]:
s2 = 'Shivam is a Data Science @Consultant at ? LetsUpgrad #development and he working over there since 2022!!'

In [54]:
x = re.findall(r'\w+', s2)
print(x)

['Shivam', 'is', 'a', 'Data', 'Science', 'Consultant', 'at', 'LetsUpgrad', 'development', 'and', 'he', 'working', 'over', 'there', 'since', '2022']


#### 5. \W
\W matches every non-alphanumeric character. It is the opposite of \w.

In [55]:
s2 = 'Shivam is a Data Science @Consultant at ? Upgrad #development and he working over there since 2022!!'

In [56]:
x = re.findall(r'\W+', s2)
print(x)

[' ', ' ', ' ', ' ', ' @', ' ', ' ? ', ' #', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '!!']


### Metacharacters in Regular Expression

Metacharacters are characters with a special meaning.

#### 1- (.) matches any character (except newline character)

In [57]:
s1 = "shivam and shivani recently published a research paper."

In [61]:
x = re.findall("shi....", s1)
print(x)

['shivam ', 'shivani']
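Two details about the dot are worth a sketch: it skips newlines unless `re.DOTALL` is passed, and a literal period has to be escaped as `\.`. The strings below are illustrative:

```python
import re

two_lines = "shi\nvam"

# By default "." does not match the newline between the two parts.
default = re.findall(r"shi.vam", two_lines)

# re.DOTALL lets "." match newlines as well.
dotall = re.findall(r"shi.vam", two_lines, flags=re.DOTALL)

# "\." matches only a real period, not any character.
literal_dot = re.findall(r"paper\.", "a research paper.")

print(default, dotall, literal_dot)
```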


#### 2- (^) starts with
*It checks whether the string starts with the given pattern or not.*

In [62]:
s2 = "Data Science Program will be launched in coming days"

In [65]:
x = re.findall("^Data", s2)
print(x)

['Data']


#### 3- ($) ends with
*It checks whether the string ends with the given pattern or not.*

In [66]:
s3 = "Data Science Program"

In [67]:
x = re.findall("Program$", s3)
print(x)

['Program']
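By default `^` and `$` refer to the start and end of the whole string. With the `re.MULTILINE` flag they also match at the start and end of each line, as this sketch on an illustrative two-line string shows:

```python
import re

lines = "Data Science Program\nData Analytics Program"

# Default: ^ matches only at the very start of the string.
starts_default = re.findall(r"^Data", lines)

# MULTILINE: ^ and $ also match at each line boundary.
starts_multiline = re.findall(r"^Data", lines, flags=re.MULTILINE)
ends_multiline = re.findall(r"Program$", lines, flags=re.MULTILINE)

print(starts_default, starts_multiline, ends_multiline)
```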


#### 4- (*) matches for zero or more occurrences of the pattern to the left of it

In [69]:
s4 = "easy easssy eay ey easssssssy"

In [70]:
x = re.findall('eas*y', s4)  # matches 'ea', then zero or more 's', then 'y'
print(x)

['easy', 'easssy', 'eay', 'easssssssy']


#### 5- (+) matches one or more occurrences of the pattern to the left of it


In [71]:
s4 = "easy easssy eay ey"

In [72]:
x = re.findall('eas+y', s4)
print(x)

['easy', 'easssy']


#### 6- (?) matches zero or one occurrence of the pattern to the left of it

In [73]:
s5 = "easy easssy eay ey"

In [74]:
x = re.findall('eas?y', s5)
print(x)

['easy', 'eay']


#### 7- (|) either or

In [75]:
sttr = 'LetsUpgrad is the largest Ed-Tech Platform in India'

In [78]:
x = re.findall("LetsUpgrad|US", sttr)

In [79]:
x

['LetsUpgrad']
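Alternation has low precedence, so it applies to everything on each side of the `|` unless a group limits it. A short sketch with an illustrative string:

```python
import re

text = "gray grey groy"

# The non-capturing group (?:a|e) restricts the choice to one letter.
grouped = re.findall(r"gr(?:a|e)y", text)

# Without the group, the alternatives are "gra" and "ey".
ungrouped = re.findall(r"gra|ey", text)

print(grouped)
print(ungrouped)
```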

### Examples and Use Cases in Regular Expressions

**1. Extracting the phone number**

In [39]:
phn = 'My contact number is 222-446-7880 and my friend contact number is 879-354-15235'

if re.search(r"\d{3}-\d{3}-\d{4}", phn):
    print("It is verified")
else:
    print("Incorrect Number")

It is verified


In [40]:
re.findall(r"\d{3}-\d{3}-\d{4}", phn)

['222-446-7880', '879-354-1523']
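Note that the second result comes from a number whose last block has five digits; the pattern simply stops after four. If such numbers should be rejected, one possible approach (a sketch, not the only option) is to add word boundaries:

```python
import re

phn = 'My contact number is 222-446-7880 and my friend contact number is 879-354-15235'

# \b on both sides requires the number to start and end at a word boundary,
# so the trailing "5" in "15235" prevents a match.
strict = re.findall(r"\b\d{3}-\d{3}-\d{4}\b", phn)
print(strict)
```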

**2. Validating Email Addresses**

In [8]:
ids = 'Shivam@gmail.com abhijit45@gmail.com tanya78@yahoo.com shivam@.com'

r2 = re.findall(r'[\w._%]{1,20}@[\w-]+\.[A-Za-z]{2,3}', ids)

In [9]:
r2

['Shivam@gmail.com', 'abhijit45@gmail.com', 'tanya78@yahoo.com']
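For validating one address at a time rather than extracting from a longer text, `re.fullmatch` requires the whole string to fit the pattern. The pattern below is a simplified assumption for illustration and does not cover every valid address format:

```python
import re

# Simplified pattern, for illustration only.
EMAIL = re.compile(r"[\w.%+-]+@[\w-]+(?:\.[\w-]+)*\.[A-Za-z]{2,}")

ids = 'Shivam@gmail.com abhijit45@gmail.com tanya78@yahoo.com shivam@.com'

# Map each candidate address to True/False.
results = {addr: EMAIL.fullmatch(addr) is not None for addr in ids.split()}

for addr, ok in results.items():
    print(addr, ok)
```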

**3. Extracting Dates & Time**

In [10]:
date = 'My birth date is on 23-07-1997'

re.findall(r"\d{2}-\d{2}-\d{4}", date)

['23-07-1997']
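If the individual parts of the date are needed, named groups make them easy to pull out. A minimal sketch, assuming the day-month-year order used in the example; the group names are chosen here for illustration:

```python
import re

date = 'My birth date is on 23-07-1997'

# (?P<name>...) captures a group that can be read back by name.
m = re.search(r"(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})", date)
parts = m.groupdict() if m else {}

print(parts)
```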

**4. Cleaning Text Data**

In [11]:
inp = "Hey, Are you excited??, After a 2 days, we will be in Shimla for 7 days!!!"

In [12]:
re.sub('[^a-zA-Z]', ' ', inp)

'Hey  Are you excited    After a   days  we will be in Shimla for   days   '
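The result above keeps one space per removed character. Adding `+` to the character class replaces each run with a single space, and `.strip()` trims the ends; this is one possible refinement, sketched below:

```python
import re

inp = "Hey, Are you excited??, After a 2 days, we will be in Shimla for 7 days!!!"

# Collapse each run of non-letters into one space, then trim.
compact = re.sub(r"[^a-zA-Z]+", " ", inp).strip()
print(compact)
```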

### NLP Application with Regex
#### Tokenization

In [14]:
text = 'Walking, Talking, and Coding are interesting activities!!'

In [15]:
x = re.findall(r'\b\w+ing\b', text)
print(x)

['Walking', 'Talking', 'Coding', 'interesting']


In [16]:
tokens = re.split(r'\s+', text)
print(tokens)

['Walking,', 'Talking,', 'and', 'Coding', 'are', 'interesting', 'activities!!']


In [18]:
tokens[3]

'Coding'

#### Remove the special characters and punctuation

In [17]:
re.sub(r'[^\w\s]', '', text)

'Walking Talking and Coding are interesting activities'
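Splitting on whitespace leaves punctuation attached to words, and removing punctuation discards it entirely. A third option, sketched here as an assumption about what a simple tokenizer might want, keeps words and punctuation marks as separate tokens:

```python
import re

text = 'Walking, Talking, and Coding are interesting activities!!'

# \w+ matches a whole word; [^\w\s] matches one punctuation character.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```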

### ***Q. Write a regular expression that matches a string where '45' occurs one or more times followed by the occurrence of '37' one or more times.***

In [23]:
sample = ['4537', '45453737', '454545453737', '45', '37', '45537', '445537']

In [24]:
pattern = '(45)+(37)+'

for i in sample:
    r = re.search(pattern, i)
    if r is not None:
        print(True)
    else:
        print(False)

True
True
True
False
False
False
False
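Because `re.search` looks anywhere in the string, a value with extra characters around a valid core would also print True. If the whole string must consist of the pattern, `re.fullmatch` is one way to enforce that. The sample list below is illustrative:

```python
import re

# Non-capturing groups repeat "45" and "37" as units.
pattern = re.compile(r"(?:45)+(?:37)+")

samples = ['4537', '45453737', 'x4537y', '45537']

search_hits = [bool(pattern.search(s)) for s in samples]
full_hits = [bool(pattern.fullmatch(s)) for s in samples]

print(search_hits)
print(full_hits)
```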


### A few more examples

#### Using re.match()

The match function matches the RegEx pattern against the start of a string, with optional flags. Here, "\w+" and "\W" are used to match two words that both start with "i", separated by a single non-word character; any string whose two words do not both start with "i" is not matched. To check each element in the list, we run a for loop.

In [54]:
li = ['icecream images', 'inner peace', 'shivam singh']

for i in li:
    q = re.match(r"(i\w+)\W(i\w+)", i)
    
    if q:
        print(q.groups())

('icecream', 'images')


#### Finding Pattern in the text(re.search())

A RegEx is commonly used to search for a pattern in text. This method takes a RegEx pattern and a string and searches for that pattern within the string.

To use the re.search() function, you need to import re first. The search() function takes the "pattern" and the "text" to scan, and returns a match object when the pattern is found, or None otherwise.

In [59]:
phn = 'My contact number is 242-449-7890'

if re.search(r"\d{3}-\d{3}-\d{4}", phn):
    print("It is verified")
else:
    print("Incorrect Number")

It is verified
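The match object returned by `re.search` also carries the matched text and its position. A brief sketch on the same string:

```python
import re

phn = 'My contact number is 242-449-7890'

m = re.search(r"\d{3}-\d{3}-\d{4}", phn)
if m:
    # group() is the matched text; start() and end() are its offsets.
    print(m.group(), m.start(), m.end())
    print(phn[m.start():m.end()])
```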


#### Using re.findall() for text

We use re.findall() when we want to list all the matches in one go, for example while iterating over the lines of a file. In this example, we want to fetch all the email addresses from a string, so we use the re.findall() method.

In [85]:
email = "shivam96@gmail.com shivani@.com abhinav67@yahoo.com"

In [86]:
p = re.findall(r"[\w._%]{1,20}@[\w-]+\.[A-Za-z]{2,3}", email)

In [87]:
p

['shivam96@gmail.com', 'abhinav67@yahoo.com']

In [65]:
k = 'Shivam@upgrad.com abhijit45@gmail.com tanya78@yahoo.com'

In [66]:
r2 = re.findall(r'[\w\.-]+@[\w\.-]+', k)

In [67]:
r2

['Shivam@upgrad.com', 'abhijit45@gmail.com', 'tanya78@yahoo.com']

### Greedy vs Non Greedy quantifiers

#### Greedy Match –
*A greedy match in regular expression tries to match as many characters as possible.*

*For example, [0-9]+ will try to match as many digits as possible. It never gets enough. It's too greedy.*

In [10]:
import re
re.findall('[0-9]+', '12345678910')

['12345678910']

In [11]:
# zero or more occurrences
re.findall('[0-9]*', '12345678910')

['12345678910', '']

In [12]:
# one or more occurrences, made non-greedy with ?: matches as few digits as possible
re.findall('[0-9]+?', '12345678910')

['1', '2', '3', '4', '5', '6', '7', '8', '9', '1', '0']

In [5]:
# zero or one occurrences
re.findall('[0-9]?', '12345678910')

['1', '2', '3', '4', '5', '6', '7', '8', '9', '1', '0', '']

In [6]:
# exactly 5 occurrences
re.findall('[0-9]{5}', '12345678910')

['12345', '67891']

In [7]:
# at least 2 but not greater than 5
re.findall('[0-9]{2,5}', '12345678910')

['12345', '67891']

#### Non Greedy Match –
A non-greedy match tries to match as few characters as possible.

In [8]:
re.findall('[0-9]*', '12345678910')

['12345678910', '']

In [9]:
re.findall('[0-9]*?', '12345678910')

['',
 '1',
 '',
 '2',
 '',
 '3',
 '',
 '4',
 '',
 '5',
 '',
 '6',
 '',
 '7',
 '',
 '8',
 '',
 '9',
 '',
 '1',
 '',
 '0',
 '']
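The difference matters most when a pattern has a closing delimiter. In the sketch below (the HTML snippet is illustrative), the greedy version runs to the last `>`, while the non-greedy version stops at the first one:

```python
import re

html = "<b>bold</b>"

# Greedy: .* grabs as much as possible, so the match spans both tags.
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first ">" it can, giving one match per tag.
lazy = re.findall(r"<.*?>", html)

print(greedy)
print(lazy)
```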