
## Python RegEx
A Regular Expression (RegEx) is a sequence of characters that defines a search pattern. For example:

In [1]:
import re
pattern = r'^a...s$'      # any five-character string starting with 'a' and ending with 's'
test_string = 'abczz'
print(1, re.match(pattern, test_string))  # re.match(pattern, string): the pattern comes first
test_string = 'abcss'
print(2, re.match(pattern, test_string))

1 None
2 <_sre.SRE_Match object; span=(0, 5), match='abcss'>


### Specify Pattern Using RegEx

To specify regular expressions, metacharacters are used. In the above example, ^, . and $ are metacharacters.

Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's a list of metacharacters:

[] . ^ $ * + ? {} () \ |

### 1. [] - Square brackets

Square brackets specify a set of characters you wish to match. A set always matches exactly one character; a leading ^ inside the brackets inverts the set.

In [2]:
pat = "India"
print(re.match("[I-a]",pat))     # first character lies in the range 'I' to 'a' (by character code)
print(re.match("[^India]",pat))  # first character is none of I, n, d, i, a
pat = "12476"
print(re.match("[1-6]",pat))     # first character is a digit from 1 to 6
print(re.match("[^0-9]",pat))    # first character is not a digit

<_sre.SRE_Match object; span=(0, 1), match='I'>
None
<_sre.SRE_Match object; span=(0, 1), match='1'>
None
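
To make the behaviour of sets, ranges and negation easier to see, here is a minimal sketch that collects every matching character with re.findall() (covered later in this notebook). The sample string and variable names are made up for illustration.

```python
import re

sample = "Room 42, Block B-7"  # made-up sample text

# [0-9] matches one digit; findall collects every such character
digits = re.findall(r"[0-9]", sample)

# [A-Z] matches one uppercase ASCII letter
uppers = re.findall(r"[A-Z]", sample)

# [^a-zA-Z0-9 ] matches one character that is NOT a letter, digit or space
others = re.findall(r"[^a-zA-Z0-9 ]", sample)

print(digits)  # ['4', '2', '7']
print(uppers)  # ['R', 'B', 'B']
print(others)  # [',', '-']
```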


### 2. . - Period

A period matches any single character except a newline.

In [None]:
pat = "India"
print(re.match("I...a",pat)) # 'I' + any three characters + 'a' (inside [] a period would be a literal '.')
pat = "abc"
print(re.match("a.c",pat)) # 'a' + any one character + 'c'

<_sre.SRE_Match object; span=(0, 5), match='India'>
<_sre.SRE_Match object; span=(0, 3), match='abc'>


### 3. ^ - Caret

The caret symbol ^ is used to check if a string starts with a certain character.

In [None]:
pat = "green"
print(re.search("^g",pat)) # string starts with 'g'
pat = "red"
print(re.search("^r",pat)) # string starts with 'r'

<_sre.SRE_Match object; span=(0, 1), match='g'>
<_sre.SRE_Match object; span=(0, 1), match='r'>


### 4. $ - Dollar

The dollar symbol $ is used to check if a string ends with a certain character.

In [25]:
pat = "i like green"
print(re.search(r"green$",pat)) # string ends with 'green'
pat = "red"
print(re.search("d$",pat)) # string ends with 'd'

<_sre.SRE_Match object; span=(7, 12), match='green'>
<_sre.SRE_Match object; span=(2, 3), match='d'>


### 5. * - Star

The star symbol * matches zero or more occurrences of the pattern to its left.

In [29]:
pat = "Green"
print(re.search("G*n",pat)) # zero or more 'G' followed by 'n': only the final 'n' matches
pat = "Red"
print(re.search("Ra*d",pat)) # 'R', zero or more 'a', then 'd': the 'e' in "Red" prevents a match

<_sre.SRE_Match object; span=(4, 5), match='n'>
None


### 6. + - Plus

The plus symbol + matches one or more occurrences of the pattern to its left.

In [32]:
pat = "Green"
print(re.search("e+e",pat)) # one or more 'e' followed by 'e': matches 'ee'
pat = "Green"
print(re.search("r+n",pat)) # one or more 'r' followed by 'n': "Green" has no 'rn'

<_sre.SRE_Match object; span=(2, 4), match='ee'>
None


### 7. ? - Question Mark

The question mark symbol ? matches zero or one occurrence of the pattern to its left.

In [None]:
pat = "aa"
print(re.match("a?a",pat)) # optional 'a' followed by 'a'
pat = "abcd"
print(re.match("a?d",pat)) # optional 'a' followed by 'd' at the start: 'b' follows 'a', so no match

<_sre.SRE_Match object; span=(0, 2), match='aa'>
None
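
The three quantifiers *, + and ? differ only in how many repetitions they allow. As a rough side-by-side sketch (the word list is made up, and re.fullmatch() is used here so that the whole word has to fit the pattern):

```python
import re

words = ["color", "colour", "colouur"]  # made-up test words

# u? : zero or one 'u'
zero_or_one = [w for w in words if re.fullmatch(r"colou?r", w)]

# u* : zero or more 'u'
zero_or_more = [w for w in words if re.fullmatch(r"colou*r", w)]

# u+ : one or more 'u'
one_or_more = [w for w in words if re.fullmatch(r"colou+r", w)]

print(zero_or_one)   # ['color', 'colour']
print(zero_or_more)  # ['color', 'colour', 'colouur']
print(one_or_more)   # ['colour', 'colouur']
```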


### 8. {} - Braces

The quantifier {n,m} means at least n and at most m repetitions of the pattern to its left.

In [39]:
pat = "Green"
print(re.search("e{1,2}",pat)) # 'e' repeated at least 1 and at most 2 times
pat = "Green"
print(re.search("e{3,4}",pat)) # 'e' repeated at least 3 and at most 4 times

<_sre.SRE_Match object; span=(2, 4), match='ee'>
None
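
Besides {n,m}, Python's re module also accepts {n} (exactly n) and {n,} (n or more). A small sketch with made-up inputs, again using re.fullmatch() so the whole string must fit:

```python
import re

candidates = ["12", "123", "12345"]  # made-up inputs

# \d{3}   : exactly three digits
exactly_three = [c for c in candidates if re.fullmatch(r"\d{3}", c)]

# \d{3,}  : three or more digits
three_or_more = [c for c in candidates if re.fullmatch(r"\d{3,}", c)]

# \d{2,3} : two or three digits
two_to_three = [c for c in candidates if re.fullmatch(r"\d{2,3}", c)]

print(exactly_three)  # ['123']
print(three_or_more)  # ['123', '12345']
print(two_to_three)   # ['12', '123']
```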


### 9. | - Alternation

Vertical bar | is used for alternation (or operator).

In [46]:
pat = "green"
print(re.search("g|n",pat)) # string containing 'g' or 'n'; the first hit is 'g'
pat = "red"
print(re.search("z|r",pat)) # string containing 'z' or 'r'

<_sre.SRE_Match object; span=(0, 1), match='g'>
<_sre.SRE_Match object; span=(0, 1), match='r'>


### 10. () - Group

Parentheses () are used to group sub-patterns. For example, (a|b|c)xz matches any string that contains a, b or c followed by xz.

In [50]:
pat = "green"
print(re.search("(g|e)n",pat)) # 'g' or 'e' followed by 'n'
pat = "green"
print(re.search("(g|r)c",pat)) # 'g' or 'r' followed by 'c'

<_sre.SRE_Match object; span=(3, 5), match='en'>
None
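
Grouping matters because | has very low precedence: without parentheses it splits the entire pattern in two. The sketch below contrasts the two forms on a made-up word list (re.fullmatch() requires the whole word to fit):

```python
import re

words = ["gray", "grey", "gra", "ey"]  # made-up test words

# gr(a|e)y : 'gr', then 'a' or 'e', then 'y'
grouped = [w for w in words if re.fullmatch(r"gr(a|e)y", w)]

# gra|ey : either the whole text 'gra' or the whole text 'ey'
ungrouped = [w for w in words if re.fullmatch(r"gra|ey", w)]

print(grouped)    # ['gray', 'grey']
print(ungrouped)  # ['gra', 'ey']
```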


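### 11. \ - Backslash

Backslash \ is used to escape various characters, including all metacharacters. For example,

\$a matches if a string contains $ followed by a. Here, $ is not interpreted by a RegEx engine in a special way.

If you are unsure whether a punctuation character has a special meaning, you can put \ in front of it. This makes sure the character is not treated in a special way. The same is not true for letters and digits: there, a backslash starts a special sequence such as \d or \A, described below. Writing patterns as raw strings (r"...") keeps Python itself from interpreting the backslash.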
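
As a minimal sketch of escaping in practice: the first search escapes $ by hand, the second lets re.escape() from the standard library do it for a whole literal string. The sample text and variable names are made up.

```python
import re

price_note = "Total: $a or $5.00"   # made-up sample text

# \$ matches a literal dollar sign instead of "end of string"
escaped = re.search(r"\$a", price_note)
print(escaped.group(), escaped.span())   # $a (7, 9)

# re.escape() backslash-escapes the metacharacters in a literal string
literal = re.escape("$5.00")
found = re.search(literal, price_note)
print(found.group())                     # $5.00

# Unescaped, the period still means "any character"
loose = re.search(r"5.00", "5x00")
print(loose is not None)                 # True
```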

#### \A - Matches if the specified characters are at the start of a string.

In [52]:
pat = "The sky is blue"
print(re.search(r"\AThe",pat)) # 'The' at the start of the string
pat = "It is on the table"
print(re.search(r"\AThe",pat)) # string starts with 'It', so no match

<_sre.SRE_Match object; span=(0, 3), match='The'>
None


#### \b - Matches if the specified characters are at the beginning or end of a word.

In [57]:
pat = "Cricket"
print(re.search(r"\bCri",pat)) # 'Cri' at the beginning of a word
pat = "Hollywood"
print(re.search(r"\bwood",pat)) # 'wood' sits inside the word, not at a word boundary

<_sre.SRE_Match object; span=(0, 3), match='Cri'>
None
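
A common use of \b is matching a whole word only. The sketch below, on a made-up sentence, compares a bare pattern, a pattern with \b on the left, and a pattern with \b on both sides:

```python
import re

sentence = "The cat scattered the catalog"  # made-up sample text

# No boundary: 'cat' is found inside 'scattered' and 'catalog' as well
anywhere = re.findall(r"cat", sentence)

# \b on the left: 'cat' must begin a word ('cat' and 'catalog')
word_start = re.findall(r"\bcat", sentence)

# \b on both sides: 'cat' must be a complete word
whole_word = re.findall(r"\bcat\b", sentence)

print(len(anywhere))    # 3
print(len(word_start))  # 2
print(len(whole_word))  # 1
```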


#### \B - Matches if the specified characters are NOT at the beginning or end of a word.

In [60]:
pat = "Dollar"
print(re.search(r"\BDo",pat)) # 'Do' is at the beginning of the word, so \B fails
pat = "Cricket"
print(re.search(r"\Bck",pat)) # 'ck' is inside the word, so \B succeeds

None
<_sre.SRE_Match object; span=(3, 5), match='ck'>


#### \d - Matches any decimal digit. Equivalent to [0-9]

In [62]:
pat = "A1"
print(re.search(r"\d",pat)) # first digit in the string
pat = "Green"
print(re.search(r"\d",pat)) # no digit in "Green"

<_sre.SRE_Match object; span=(1, 2), match='1'>
None


#### \D - Matches any character that is not a decimal digit. Equivalent to [^0-9]

In [65]:
pat = "1"
print(re.search(r"\D",pat)) # the string holds only a digit, so no match
pat = "Green"
print(re.search(r"\D",pat)) # first non-digit character: 'G'

None
<_sre.SRE_Match object; span=(0, 1), match='G'>


#### \s - Matches where a string contains any whitespace character. Equivalent to [ \t\n\r\f\v].

In [67]:
pat = "I am a boy"
print(re.search(r"\s",pat)) # first whitespace character: the space after 'I'
pat = "abc"
print(re.search(r"\s",pat)) # no whitespace in "abc"

<_sre.SRE_Match object; span=(1, 2), match=' '>
None


#### \S - Matches where a string contains any non-whitespace character. Equivalent to [^ \t\n\r\f\v]

In [72]:
pat = " "
print(re.search(r"\S",pat)) # the string holds only a space, so no match
pat = "abc"
print(re.search(r"\S",pat)) # first non-whitespace character: 'a'

None
<_sre.SRE_Match object; span=(0, 1), match='a'>


#### \w - Matches any alphanumeric character (letters and digits) or the underscore _. Equivalent to [a-zA-Z0-9_] for ASCII text.

In [73]:
pat = "Green"
print(re.match(r"\w",pat)) # first character is a letter, digit or underscore
pat = "@@@"
print(re.match(r"\w",pat)) # '@' is not a letter, digit or underscore

<_sre.SRE_Match object; span=(0, 1), match='G'>
None


#### \W - Matches any non-alphanumeric character. Equivalent to [^a-zA-Z0-9_]

In [74]:
pat = "Green"
print(re.search(r"\W",pat)) # "Green" has no non-alphanumeric character
pat = "@@@"
print(re.match(r"\W",pat)) # '@' is a non-alphanumeric character

None
<_sre.SRE_Match object; span=(0, 1), match='@'>


#### \Z - Matches if the specified characters are at the end of a string.

In [77]:
pat = "football"
print(re.search(r"foo\Z",pat)) # 'foo' is at the start, not the end, so no match
pat = "It is on the table"
print(re.search(r"table\Z",pat)) # 'table' at the end of the string

None
<_sre.SRE_Match object; span=(13, 18), match='table'>


#### re.findall()
The re.findall() method returns a list of strings containing all matches.

In [79]:
string = '1 Green 2 Red 3 blue'
pattern = r'\d+'   # raw string: one or more digits

result = re.findall(pattern, string) 
print(result)

['1', '2', '3']
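
When the pattern contains groups, re.findall() returns the group contents rather than the whole match: a list of strings for one group, a list of tuples for several. A short sketch reusing the string from the cell above (the result variable names are made up):

```python
import re

string = '1 Green 2 Red 3 blue'

# One group: findall returns only what the group captured
names = re.findall(r'\d+ (\w+)', string)

# Two groups: findall returns one tuple per match
pairs = re.findall(r'(\d+) (\w+)', string)

print(names)  # ['Green', 'Red', 'blue']
print(pairs)  # [('1', 'Green'), ('2', 'Red'), ('3', 'blue')]
```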


#### re.split()
The re.split method splits the string where there is a match and returns a list of strings where the splits have occurred.

In [81]:
string = '1:red 2:green 3:blue'
pattern = r'\d+'   # raw string: one or more digits

result = re.split(pattern, string) 
print(result)

['', ':red ', ':green ', ':blue']
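
re.split() also takes an optional maxsplit argument that caps the number of splits, and a slightly wider pattern can absorb the separators around each number. A minimal sketch on the same string (the second pattern is an illustrative choice, not part of the original example):

```python
import re

string = '1:red 2:green 3:blue'

# maxsplit=1: split at the first match only, keep the rest intact
first_split = re.split(r'\d+', string, maxsplit=1)

# Also consume the optional leading space and the colon around each number
clean = re.split(r'\s*\d+:', string)

print(first_split)  # ['', ':red 2:green 3:blue']
print(clean)        # ['', 'red', 'green', 'blue']
```

The empty first element appears because the string starts with a match, so there is nothing to the left of the first split point.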


#### re.sub()
The syntax of re.sub() is re.sub(pattern, replace, string).
The method returns a string in which every match of pattern is replaced with the content of the replace variable.

In [82]:
#re.sub(pattern, replace, string)
re.sub("A", "B", "AB")

'BB'
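
Two further features of re.sub() are worth a quick sketch: the optional count argument limits how many matches are replaced, and the replacement string may refer back to captured groups with \1, \2 and so on. The date string below is made up for illustration.

```python
import re

date = "2021-03-15"  # made-up sample text

# count=1: replace only the first match
first_only = re.sub(r"-", "/", date, count=1)

# \1, \2, \3 in the replacement refer to the captured groups
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)

print(first_only)  # 2021/03-15
print(reordered)   # 15/03/2021
```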

#### re.subn()
re.subn() is similar to re.sub() except that it returns a tuple of 2 items containing the new string and the number of substitutions made.

In [84]:
# the trailing backslash continues the string literal on the next line (no newline is added)
string = 'red 1\
blue 2 \n green 3'

# matches runs of one or more whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.subn(pattern, replace, string) 
print(new_string)

('red1blue2green3', 4)


#### re.search()
The re.search() method takes two arguments: a pattern and a string. The method looks for the first location where the RegEx pattern produces a match with the string.

If the search is successful, re.search() returns a match object; if not, it returns None

In [None]:
match = re.search("x", "xyz")

In [87]:
string = "Sky is blue"

# check if 'Sky' is at the beginning
match = re.search(r'\ASky', string)

if match:
  print("pattern found inside the string")
else:
  print("pattern not found")  

# Output: pattern found inside the string

pattern found inside the string
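
The examples in this notebook mix re.match() and re.search(), so a short comparison may help: re.match() only tries the pattern at the start of the string, re.search() scans the whole string, and re.fullmatch() requires the pattern to cover the entire string. A minimal sketch (variable names are made up):

```python
import re

string = "Sky is blue"

at_start = re.match(r"blue", string)          # None: match() only tries position 0
anywhere = re.search(r"blue", string)         # finds 'blue' later in the string
whole = re.fullmatch(r"Sky is blue", string)  # the pattern must cover the entire string

print(at_start)           # None
print(anywhere.span())    # (7, 11)
print(whole is not None)  # True
```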



#### Match object
You can list the methods and attributes of a match object using the dir() function.

Some of the commonly used methods and attributes of match objects are:

match.group()
The group() method returns the part of the string where there is a match.



In [88]:
string = '39801 356, 2102 1111'

# Three digit number followed by space followed by two digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string) 

if match:
  print(match.group())
else:
  print("pattern not found")

801 35
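
Because the pattern above contains two parenthesized groups, the individual parts of the match can be read back by group number. A brief sketch using the same string and pattern:

```python
import re

string = '39801 356, 2102 1111'
match = re.search(r'(\d{3}) (\d{2})', string)

print(match.group(1))     # first group: 801
print(match.group(2))     # second group: 35
print(match.group(1, 2))  # several groups at once: ('801', '35')
print(match.groups())     # all groups as a tuple: ('801', '35')
```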


#### match.start(), match.end() and match.span()
The start() method returns the index of the start of the matched substring. Similarly, end() returns the end index of the matched substring, and span() returns both as a tuple (start, end).

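In [None]:
print(match.start())  # index where the match begins
print(match.end())    # index just past the end of the match
print(match.span())   # (start, end) as a tuple

2
8
(2, 8)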
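
The span lines up with ordinary string slicing, and start(), end() and span() also accept a group number. A minimal sketch, assuming the same string and pattern as in the match.group() example:

```python
import re

string = '39801 356, 2102 1111'
match = re.search(r'(\d{3}) (\d{2})', string)

start, end = match.span()          # same values as match.start() and match.end()
matched_text = string[start:end]   # slicing with the span reproduces match.group()

print(start, end)      # 2 8
print(matched_text)    # 801 35
print(match.span(1))   # span of the first group: (2, 5)
print(match.span(2))   # span of the second group: (6, 8)
```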