A RegEx, or Regular Expression, is a special sequence of characters that forms a search pattern.

RegEx can be used to check if a **string** contains the specified search pattern.

A pattern can also be divided into one or more sub-patterns (groups), so that parts of a match can be retrieved separately.

Python provides the re module. Its search() function takes a regular expression and a string, and looks for the first location in the string where the pattern matches.



In [3]:
import re

In [4]:
txt="The rain in Spain"
x=re.search("^The.*Spain$", txt)

In [5]:
x

<re.Match object; span=(0, 17), match='The rain in Spain'>

### RegEx Functions

The re module offers a set of functions that allow us to search a string for a match:

findall: Returns a list containing all matches.

search: Returns a Match object if there is a match anywhere in the string.

split: Returns a list where the string has been split at each match.

sub: Replaces one or many matches with a string.
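The four functions listed above can be compared side by side. The following is a minimal sketch that reuses the sample sentence from the first example; the variable names are only illustrative.

```python
import re

text = "The rain in Spain"  # sample string reused from the first example

# findall: every non-overlapping match, as a list of strings
all_matches = re.findall(r"ai", text)

# search: the first match only, wrapped in a Match object (or None)
first_match = re.search(r"ai", text)

# split: break the string wherever the pattern matches
pieces = re.split(r"\s", text)

# sub: replace every match with the given replacement text
replaced = re.sub(r"\s", "_", text)

print(all_matches)         # ['ai', 'ai']
print(first_match.span())  # (5, 7)
print(pieces)              # ['The', 'rain', 'in', 'Spain']
print(replaced)            # The_rain_in_Spain
```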

### Metacharacters

Metacharacters are characters with a special meaning:

In [31]:
# []	A set of characters

txt = "The rain in Spain"

#Find all lower case characters alphabetically between "a" and "m":

x = re.findall("[a-m]", txt)
print(x)

y=re.search("[a-m]", txt)
print(y)


['h', 'e', 'a', 'i', 'i', 'a', 'i']
<re.Match object; span=(1, 2), match='h'>


"\" : Signals a special sequence (can also be used to escape special characters)

In [8]:
txt = "That will be 59 dollars"

#Find all digit characters:

x = re.findall(r"\d", txt)
print(x)

['5', '9']


"." : Any character (except newline character)

In [9]:
txt = "hello planet"

#Search for a sequence that starts with "he", followed by two (any) characters, and an "o":

x = re.findall("he..o", txt)
print(x)


['hello']


"^" : Starts with

In [10]:
txt = "hello planet"

#Check if the string starts with 'hello':

x = re.findall("^hello", txt)
if x:
  print("Yes, the string starts with 'hello'")
else:
  print("No match")


Yes, the string starts with 'hello'


"$" : Ends with

In [14]:
#Check if the string ends with 'planet' (txt is still "hello planet" from the previous cell):
x = re.findall("planet$", txt)
if x:
  print("Yes, the string ends with 'planet'")
else:
  print("No match")

Yes, the string ends with 'planet'


"*" : Zero or more occurrences

In [22]:
txt = "hello planet"

#Search for a sequence that starts with "he", followed by 0 or more  (any) characters, and an "o":

x = re.findall("he.*o", txt)

print(x)

['hello']


"+" : One or more occurrences

In [18]:
txt = "hello planet"

#Search for a sequence that starts with "he", followed by 1 or more  (any) characters, and an "o":

x = re.findall("he.+o", txt)

print(x)

['hello']


"?" : Zero or one occurrence

In [25]:
txt = "hello planet"

#Search for a sequence that starts with "he", followed by 0 or 1  (any) character, and an "o":

x = re.findall("he.?o", txt)

print(x)

#This time we got no match, because there were not zero, not one, but two characters between "he" and the "o"

[]


"{}" : Exactly the specified number of occurrences

In [26]:
#Search for a sequence that starts with "he", followed by exactly 2 (any) characters, and an "o":

x = re.findall("he.{2}o", txt)

print(x)

['hello']


"|" : Either or

In [27]:
txt = "The rain in Spain falls mainly in the plain!"

#Check if the string contains either "falls" or "stays":

x = re.findall("falls|stays", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['falls']
Yes, there is at least one match!
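The quantifiers *, +, ? and {} shown earlier in this section all give the same result on "hello", so their differences are easier to see on several inputs at once. The sketch below uses a hypothetical list of words and re.fullmatch(), a standard re function not used elsewhere in this notebook, which only succeeds when the whole string fits the pattern.

```python
import re

# Hypothetical sample words with 0, 1, 2 and 3 characters between "he" and "o"
words = ["heo", "helo", "hello", "helllo"]

patterns = {
    "he.*o": "zero or more",
    "he.+o": "one or more",
    "he.?o": "zero or one",
    "he.{2}o": "exactly two",
}

results = {}
for pattern, meaning in patterns.items():
    # fullmatch requires the whole word to fit the pattern
    results[pattern] = [w for w in words if re.fullmatch(pattern, w)]
    print(f"{pattern:8} ({meaning}): {results[pattern]}")
```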


## Special Sequences

A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

**\A** :	Returns a match if the specified characters are at the beginning of the string

In [32]:
txt = "The rain in Spain"

#Check if the string starts with "The":

x = re.findall(r"\AThe", txt)

print(x)

if x:
  print("Yes, there is a match!")
else:
  print("No match")


['The']
Yes, there is a match!


**\b** : Returns a match where the specified characters are at the beginning or at the end of a word.


In [33]:
#Check if "ain" is present at the beginning of a WORD:

x = re.findall(r"\bain", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


[]
No match


In [35]:
#Check if "ain" is present at the End of a WORD:
x=re.findall(r"ain\b", txt)
print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['ain', 'ain']
Yes, there is at least one match!


**\B** :	Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word

In [36]:
#Check if "ain" is present, but NOT at the beginning of a word:

x = re.findall(r"\Bain", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['ain', 'ain']
Yes, there is at least one match!


In [37]:
#Check if "ain" is present, but NOT at the end of a word:

x = re.findall(r"ain\B", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match
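A common use of \b is to match a whole word rather than a fragment inside a longer word. A minimal sketch, using a slightly longer sample sentence than the one above:

```python
import re

text = "The rain in Spain stays mainly in the plain"

# Without boundaries, "in" is also found inside longer words
anywhere = re.findall(r"in", text)

# With \b on both sides, only the standalone word "in" matches
whole_word = re.findall(r"\bin\b", text)

print(len(anywhere))  # 6
print(whole_word)     # ['in', 'in']
```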


**\d** : Returns a match where the string contains digits (numbers from 0-9)

In [38]:
#Check if the string contains any digits (numbers from 0-9):

x = re.findall(r"\d", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


**\D** : Returns a match for every character that is NOT a digit

In [39]:
#Return a match at every non-digit character:

x = re.findall(r"\D", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['T', 'h', 'e', ' ', 'r', 'a', 'i', 'n', ' ', 'i', 'n', ' ', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


**\s** :	Returns a match where the string contains a white space character

In [40]:
#Return a match at every white-space character:

x = re.findall(r"\s", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


[' ', ' ', ' ']
Yes, there is at least one match!


**\S** : Returns a match for every character that is NOT a white space character

In [41]:
#Return a match at every NON white-space character:

x = re.findall(r"\S", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


**\w** :	Returns a match where the string contains any word characters

In [42]:
#Return a match at every word character (letters, digits, and the underscore _ character):

x = re.findall(r"\w", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


**\W** : Returns a match for every character that is NOT a word character

In [43]:
#Return a match at every NON word character (anything other than letters, digits, and the underscore, such as "!", "?", and white-space):

x = re.findall(r"\W", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


[' ', ' ', ' ']
Yes, there is at least one match!


**\Z** : Returns a match if the specified characters are at the end of the string

In [47]:
#Check if the string ends with "Spain":

x = re.findall(r"Spain\Z", txt)

print(x)

if x:
  print("Yes, there is a match!")
else:
  print("No match")


['Spain']
Yes, there is a match!
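Both ^ and \A are described above as "starts with", and on a single-line string they behave the same. The difference shows up with the re.MULTILINE flag, which is part of the standard re module but not used elsewhere in this notebook. The two-line sample string below is made up for illustration.

```python
import re

text = "The rain\nThe plain"

# By default ^ and \A both match only at the start of the string
print(re.findall(r"^The", text))   # ['The']
print(re.findall(r"\AThe", text))  # ['The']

# With re.MULTILINE, ^ also matches right after each newline; \A does not
caret_multiline = re.findall(r"^The", text, flags=re.MULTILINE)
anchor_multiline = re.findall(r"\AThe", text, flags=re.MULTILINE)
print(caret_multiline)   # ['The', 'The']
print(anchor_multiline)  # ['The']
```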


## Sets

A set is a set of characters inside a pair of square brackets [] with a special meaning:

**[arn]** :	Returns a match where one of the specified characters (a, r, or n) is present

In [48]:
#Check if the string has any a, r, or n characters:

x = re.findall("[arn]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['r', 'a', 'n', 'n', 'a', 'n']
Yes, there is at least one match!


**[a-n]** :	Returns a match for any lower case character, alphabetically between a and n

In [50]:
#Check if the string has any characters between a and n:
x = re.findall("[a-n]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']
Yes, there is at least one match!


**[^arn]** :	Returns a match for any character EXCEPT a, r, and n

In [51]:
#Check if the string has other characters than a, r, or n:

x = re.findall("[^arn]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i']
Yes, there is at least one match!


**[0123]** : Returns a match where any of the specified digits (0, 1, 2, or 3) are present

In [52]:
#Check if the string has any 0, 1, 2, or 3 digits:

x = re.findall("[0123]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


[]
No match


**[0-9]**	: Returns a match for any digit between 0 and 9

In [59]:
txt = "8 times before 11:45 AM"
#Check if the string has any digits:

x = re.findall("[0-9]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['8', '1', '1', '4', '5']
Yes, there is at least one match!


**[0-5][0-9]** : Returns a match for any two-digit number from 00 to 59

In [54]:
txt = "8 times before 11:45 AM"

#Check if the string has any two-digit numbers, from 00 to 59:

x = re.findall("[0-5][0-9]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['11', '45']
Yes, there is at least one match!
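Sets like these can be combined into a stricter check. The sketch below is one possible way to validate a 24-hour HH:MM time; the pattern, the candidate strings, and the use of re.fullmatch() and the non-capturing group (?:...) are assumptions added for illustration, not part of the examples above.

```python
import re

# Hours 00-23: either [01] followed by any digit, or 2 followed by 0-3.
# Minutes 00-59: [0-5][0-9], as in the example above.
time_pattern = r"(?:[01][0-9]|2[0-3]):[0-5][0-9]"

candidates = ["11:45", "23:59", "24:00", "09:60", "7:30"]

# fullmatch only succeeds if the entire string fits the pattern
valid = [t for t in candidates if re.fullmatch(time_pattern, t)]
print(valid)  # ['11:45', '23:59']
```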


**[a-zA-Z]** :	Returns a match for any character alphabetically between a and z, lower case OR upper case

In [55]:
#Check if the string has any characters from a to z lower case, and A to Z upper case:

x = re.findall("[a-zA-Z]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['t', 'i', 'm', 'e', 's', 'b', 'e', 'f', 'o', 'r', 'e', 'A', 'M']
Yes, there is at least one match!


**[+]** : In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string


In [58]:
txt = "8 times before 11:45 AM"

#Check if the string has any + characters:

x = re.findall("[+]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


[]
No match
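Because the sample string above contains no + character, the result is empty. A small sketch with a hypothetical string that does contain metacharacters shows that [+] inside a set and an escaped \+ outside a set find the same thing:

```python
import re

expr = "2+3*4 = 14"  # hypothetical sample that actually contains + and *

in_set = re.findall(r"[+]", expr)   # + is literal inside a set
escaped = re.findall(r"\+", expr)   # outside a set it must be escaped
both = re.findall(r"[+*]", expr)    # several literal metacharacters in one set

print(in_set)   # ['+']
print(escaped)  # ['+']
print(both)     # ['+', '*']
```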


## The findall() Function

The findall() function returns a list containing all matches.

In [60]:
txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

['ai', 'ai']


The list contains the matches in the order they are found.

If no matches are found, an empty list is returned:

In [61]:
txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)

[]


## The search() Function

The search() function searches the string for a match, and returns a **Match object** if there is a match.

If there is more than one match, only the first occurrence of the match will be returned:

In [62]:
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())


The first white-space character is located in position: 3


If no matches are found, the value None is returned:

In [63]:
x = re.search("Portugal", txt)
print(x)

None
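Since search() may return None, calling a method such as .start() directly on the result can raise an AttributeError. One way to guard against that is sketched below; describe_match is a hypothetical helper name, not part of the re module.

```python
import re

txt = "The rain in Spain"

def describe_match(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    match = re.search(pattern, text)
    if match is None:
        # Calling match.start() here would raise AttributeError
        return "no match"
    return f"{match.group()!r} at {match.span()}"

print(describe_match(r"ai", txt))        # 'ai' at (5, 7)
print(describe_match(r"Portugal", txt))  # no match
```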


## The split() Function

The split() function returns a list where the string has been split at each match:

In [64]:
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']


You can control the number of splits by specifying the maxsplit parameter:


In [65]:
x = re.split(r"\s", txt, maxsplit=1)
print(x)

['The', 'rain in Spain']
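Real input often has repeated or mixed separators. The sketch below, using a made-up input string, shows how splitting on a single \s differs from splitting on a set of separators followed by +:

```python
import re

messy = "The  rain,in;Spain"  # hypothetical input: double space, comma, semicolon

# Splitting on a single whitespace character leaves an empty string
# between the two consecutive spaces
print(re.split(r"\s", messy))  # ['The', '', 'rain,in;Spain']

# A set plus + treats any run of separators as one split point
parts = re.split(r"[\s,;]+", messy)
print(parts)                   # ['The', 'rain', 'in', 'Spain']
```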


## The sub() Function

The sub() function replaces the matches with the text of your choice:

In [66]:
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain


You can control the number of replacements by specifying the count parameter:

In [67]:
x = re.sub(r"\s", "9", txt, count=2)
print(x)

The9rain9in Spain
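The replacement text can also refer back to parts of the match. The following sketch assumes a made-up string of ISO-style dates and uses capture groups with the backreferences \1, \2 and \3 to reorder them:

```python
import re

dates = "2024-01-31 and 2023-12-05"  # hypothetical sample data

# Three capture groups: year, month, day.
# \3, \2 and \1 in the replacement refer back to those groups.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", dates)
print(swapped)     # 31/01/2024 and 05/12/2023

# count limits how many matches are replaced
first_only = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", dates, count=1)
print(first_only)  # 31/01/2024 and 2023-12-05
```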


## Match Object

A Match Object is an object containing information about the search and the result.

Note: If there is no match, the value None will be returned, instead of the Match Object.

In [68]:
x = re.search("ai", txt)
print(x) #this will print an object

<re.Match object; span=(5, 7), match='ai'>


The Match object has properties and methods used to retrieve information about the search and the result:

.span() returns a tuple containing the start and end positions of the match.


In [69]:
# The regular expression looks for any word that starts with an upper case "S":

x = re.search(r"\bS\w+", txt)
print(x.span())

(12, 17)


.string returns the string passed into the function

In [70]:
x = re.search(r"\bS\w+", txt)
print(x.string)

The rain in Spain


.group() returns the part of the string where there was a match

In [71]:
x = re.search(r"\bS\w+", txt)
print(x.group())

Spain
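When the pattern contains parentheses, the Match object also exposes each captured group. A minimal sketch, adding two groups to the pattern used above (the grouping is an illustrative choice):

```python
import re

txt = "The rain in Spain"

# Two capture groups: the capital letter, then the rest of the word
m = re.search(r"\b(S)(\w+)", txt)

print(m.group())           # Spain (same as m.group(0), the whole match)
print(m.group(1))          # S
print(m.group(2))          # pain
print(m.groups())          # ('S', 'pain')
print(m.start(), m.end())  # 12 17
```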


In [75]:
phn="7892-0077-580"

if re.search(r"\d{4}-\d{3}-\d{3}", phn):
    print("it is a valid number")
else:
    print("incorrect number")
    

incorrect number
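Note that search() accepts a string as soon as the pattern occurs anywhere inside it. For validation it is often safer to require that the whole string fits the pattern, which re.fullmatch() does. The sketch below assumes the intended format is four digits, three digits, three digits, as in the pattern above; is_valid_phone is a hypothetical helper name.

```python
import re

def is_valid_phone(number):
    """Return True if number has the shape DDDD-DDD-DDD and nothing else."""
    # Assumed format, taken from the pattern used in the example above
    return re.fullmatch(r"\d{4}-\d{3}-\d{3}", number) is not None

print(is_valid_phone("7892-007-580"))      # True
print(is_valid_phone("7892-0077-580"))     # False
# False here, although re.search would find a match inside this string
print(is_valid_phone("x7892-007-580123"))
```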


In [85]:
email="chandanregins1@gmail.com  kumar217462@gmail.com van.chandan@yahoo.in nayak.layak@gmail.com"

x = re.findall(r"[\w.%+-]+@[\w-]+\.[a-zA-Z]{2,4}", email)
print(x)

['chandanregins1@gmail.com', 'kumar217462@gmail.com', 'van.chandan@yahoo.in', 'nayak.layak@gmail.com']


In [91]:
# The dot before the top-level domain is escaped so it matches a literal "."
x = re.findall(r"[\w.]{0,20}@\w{0,15}\.[a-zA-Z]{0,3}", email)
print(x)

['chandanregins1@gmail.com', 'kumar217462@gmail.com', 'van.chandan@yahoo.in', 'nayak.layak@gmail.com']


In [92]:
type(x)

list