A RegEx, or Regular Expression, is a special sequence of characters that forms a search pattern.

RegEx can be used to check if a **string** contains the specified search pattern.

A pattern can also be divided into one or more sub-patterns (groups), so that parts of a match can be retrieved separately.

Python provides the re module, whose search function takes a regular expression and a string and returns the first match it finds.



In [3]:
import re

In [4]:
txt="The rain in Spain"
x=re.search("^The.*Spain$", txt)

In [5]:
x

<re.Match object; span=(0, 17), match='The rain in Spain'>
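
The Match object printed above carries both the position and the text of the match. A minimal sketch of reading those values with the standard `span()` and `group()` methods; the variable name `match` is only illustrative:

```python
import re

txt = "The rain in Spain"
match = re.search(r"^The.*Spain$", txt)

# re.search returns None when nothing matches, so check before using the result.
if match:
    print(match.span())   # (start, end) indices of the match
    print(match.group())  # the matched text itself
```

Because a failed search gives `None` rather than an empty object, the `if match:` guard is worth keeping whenever the input is not known in advance.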

### RegEx Functions

The re module offers a set of functions that allow us to search a string for a match:

findall: Returns a list containing all matches.

search: Returns a Match object if there is a match anywhere in the string.

split: Returns a list where the string has been split at each match.

sub: Replaces one or many matches with a string.
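
The four functions listed above can be compared side by side on one string. This is a small sketch rather than a complete tour; the variable names are made up for the example:

```python
import re

txt = "The rain in Spain"

all_ai = re.findall(r"ai", txt)       # every non-overlapping occurrence of "ai"
first_space = re.search(r"\s", txt)   # Match object for the first whitespace character
words = re.split(r"\s", txt)          # split the string at each whitespace character
dashed = re.sub(r"\s", "-", txt)      # replace each whitespace character with "-"

print(all_ai)
print(first_space.start())
print(words)
print(dashed)
```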

### Metacharacters

Metacharacters are characters with a special meaning:

In [31]:
# []	A set of characters

txt = "The rain in Spain"

#Find all lower case characters alphabetically between "a" and "m":

x = re.findall("[a-m]", txt)
print(x)

y=re.search("[a-m]", txt)
print(y)


['h', 'e', 'a', 'i', 'i', 'a', 'i']
<re.Match object; span=(1, 2), match='h'>
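
Sets can also be negated or use other ranges. The sketch below assumes the same sample string and shows a negated set (a leading `^` inside the brackets) next to an upper-case range; the variable names are illustrative:

```python
import re

txt = "The rain in Spain"

# A leading ^ inside the brackets negates the set:
# match every character that is neither in a-m nor a space.
not_a_to_m = re.findall(r"[^a-m ]", txt)

# A range of upper-case letters.
upper_case = re.findall(r"[A-Z]", txt)

print(not_a_to_m)
print(upper_case)
```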


`\` : Signals a special sequence (can also be used to escape special characters)

In [8]:
txt = "That will be 59 dollars"

#Find all digit characters:

x = re.findall(r"\d", txt)
print(x)

['5', '9']
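
The backslash's second role, escaping special characters, can be sketched as follows. The sample text is invented for the illustration; `re.escape` is the standard-library helper that escapes a whole string at once:

```python
import re

txt = "Version 1.5 or 105 (approx.)"

# An unescaped "." matches any character, so "105" matches too.
any_char = re.findall(r"1.5", txt)

# "\." matches only a literal dot.
literal_dot = re.findall(r"1\.5", txt)

# re.escape() escapes every special character in a string,
# which helps when the text to find contains "(", ")" or ".".
found = re.findall(re.escape("(approx.)"), txt)

print(any_char)
print(literal_dot)
print(found)
```

Writing patterns as raw strings (`r"..."`) keeps Python from interpreting the backslash itself before the re module sees it.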


`.` : Any character (except the newline character)

In [9]:
txt = "hello planet"

#Search for a sequence that starts with "he", followed by two (any) characters, and an "o":

x = re.findall("he..o", txt)
print(x)


['hello']


`^` : Starts with

In [10]:
txt = "hello planet"

#Check if the string starts with 'hello':

x = re.findall("^hello", txt)
if x:
  print("Yes, the string starts with 'hello'")
else:
  print("No match")


Yes, the string starts with 'hello'


`$` : Ends with

In [14]:
#Check if the string ends with 'planet':

x = re.findall("planet$", txt)
if x:
  print("Yes, the string ends with 'planet'")
else:
  print("No match")

Yes, the string ends with 'planet'


`*` : Zero or more occurrences

In [22]:
txt = "hello planet"

#Search for a sequence that starts with "he", followed by 0 or more (any) characters, and an "o":

x = re.findall("he.*o", txt)

print(x)

['hello']


`+` : One or more occurrences

In [18]:
txt = "hello planet"

#Search for a sequence that starts with "he", followed by 1 or more (any) characters, and an "o":

x = re.findall("he.+o", txt)

print(x)

['hello']


`?` : Zero or one occurrence

In [25]:
txt = "hello planet"

#Search for a sequence that starts with "he", followed by 0 or 1 (any) character, and an "o":

x = re.findall("he.?o", txt)

print(x)

#This time we got no match, because there were not zero, not one, but two characters between "he" and the "o"

[]


`{}` : Exactly the specified number of occurrences

In [26]:
#Search for a sequence that starts with "he", followed by exactly 2 (any) characters, and an "o":

x = re.findall("he.{2}o", txt)

print(x)

['hello']
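
The braces also accept a range. A short sketch, using an invented sample string, of `{n}`, `{m,n}` and `{n,}` applied to the letter "l":

```python
import re

txt = "heo helo hello helllo"

exactly_two = re.findall(r"hel{2}o", txt)    # exactly two "l"
one_to_two = re.findall(r"hel{1,2}o", txt)   # one or two "l"
two_or_more = re.findall(r"hel{2,}o", txt)   # two or more "l"

print(exactly_two)
print(one_to_two)
print(two_or_more)
```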


`|` : Either or

In [27]:
txt = "The rain in Spain falls mainly in the plain!"

#Check if the string contains either "falls" or "stays":

x = re.findall("falls|stays", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['falls']
Yes, there is at least one match!


### Special Sequences

A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

**\A** :	Returns a match if the specified characters are at the beginning of the string

In [32]:
txt = "The rain in Spain"

#Check if the string starts with "The":

x = re.findall(r"\AThe", txt)

print(x)

if x:
  print("Yes, there is a match!")
else:
  print("No match")


['The']
Yes, there is a match!


**\b** : Returns a match where the specified characters are at the beginning or at the end of a word.


In [33]:
#Check if "ain" is present at the beginning of a WORD:

x = re.findall(r"\bain", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


[]
No match


In [35]:
#Check if "ain" is present at the end of a WORD:

x = re.findall(r"ain\b", txt)
print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['ain', 'ain']
Yes, there is at least one match!


**\B** :	Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word

In [36]:
#Check if "ain" is present, but NOT at the beginning of a word:

x = re.findall(r"\Bain", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['ain', 'ain']
Yes, there is at least one match!


In [37]:
#Check if "ain" is present, but NOT at the end of a word:

x = re.findall(r"ain\B", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match
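
A common use of `\b` is matching a whole word only. The sketch below, on the same sample string, places `\b` on both sides of "in" so that the occurrences inside "rain" and "Spain" are skipped:

```python
import re

txt = "The rain in Spain"

# "in" also occurs inside "rain" and "Spain".
anywhere = re.findall(r"in", txt)

# \b on both sides keeps only the standalone word "in".
whole_word = re.findall(r"\bin\b", txt)

print(anywhere)
print(whole_word)
```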


**\d** : Returns a match where the string contains digits (numbers from 0-9)

In [38]:
#Check if the string contains any digits (numbers from 0-9):

x = re.findall(r"\d", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


**\D** : Returns a match at every character that is NOT a digit

In [39]:
#Return a match at every non-digit character:

x = re.findall(r"\D", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['T', 'h', 'e', ' ', 'r', 'a', 'i', 'n', ' ', 'i', 'n', ' ', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


**\s** :	Returns a match where the string contains a white space character

In [40]:
#Return a match at every white-space character:

x = re.findall(r"\s", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


[' ', ' ', ' ']
Yes, there is at least one match!


**\S** :	Returns a match at every character that is NOT a white space character

In [41]:
#Return a match at every NON white-space character:

x = re.findall(r"\S", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


**\w** :	Returns a match where the string contains any word characters

In [42]:
#Return a match at every word character (letters a-z and A-Z, digits 0-9, and the underscore _ character):

x = re.findall(r"\w", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


**\W** :	Returns a match at every character that is NOT a word character

In [43]:
#Return a match at every NON word character (anything other than letters, digits and the underscore, such as "!", "?" and white space):

x = re.findall(r"\W", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


[' ', ' ', ' ']
Yes, there is at least one match!
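
Special sequences become more useful when combined with the quantifiers from the Metacharacters section. A minimal sketch that pulls whole numbers and whole words out of a sentence instead of single characters:

```python
import re

txt = "That will be 59 dollars"

numbers = re.findall(r"\d+", txt)   # runs of one or more digits
words = re.findall(r"\w+", txt)     # runs of one or more word characters

print(numbers)
print(words)
```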


**\Z** :	Returns a match if the specified characters are at the end of the string

In [47]:
#Check if the string ends with "Spain":

x = re.findall(r"Spain\Z", txt)

print(x)

if x:
  print("Yes, there is a match!")
else:
  print("No match")


['Spain']
Yes, there is a match!
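
`\Z` and `$` look interchangeable, but in Python they differ when the string ends with a newline. A small sketch of that difference, using the sample string with a newline appended:

```python
import re

txt = "The rain in Spain\n"   # note the trailing newline

# $ matches at the end of the string and also just before a trailing newline.
with_dollar = re.findall(r"Spain$", txt)

# \Z matches only at the very end of the string.
with_z = re.findall(r"Spain\Z", txt)

print(with_dollar)
print(with_z)
```

When reading lines from a file, which usually keep their trailing newline, `$` is therefore the more forgiving choice and `\Z` the stricter one.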
