# Regular Expressions

Regular expressions (RegEx) are used for pattern matching, for example to check whether a string contains a specified pattern.

In [1]:
import re

In [2]:
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

In [3]:
if x:
    print("Woah! That's a match!")
else:
    print("Nope. Sorry, nothing")

Woah! That's a match!
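
To make the role of the `^` and `$` anchors in the pattern above more concrete, here is a small sketch that runs the same pattern against a few strings. The sample strings are hypothetical and were chosen only to show one passing and two failing cases.

```python
import re

pattern = r"^The.*Spain$"

# Hypothetical sample strings, one per case we want to illustrate.
samples = [
    "The rain in Spain",        # starts with "The" and ends with "Spain"
    "The rain in Spain falls",  # does not end with "Spain"
    "In Spain",                 # does not start with "The"
]

results = [bool(re.search(pattern, s)) for s in samples]

for s, ok in zip(samples, results):
    print(repr(s), "->", ok)
```

Only the first string should satisfy both anchors at once.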


<table>
  <thead>
    <tr>
      <th>Function</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>findall</td>
      <td>Returns a list containing all matches</td>
    </tr>
    <tr>
      <td>search</td>
      <td>Returns a Match object if there is a match anywhere in the string</td>
    </tr>
    <tr>
      <td>split</td>
      <td>Returns a list where the string has been split at each match</td>
    </tr>
    <tr>
      <td>sub</td>
      <td>Replaces one or many matches with a string</td>
    </tr>
  </tbody>
</table>

## Metacharacters

<table>
  <thead>
    <tr>
      <th>Character</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>[ ]</td>
      <td>A set of characters to match.</td>
    </tr>
    <tr>
      <td>\</td>
      <td>Escape sequence or used to signal a special sequence.</td>
    </tr>
    <tr>
      <td>.</td>
      <td>Any character except new line character.</td>
    </tr>
    <tr>
      <td>^</td>
      <td>Starts with.</td>
    </tr>
    <tr>
      <td>$</td>
      <td>Ends with.</td>
    </tr>  
    <tr>
      <td>*</td>
      <td>Zero or more occurrences.</td>
    </tr>
    <tr>
      <td>+</td>
      <td>One or more occurrences.</td>
    </tr>
    <tr>
      <td>?</td>
      <td>Zero or one occurrence.</td>
    </tr>
    <tr>
      <td>{ }</td>
      <td>Exactly the specified number of occurrences.</td>
    </tr>
    <tr>
      <td>|</td>
      <td>Either or</td>
    </tr>
    <tr>
      <td>( )</td>
      <td>Capture and group.</td>
    </tr>
  </tbody>
</table>

In [4]:
txt = "The rain in Spain"
# find all lower case characters from 'a' to 'm' in the string.

x = re.findall("[a-m]", txt)
x

['h', 'e', 'a', 'i', 'i', 'a', 'i']

In [5]:
txt = 'That will be 59 dollars'

# find all digit characters (raw string so that \d reaches the regex engine unchanged).

x = re.findall(r"\d", txt)
x

['5', '9']

In [6]:
txt = "hello planet"

# search for a sequence that starts with "he", followed by two (any) characters, and an "o".

x = re.findall("he..o", txt)
x

['hello']

In [7]:
# check whether the string starts with 'hello'.

x = re.findall('^hello', txt)
bool(x)

True

In [8]:
# check whether the string ends with 'planet'.

x = re.findall('planet$', txt)
bool(x)

True

In [9]:
# Search for a sequence that starts with 'he', followed by 0 or more (any) characters, and an 'o':

x = re.findall("he.*o", txt)
bool(x)

True

In [10]:
# Search for a sequence that starts with "he", followed by 0 or 1 (any) character, and an "o":

x = re.findall('he.?o', txt)
bool(x)

False

In [11]:
# Search for a sequence that starts with "he", followed by exactly 2 (any) characters, and an "o":

x = re.findall('he.{2}o', txt)
bool(x)

True

In [12]:
txt = "The rain in Spain falls mainly in the plain!"

#Check if the string contains either "falls" or "stays":

x = re.findall('falls|stays', txt)
bool(x)

True
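
The cells above do not exercise `( )` or the range form of `{ }`, so here is a minimal sketch of both. It assumes the same sentence as the previous cell; the variable names `prefixes` and `short_words` are introduced here for illustration only.

```python
import re

txt = "The rain in Spain falls mainly in the plain!"

# ( ) captures part of each match: here, the letters that come before "ain".
# When a pattern has one group, findall() returns the group, not the whole match.
prefixes = re.findall(r"(\w+)ain", txt)
print(prefixes)      # ['r', 'Sp', 'm', 'pl']

# {2,4} means "between 2 and 4 occurrences"; \b keeps the match to whole words.
short_words = re.findall(r"\b[a-z]{2,4}\b", txt)
print(short_words)   # ['rain', 'in', 'in', 'the']
```

Note that "The" is not reported by the second pattern because `[a-z]` only covers lower case letters.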

## Special sequences

<table>
  <thead>
    <tr>
      <th>Character</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>\A</td>
      <td>Returns a match if the specified characters are at the beginning of the string</td>
    </tr>
    <tr>
      <td>\b</td>
      <td>Returns a match where the specified characters are at the beginning or at the end of a word</td>
    </tr>
    <tr>
      <td>\B</td>
      <td>NOT of \b</td>
    </tr>
    <tr>
      <td>\d</td>
      <td>Returns a match where the string contains digits (0-9)</td>
    </tr>
    <tr>
      <td>\D</td>
      <td>NOT of \d</td>
    </tr>  
    <tr>
      <td>\s</td>
      <td>Returns a match where the string contains a white space character.</td>
    </tr>
    <tr>
      <td>\S</td>
      <td>NOT of \s</td>
    </tr>
    <tr>
      <td>\w</td>
      <td>Returns a match where the string contains any word character (a to z, A to Z, 0-9, and _)</td>
    </tr>
    <tr>
      <td>\W</td>
      <td>NOT of \w</td>
    </tr>
    <tr>
      <td>\Z</td>
      <td>Returns a match if the specified characters are at the end of the string</td>
    </tr>
  </tbody>
</table>

In [13]:
txt = "The rain in Spain"

#Check if the string starts with "The":

x = re.findall(r'\AThe', txt)
bool(x)

True

In [14]:
#Check if "ain" is present at the beginning of a WORD:

x = re.findall(r'\bain', txt)
bool(x)

False

In [15]:
#Check if "ain" is present at the end of a WORD:

x = re.findall(r'ain\b', txt)
x

['ain', 'ain']

In [16]:
#Check if the string contains any digits (numbers from 0-9):

x = re.findall(r'\d', txt)
bool(x)

False

In [17]:
#Check if the string contains any white space character:

x = re.findall(r'\s', txt)
bool(x)

True
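
The negated sequences and `\Z` from the table are not used in the cells above. The following sketch shows a few of them, assuming the same `txt` as before; the short string `"a1b2"` is a made-up input used only to demonstrate `\D`.

```python
import re

txt = "The rain in Spain"

# \B is the opposite of \b: "ain" must be preceded by another word character.
inside = re.findall(r"\Bain", txt)
print(inside)            # ['ain', 'ain']

# \w+ collects runs of word characters, which here are the words themselves.
words = re.findall(r"\w+", txt)
print(words)             # ['The', 'rain', 'in', 'Spain']

# \D matches any character that is not a digit.
non_digits = re.findall(r"\D", "a1b2")
print(non_digits)        # ['a', 'b']

# \Z anchors the pattern to the very end of the string.
ends_with_spain = bool(re.search(r"Spain\Z", txt))
print(ends_with_spain)   # True
```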

## Sets

<table>
  <thead>
    <tr>
      <th>Character</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>[arn]</td>
      <td>Returns a match where one of the specified characters (a, r, or n) is present</td>
    </tr>
    <tr>
      <td>[a-n]</td>
      <td>Returns a match for any lower case character between a and n</td>
    </tr>
    <tr>
      <td>[^arn]</td>
      <td>Returns a match for any character EXCEPT a, r, and n</td>
    </tr>
    <tr>
      <td>[0123]</td>
      <td>Returns a match where any of the specified digits (0, 1, 2, or 3) are present</td>
    </tr>
    <tr>
      <td>[0-9]</td>
      <td>Returns a match for any digit between 0 and 9</td>
    </tr> 
    <tr>
      <td>[0-5][0-9]</td>
      <td>Returns a match for any two-digit numbers between 00 and 59</td>
    </tr>
    <tr>
      <td>[a-zA-Z]</td>
      <td>Returns a match for any character alphabetically between a and z, lower or upper case</td>
    </tr>
    <tr>
      <td>[+]</td>
      <td>Returns a match for any + character in the string (inside a set, + has no special meaning)</td>
    </tr>
  </tbody>
</table>

In [19]:
from collections import Counter
txt = "The rain in Spain"

#Check if the string has any a, r, or n characters:

x = re.findall('[arn]', txt)
print(x)
Counter(x)

['r', 'a', 'n', 'n', 'a', 'n']


Counter({'r': 1, 'a': 2, 'n': 3})

In [20]:
txt = "8 times before 11:45 AM"

#Check if the string has any two-digit numbers, from 00 to 59:

x = re.findall('[0-5][0-9]', txt)
x

['11', '45']
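
Three rows of the sets table are not covered by the cells above: negation, letter ranges, and a literal `+`. A minimal sketch follows, where `"1 + 1 = two"` is a hypothetical input chosen to contain digits, symbols, and letters.

```python
import re

txt = "The rain in Spain"

# [^arn] matches every character EXCEPT a, r, and n (spaces included).
others = re.findall(r"[^arn]", txt)
print(others)

# [a-zA-Z] keeps letters only, skipping digits, spaces, and symbols.
letters = re.findall(r"[a-zA-Z]", "1 + 1 = two")
print(letters)       # ['t', 'w', 'o']

# Inside a set, + is not a quantifier; it matches a literal plus sign.
plus_signs = re.findall(r"[+]", "1 + 1 = two")
print(plus_signs)    # ['+']
```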

## The findall() function

The `findall()` function returns a list containing all matches.

In [21]:
txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

['ai', 'ai']


The list contains the matches in the order they are found.

If no matches are found, an empty list is returned.
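
The empty-list case can be checked directly. This sketch searches for a word that does not occur in `txt`; the search term `"Portugal"` is an arbitrary choice for illustration.

```python
import re

txt = "The rain in Spain"

matches = re.findall(r"Portugal", txt)
print(matches)   # []

# An empty list is falsy, so it can be tested directly in a condition.
if not matches:
    print("No match")
```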

## The search() function

The `search()` function searches the string for a match and returns a `Match` object if there is one.
If there is more than one match, only the first occurrence is returned.

In [22]:
x = re.search(r'\s', txt)

print("The first white space character is located in position:", x.start())

The first white space character is located in position: 3


In [23]:
x

<re.Match object; span=(3, 4), match=' '>

### Match object

A Match object contains information about the search and its result.
If there is no match, `None` is returned instead of a Match object.

In [24]:
print(x)

<re.Match object; span=(3, 4), match=' '>


In [25]:
x.string

'The rain in Spain'

In [26]:
x.group()

' '

In [27]:
x.span()

(3, 4)

In [28]:
x.start()

3

In [29]:
x.end()

4
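
Because `search()` returns `None` when nothing matches, calling `.span()` or `.start()` on the result without checking it first raises an `AttributeError`. One way to guard against this is sketched below; `first_match_span` is a hypothetical helper written for this example, not part of the `re` module.

```python
import re

txt = "The rain in Spain"

def first_match_span(pattern, text):
    """Return the (start, end) span of the first match, or None if there is none."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return m.span()

print(first_match_span(r"\s", txt))   # (3, 4)
print(first_match_span(r"\d", txt))   # None
```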

## The split() Function

The `split()` function returns a list where the string has been split at each match:

In [30]:
x = re.split(r'\s', txt)
x

['The', 'rain', 'in', 'Spain']

In [31]:
x = re.split(r'\s', txt, maxsplit=1)
x

['The', 'rain in Spain']
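
Since the separator is a pattern, `split()` can also break a string on several different delimiters at once. The sketch below assumes a made-up `record` string that mixes commas, semicolons, and spaces.

```python
import re

# Hypothetical input with inconsistent separators.
record = "apples, pears;plums  cherries"

# Split on any run of commas, semicolons, or white space.
fields = re.split(r"[,;\s]+", record)
print(fields)   # ['apples', 'pears', 'plums', 'cherries']
```

The `+` after the set treats consecutive separators as one, which avoids empty strings in the result.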

## The sub() Function

The `sub()` function replaces the matches with the text of choice.

In [32]:
x = re.sub(r'\s', '_', txt)
x

'The_rain_in_Spain'
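
Two further options of `sub()` are often useful: the `count` argument limits how many matches are replaced, and the replacement text can refer back to captured groups. A short sketch, reusing the same `txt`; the variable names are introduced here for illustration.

```python
import re

txt = "The rain in Spain"

# count=2 replaces only the first two white space characters.
first_two = re.sub(r"\s", "_", txt, count=2)
print(first_two)   # The_rain_in Spain

# \1 and \2 in the replacement refer to the first and second captured groups,
# so each pair of space-separated words is swapped.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", txt)
print(swapped)     # rain The Spain in
```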