# Regular Expressions

In the last session we tried to interpret strings as valid heights and weights. This involved looking for text such as "meter" or "kilogram" in the string, and then extracting the number. This process is called pattern matching, and is best undertaken using a regular expression.

Regular expressions have a long history and are available in most programming languages. Python provides them through the standard library module `re`, whose syntax is similar to that found in Perl.

In [1]:
import re

In [2]:
help(re)

Help on module re:

NAME
    re - Support for regular expressions (RE).

MODULE REFERENCE
    https://docs.python.org/3.5/library/re
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.
    
    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last mat

Let's create a string that contains a height and see if we can use a regular expression to match that...

In [3]:
h = "2 meters"

To search for the text "meters" within a string, use `re.search`, e.g.

In [4]:
if re.search("meters", h):
    print("String contains 'meters'")
else:
    print("No match")

String contains 'meters'


`re.search` returns a match object if there is a match, or `None` if there isn't.

In [5]:
m = re.search("meters", h)

In [6]:
m

<_sre.SRE_Match object; span=(2, 8), match='meters'>
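The match object records where the match was found and what text was matched. As a small sketch (the variable names are only illustrative), the `span`, `start`, `end` and `group` methods expose that information:

```python
import re

h = "2 meters"
m = re.search("meters", h)

# span() gives the (start, end) indices of the match within h
print(m.span())               # (2, 8)

# group() with no arguments returns the matched text itself
print(m.group())              # meters

# Slicing the original string with those indices recovers the same text
print(h[m.start():m.end()])   # meters
```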

This matches "meters", but what about "meter"? "meter" is "meters" without the "s". You can allow the "s" to be missing by following it with "*"

In [7]:
h = "2 meter"

In [8]:
m = re.search("meter(s*)", h)

In [9]:
m

<_sre.SRE_Match object; span=(2, 7), match='meter'>

However, "*" means "match 0 or more times", so this pattern would accept "meter", "meters" and "metersssssss". To match the "s" 0 or 1 times, we need to use "{0,1}" (which can also be written as "?")

In [10]:
h = "2 meter"

In [11]:
m = re.search("meters{0,1}", h)

In [12]:
m

<_sre.SRE_Match object; span=(2, 7), match='meter'>
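As a quick check, the sketch below compares "{0,1}" with its shorthand "?" on both spellings. It also shows that, because `re.search` looks anywhere in the string, trailing letters after the match are not rejected.

```python
import re

# "?" is shorthand for "{0,1}": the preceding character is optional
for text in ["2 meter", "2 meters"]:
    long_form = re.search("meters{0,1}", text)
    short_form = re.search("meters?", text)
    print(text, "->", long_form.group(), short_form.group())

# re.search is happy to stop matching before the string ends,
# so the extra "s" characters are simply left unmatched
extra = re.search("meters?", "2 metersssss")
print(extra.group())   # meters
```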

However, this is still not enough, as `re.search` will find "meter" or "meters" anywhere in the string, including in the middle. We need to match the unit only at the end of the string. We do this using "$", which means "match at the end of the string"

In [13]:
m = re.search("meters{0,1}$", h)

In [14]:
m

<_sre.SRE_Match object; span=(2, 7), match='meter'>
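To see what "$" changes, the sketch below tries the anchored pattern on strings where the unit is followed by extra text. The test strings are made up for illustration.

```python
import re

pattern = "meters{0,1}$"

# The unit is the last thing in the string, so this matches
print(re.search(pattern, "2 meters") is not None)        # True

# Here the unit is followed by more text, so "$" prevents a match
print(re.search(pattern, "2 meters tall") is not None)   # False
print(re.search(pattern, "2 metersss") is not None)      # False
```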

We also want to be able to match the string "X m". To do this, we need to use the "or" operator, which is "|". It is a good idea to use round brackets to make both sides of the "or" statement clear.

In [15]:
h = "2 m"

In [16]:
m = re.search("(m$)|(meters{0,1}$)", h)

In [17]:
m

<_sre.SRE_Match object; span=(2, 3), match='m'>
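The sketch below runs the "or" pattern over a few made-up strings. Note the last case: on its own, "m$" is satisfied by any string that ends in "m", which is one reason the unit needs to be tied to the number in front of it.

```python
import re

pattern = "(m$)|(meters{0,1}$)"

results = {}
for text in ["2 m", "2 meter", "2 meters", "2 miles", "2 cm"]:
    m = re.search(pattern, text)
    # Store the matched text, or None if nothing matched
    results[text] = m.group() if m else None
    print(repr(text), "->", results[text])

# "2 miles" does not match, but "2 cm" does, because it ends in "m"
```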

For this to be a valid height, we need to match "X meters", where "X" is a number. You can use "\d" to represent any single digit. For example

In [18]:
h = "2 meters"

In [19]:
# The r prefix makes this a raw string, so the backslash in "\d" reaches re unchanged
m = re.search(r"(\d) ((m$)|(meters{0,1}$))", h)

In [20]:
m

<_sre.SRE_Match object; span=(0, 8), match='2 meters'>

A problem with the above example is that it only matches a number with a single digit, as "\d" only matches a single digit. To match one or more digits, we need to put a "+" afterwards, as this means "match one or more", e.g.

In [21]:
h = "1.8 meters"

In [22]:
m = re.search(r"(\d+) ((m$)|(meters{0,1}$))", h)

In [23]:
m

<_sre.SRE_Match object; span=(2, 10), match='8 meters'>
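The difference between "\d" and "\d+" is easiest to see with a number that has more than one digit. This sketch uses raw strings (the `r` prefix), which is my choice here so that Python passes the backslash through to `re` untouched. The last line repeats the decimal case from the cell above.

```python
import re

h = "12 meters"

# A single \d can only take the "2", so the leading "1" is left out
one_digit = re.search(r"(\d) ((m$)|(meters{0,1}$))", h)
print(one_digit.group())     # 2 meters

# \d+ takes as many digits as it can find
many_digits = re.search(r"(\d+) ((m$)|(meters{0,1}$))", h)
print(many_digits.group())   # 12 meters

# A decimal point is not a digit, so only the part after it is matched
decimal = re.search(r"(\d+) ((m$)|(meters{0,1}$))", "1.8 meters")
print(decimal.group())       # 8 meters
```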

This match breaks if the number has a decimal point: "." is not a digit, so "\d+" cannot match it and only "8 meters" is found. To match a decimal point, you need to use "\." (a plain "." would match any character), followed by "*", which means "match 0 or more"

In [24]:
h = "1 meters"

In [25]:
m = re.search(r"(\d+\.*\d*) ((m$)|(meters{0,1}$))", h)

In [26]:
m

<_sre.SRE_Match object; span=(0, 8), match='1 meters'>
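One thing to be aware of is that "\.*" allows any number of decimal points, so the number pattern also accepts text such as "1..8". The sketch below compares it with a stricter variant that I am suggesting as an assumption, not something this tutorial uses elsewhere. It relies on `re.fullmatch` (which requires the whole string to match) and on "(?:...)", which groups without capturing.

```python
import re

loose = r"\d+\.*\d*"          # the number pattern used above
strict = r"\d+(?:\.\d+)?"     # hypothetical alternative: at most one ".", followed by digits

accepted = {}
for text in ["1", "1.8", "1.", "1..8"]:
    accepted[text] = (
        re.fullmatch(loose, text) is not None,
        re.fullmatch(strict, text) is not None,
    )
    print(repr(text), "loose:", accepted[text][0], "strict:", accepted[text][1])
```

The stricter form also rejects "1.", which `float` would accept, so which of the two is preferable depends on how forgiving you want the parser to be.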

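The number must match at the beginning of the string. We use "^" to mean "match at the start of the string". In the example below there is no match, so `m` is `None` and nothing is printed...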

In [27]:
h = "some 1.8 meters"

In [28]:
m = re.search(r"^(\d+\.*\d*) ((m$)|(meters{0,1}$))", h)

In [29]:
m
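Because a failed search returns `None`, the cell above shows no output. A minimal sketch that makes the result explicit, using the same pattern written as a raw string:

```python
import re

pattern = r"^(\d+\.*\d*) ((m$)|(meters{0,1}$))"

# Text before the number means "^" cannot be satisfied
rejected = re.search(pattern, "some 1.8 meters")
print(rejected is None)    # True

# With the number at the very start, the whole string is matched
accepted = re.search(pattern, "1.8 meters")
print(accepted.group())    # 1.8 meters
```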

Finally, we want this match to be case insensitive, and would like the user to be free to use as many spaces as they want between the number and the unit, before the number or after the unit... To do this we use "\s*" to represent any amount of whitespace (spaces, tabs or newlines, including none at all), and match using `re.IGNORECASE`.

In [30]:
h = "   1.8 METers   "

In [31]:
m = re.search(r"^\s*(\d+\.*\d*)\s*((m)|(meters{0,1}))\s*$", h, re.IGNORECASE)

In [32]:
m

<_sre.SRE_Match object; span=(0, 16), match='   1.8 METers   '>
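To get a feel for how forgiving the final pattern is, the sketch below tries it on a handful of made-up inputs, including one with no space at all and one with a tab and a newline.

```python
import re

pattern = r"^\s*(\d+\.*\d*)\s*((m)|(meters{0,1}))\s*$"

matched = {}
for text in ["1.8m", "1.8 M", "\t1.8   Meters\n", "1.8 metres"]:
    matched[text] = re.search(pattern, text, re.IGNORECASE) is not None
    print(repr(text), "->", matched[text])

# Only the last string fails: "metres" is not one of the spellings in the pattern
```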

The round brackets do more than just separate the parts of your search. They also allow you to extract the parts that match.

In [33]:
m.groups()

('1.8', 'METers', None, 'METers')
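Groups are numbered by the position of their opening bracket, counting from 1, and `m.group(n)` returns a single group. Group 0 is the whole match. The sketch below walks through the four groups of the pattern used above.

```python
import re

m = re.search(r"^\s*(\d+\.*\d*)\s*((m)|(meters{0,1}))\s*$",
              "   1.8 METers   ", re.IGNORECASE)

print(repr(m.group(0)))   # the whole match, including the surrounding spaces
print(m.group(1))         # 1.8     (the number)
print(m.group(2))         # METers  (the outer brackets around the unit)
print(m.group(3))         # None    (the "(m)" alternative was not the one used)
print(m.group(4))         # METers  (the "(meters{0,1})" alternative)

# m.group(1) and m.groups()[0] refer to the same group
print(m.group(1) == m.groups()[0])   # True
```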

`m.groups()[0]` contains the text matched by the first set of round brackets, which is the number. This enables us to rewrite the `string_to_height` function from the last section as:

In [34]:
def string_to_height(height):
    """Parse the passed string as a height. Valid formats are 'X m', 'X meters' etc."""
    # Raw string (r"...") so that the backslashes reach the re module unchanged
    m = re.search(r"^\s*(\d+\.*\d*)\s*((m)|(meters{0,1}))\s*$", height, re.IGNORECASE)

    if m:
        # The first set of round brackets holds the number
        return float(m.groups()[0])
    else:
        raise TypeError("Cannot extract a valid height from '%s'" % height)

In [35]:
h = string_to_height("   2.5    meters   ")

In [36]:
h

2.5
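It is worth checking the function against invalid input as well as valid input. The sketch below repeats the function so that it can be run on its own, then tries a few made-up strings and reports either the parsed height or the error.

```python
import re

def string_to_height(height):
    """Parse the passed string as a height. Valid formats are 'X m', 'X meters' etc."""
    m = re.search(r"^\s*(\d+\.*\d*)\s*((m)|(meters{0,1}))\s*$", height, re.IGNORECASE)

    if m:
        return float(m.groups()[0])
    else:
        raise TypeError("Cannot extract a valid height from '%s'" % height)

# The first three strings should parse; the last three should be rejected
for text in ["2 m", "1.8 METERS", "  3meter ", "tall", "1.8 feet", "-2 m"]:
    try:
        print(repr(text), "->", string_to_height(text))
    except TypeError as e:
        print(repr(text), "-> error:", e)
```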

# Exercise

## Exercise 1

Rewrite your `string_to_weight` function using regular expressions. Check that it responds correctly to a range of valid and invalid weights.

## Exercise 2

Update `string_to_height` so that it can also understand heights in centimeters, and update `string_to_weight` so that it can also understand weights in grams. Note that you may find it easier to separate the number from the units. You can do this using "\w" to match any word character (a letter, a digit or an underscore), and the code below

In [None]:
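def get_number_and_unit(s):
    """Split a string such as '180 cm' into a (number, unit) tuple."""
    # Group 1 is the number, group 2 is the unit (one or more word characters)
    m = re.search(r"^\s*(\d+\.*\d*)\s*(\w+)\s*$", s, re.IGNORECASE)

    if m:
        number = float(m.groups()[0])
        unit = m.groups()[1].lower()
        return (number, unit)
    else:
        raise TypeError("Cannot extract a valid 'number unit' from '%s'" % s)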
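As a usage sketch, the block below repeats the function so that it can be run on its own and calls it on a few made-up strings. The last call shows a subtlety worth knowing about when you build on it: "\w" also matches digits, so a bare number with no unit is still accepted, with its final digit treated as the unit. Matching the unit with letters only (for example "[a-z]+" together with `re.IGNORECASE`) is one possible way to avoid this, offered here as a suggestion and not as part of the exercise.

```python
import re

def get_number_and_unit(s):
    """Split a string such as '180 cm' into a (number, unit) tuple."""
    m = re.search(r"^\s*(\d+\.*\d*)\s*(\w+)\s*$", s, re.IGNORECASE)

    if m:
        number = float(m.groups()[0])
        unit = m.groups()[1].lower()
        return (number, unit)
    else:
        raise TypeError("Cannot extract a valid 'number unit' from '%s'" % s)

print(get_number_and_unit("180 cm"))    # (180.0, 'cm')
print(get_number_and_unit(" 75 KG "))   # (75.0, 'kg')

# No unit given: the regular expression backtracks and uses "5" as the unit
print(get_number_and_unit("75"))        # (7.0, '5')
```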