# Regular Expressions

<pre>
B. von Konsky
Data Management ISYS5007
School of Information Systems, Curtin University
</pre>

This is a [Jupyter Notebook](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html) [1] that gives an overview of regular expressions in Python.  It shows how formatted text and Python code can be interleaved in the context of a single document.

Notebooks support [reproducible research](http://biostatistics.oxfordjournals.org/content/10/3/405.full) [2], since code and data can be integrated into a single document and deployed in a variety of formats, including:
* HTML
* PDF
* Presentations

Text that is not code is formatted using [Markdown](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html) [3].

**References**

\[1\] What is a Jupyter Notebook? [http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html)

\[2\] Peng, R. D. (2009). "Reproducible research and Biostatistics." Biostatistics 10(3): 405-408. [http://biostatistics.oxfordjournals.org/content/10/3/405.full](http://biostatistics.oxfordjournals.org/content/10/3/405.full)

\[3\] Jupyter Notebook. Markdown Cells. [http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html)

---

## What are regular expressions?

Special notation used to define patterns in text
* Pattern matching
* Pattern replacement

Data cleaning
* Ensure data is in a consistent format
* Break long items into fields (e.g. an address becomes street, suburb and postcode)

Python regular expressions
* Part of the Python standard library, so there is nothing to install
* Load the module with `import re`
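
The two uses listed above, pattern matching and pattern replacement, can be sketched on a single messy record. The record string here is made up for illustration and does not come from a real data set.

```python
import re

# Hypothetical raw record, used only for illustration.
record = "  12   Example Road,  PERTH   WA   6000 "

# Pattern matching: is there a four-digit postcode at the end of the record?
match = re.search(r'\d{4}\s*$', record)
has_postcode = match is not None

# Pattern replacement: collapse each run of whitespace into a single space,
# then trim the ends so the record is in a consistent format.
cleaned = re.sub(r'\s+', ' ', record).strip()

print(has_postcode)
print(cleaned)
```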

## Special characters in pattern matching

|Special Chars | Meaning                                                     |
|:------------:|-------------------------------------------------------------|
| .            | Matches any character except a newline                      |
| ^            | Matches the beginning of the string                         |
| $            | Matches at the end of the string or before a final newline  |
| *            | Matches zero or more of the preceding regular expression    |
| +            | Matches one or more of the preceding regular expression     |
| {m}          | Matches exactly m copies of the preceding expression        |
| {m,n}        | Matches between m and n copies of the preceding expression  |
| \            | Escapes special characters or introduces special sequences  |
| [ ]          | Used to indicate a set of characters, e.g. [0-9], [A-Z]     |
| a&#124;b     | Matches regular expression a or b                           |
| (...)        | Matches a group that can be referred to by number later     |
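
Each row of the table can be tried out with `re.search`. The following is a small sketch, with sample strings invented for illustration, that pairs one pattern per special character with a string it matches.

```python
import re

# (pattern, sample text) pairs; the sample strings are made up for illustration.
examples = [
    (r'^WA',         'WA 6155'),        # ^     anchors the match at the start
    (r'6155$',       'WA 6155'),        # $     anchors the match at the end
    (r'c.t',         'cat'),            # .     stands for any single character
    (r'ab*',         'abbbc'),          # *     zero or more of the preceding 'b'
    (r'ab+',         'ac abb'),         # +     one or more of the preceding 'b'
    (r'\d{4}',       'Postcode 6000'),  # {m}   exactly four digits
    (r'\d{2,3}',     'x 12345'),        # {m,n} two to three digits (as many as possible)
    (r'\$\d+',       'Cost: $25'),      # \     escapes the special character $
    (r'[A-Z]+',      'perth WA'),       # [ ]   a set of characters
    (r'WA|NSW',      'SYDNEY NSW'),     # a|b   either alternative
    (r'(\d+)/(\d+)', '4/2367'),         # (...) groups, numbered from 1
]

# Show the text each pattern matches in its sample string.
for pattern, sample in examples:
    match = re.search(pattern, sample)
    print(f'{pattern:14} {sample!r:16} -> {match.group()!r}')

results = [re.search(p, s).group() for p, s in examples]
```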

## Shorthand for common character sets

|Shorthand   | Meaning                                                     |
|:----------:|-------------------------------------------------------------|
| \w         | Word characters, equivalent to [a-zA-Z0-9_]                 |
| \W         | Opposite of \w, equivalent to [^a-zA-Z0-9_] (Not in the set)|
| \d         | Equivalent to [0-9]                                         |
| \D         | Equivalent to [^0-9]    (Not in the set 0-9)                |
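
A short sketch of the shorthand sets in action, using `re.findall` on a made-up sample string. One caveat worth knowing: for Python 3 text strings, `\w` and `\d` also match non-ASCII letters and digits by default, so the bracketed equivalents in the table hold exactly only when the `re.ASCII` flag is passed.

```python
import re

sample = "Unit_4, 2367 Hay St!"   # made-up sample text

words      = re.findall(r'\w+', sample)   # runs of word characters
non_words  = re.findall(r'\W+', sample)   # runs of everything else
digits     = re.findall(r'\d+', sample)   # runs of digits
non_digits = re.findall(r'\D+', sample)   # runs of non-digits

print(words)
print(non_words)
print(digits)
print(non_digits)

# By default \w also accepts accented letters; re.ASCII restricts it to [a-zA-Z0-9_].
unicode_words = re.findall(r'\w+', 'café')
ascii_words   = re.findall(r'\w+', 'café', re.ASCII)
print(unicode_words, ascii_words)
```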

## Useful re functions and match object methods

|Method                             | Meaning                                                                     |
|-----------------------------------|-----------------------------------------------------------------------------|
|match = re.search(pattern, string) | Returns a match object when the pattern is in the string and None otherwise |
|match.group()                      | Text matching the pattern                                                   |
|match.group(n)                     | Group n in the matched pattern                                              |
|s = re.sub(pattern, repl, string)  | Returns string with each match of pattern replaced by repl                  |

... And many more
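
The calls in the table can be sketched together as follows. The sample text reuses the email address from the matching example in this notebook. `re.findall` and `re.split` are included as two of the other functions the `re` module provides, and the strings passed to them are illustrative only.

```python
import re

text = "Contact: j.smith2@bigpond.com.au"

# re.search returns a match object, or None when nothing matches.
match   = re.search(r'([\w.]+)@([\w.]+)', text)
missing = re.search(r'\d{4}-\d{4}', text)

if match is not None:
    whole  = match.group()    # entire matched text
    user   = match.group(1)   # first parenthesised group
    domain = match.group(2)   # second parenthesised group

# re.sub returns a new string; the original string is left unchanged.
masked = re.sub(r'[\w.]+@', '***@', text)

# findall lists every match; split cuts the string wherever the pattern matches.
numbers = re.findall(r'\d+', "9266 7278 or X 6104")
parts   = re.split(r'\s*,\s*', "street , suburb,postcode")

print(whole, user, domain, missing)
print(masked)
print(numbers, parts)
```

Testing the result against `None` before calling `group()` avoids an `AttributeError` when the pattern is absent.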
      






## Matching an email address in a string

* **\w** matches word characters (letters, digits and underscore)
* **\\.** is an escaped full stop, which matches a literal full stop
* **[\w\\.]+** matches one or more word characters or full stops

In [1]:
import re

# Match the email address in the string
details="john smith: j.smith2@bigpond.com.au, (08) 9266-0000"
result = re.search(r'[\w\.]+@[\w\.]+', details)
result.group()

'j.smith2@bigpond.com.au'
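
The same pattern can be handed to `re.findall` to collect every address in a longer string. The sketch below uses a made-up contact list. It also shows one limitation of the simple pattern: a full stop that ends a sentence is swallowed into the match. The stricter pattern is only one possible refinement, offered as an assumption rather than a complete email validator.

```python
import re

# Made-up contact list; the second address is a placeholder.
contacts = "j.smith2@bigpond.com.au; a_lee@example.org; no email for Bob"
emails = re.findall(r'[\w.]+@[\w.]+', contacts)
print(emails)

# The simple pattern also captures a trailing full stop at the end of a sentence.
sentence = "Mail me at j.smith2@bigpond.com.au."
loose = re.search(r'[\w.]+@[\w.]+', sentence).group()

# One possible refinement: require each full stop in the domain
# to be followed by at least one word character.
strict = re.search(r'[\w.]+@\w+(?:\.\w+)+', sentence).group()
print(loose)
print(strict)
```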

## Matching address components

In [2]:
# Match components of the address in the string
address  = '1313 Catalano Bay Drive, CANNING VALE WA 6155'
number   = re.search(r'^\s*(\d+)', address)
number.group(1)

'1313'

In [3]:
street   = re.search(r'([a-zA-Z\s]+)', address)
street.group(1)

' Catalano Bay Drive'

In [4]:
suburb   = re.search(r'([a-zA-Z\s]+)\s+WA', address)
suburb.group(1)

' CANNING VALE'

In [5]:
postcode = re.search(r'([0-9]{4})\s*$', address)
postcode.group(1)

'6155'
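
The four searches above can also be combined into one pattern with named groups, which avoids the leading spaces in the street and suburb results. This is a sketch under some assumptions: the address follows the "number street, SUBURB STATE postcode" layout of the example, and the state is any two or three capital letters rather than the literal `WA`. The names `ADDRESS` and `parse_address` are hypothetical helpers, not part of the `re` module.

```python
import re

# re.VERBOSE lets the pattern be laid out over several lines with comments.
ADDRESS = re.compile(r'''
    ^\s*(?P<number>\d+)            # street number
    \s+(?P<street>[a-zA-Z\s]+?)    # street name (lazy, stops at the comma)
    \s*,\s*
    (?P<suburb>[a-zA-Z\s]+?)       # suburb (lazy, stops before the state)
    \s+(?P<state>[A-Z]{2,3})       # state abbreviation, e.g. WA (assumed format)
    \s+(?P<postcode>\d{4})\s*$     # four-digit postcode at the end
''', re.VERBOSE)

def parse_address(address):
    """Return a dict of address fields, or None if the layout does not match."""
    match = ADDRESS.search(address)
    return match.groupdict() if match is not None else None

fields = parse_address('1313 Catalano Bay Drive, CANNING VALE WA 6155')
print(fields)
```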

## Matching addresses with Unit / Number

* Use multiple groups to detect both the unit and the street number
* Look for one or more digits on either side of **‘/’**


In [6]:
# Match the unit number and street number if both exist
address = '4/2367 Hay Street, PERTH WA 6000'
result  = re.search(r'^\s*(\d+)\s*/\s*(\d+)', address)
result.group(1)

'4'

In [7]:
result.group(2)

'2367'

In [8]:
result.groups()

('4', '2367')

## Code to detect both address formats
* Check whether the result of the first regular expression is **not None**
* Fall back to the other regular expression otherwise

In [9]:
def UnitAndNumber(address):
    result  = re.search(r'^\s*(\d+)\s*/\s*(\d+)', address)
    # If result is not None, then this address has the Unit/Number format
    if result is not None:
        unit   = result.group(1)
        number = result.group(2)
    # Otherwise assume the address uses Number Street format
    else:
        unit   = None
        number = re.search(r'^\s*(\d+)', address).group(1)
    return unit, number

In [10]:
address='1313 Catalano Bay Drive, CANNING VALE WA 6155'
UnitAndNumber(address)

(None, '1313')

In [11]:
address='42/1313 Mockingbird Lane'
UnitAndNumber(address)

('42', '1313')
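
An alternative sketch handles both formats with a single pattern by making the unit part an optional non-capturing group. It also returns `(None, None)` when the address does not start with a number, a case in which the function above would raise an `AttributeError`. The names `UNIT_NUMBER` and `unit_and_number` are hypothetical and chosen only for this example.

```python
import re

# (?: ... )? makes the "unit /" prefix optional without adding a numbered group,
# so group 1 is the unit (or None) and group 2 is the street number.
UNIT_NUMBER = re.compile(r'^\s*(?:(\d+)\s*/\s*)?(\d+)')

def unit_and_number(address):
    match = UNIT_NUMBER.search(address)
    if match is None:
        # No leading number at all
        return None, None
    return match.group(1), match.group(2)

print(unit_and_number('1313 Catalano Bay Drive, CANNING VALE WA 6155'))
print(unit_and_number('42/1313 Mockingbird Lane'))
print(unit_and_number('Hay Street, PERTH WA 6000'))
```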

## Substitutions using groups

* result = re.sub(pattern, replacement, string)
* Replacement can refer to groups in the pattern by number
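
As a minimal sketch of group references in the replacement, `\1`, `\2` and so on stand for the text captured by the corresponding group, so the captured parts can be reordered. The name and date strings below are invented for illustration.

```python
import re

# Swap "Surname, Given" into "Given Surname" by reordering groups 1 and 2.
name = re.sub(r'(\w+),\s*(\w+)', r'\2 \1', 'Smith, John')

# Rewrite a day/month/year date as year-month-day using three groups.
date = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\2-\1', 'Due 25/12/2016')

print(name)
print(date)
```

Writing the replacement as a raw string (`r'...'`) keeps Python from interpreting `\1` as an escape sequence before `re.sub` sees it.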

In [12]:
text = "Telephone: 9266 7278 or       X 6104"
# Replace 'or' surrounded by one or more whitespace characters with a comma
st = re.sub(r'\s+or\s+', ', ', text)
st

'Telephone: 9266 7278, X 6104'

In [13]:
# Replace "X" for extension with a known prefix
st = re.sub(r'[xX]\s*([0-9]{4})', r'9266 \1', text)
st

'Telephone: 9266 7278 or       9266 6104'

In [14]:
# Replace hyphen surrounded by zero or more spaces with a single space
st = "Telephone: 9266    -  7279"
st = re.sub(r'([0-9]{4})\s*-*\s*([0-9]{4})', r'\1 \2', st)
st

'Telephone: 9266 7279'
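
The three substitutions can be chained into one cleaning step. The function below is a sketch: the name `clean_phone_text` and the `prefix` parameter are hypothetical, and the default prefix is the `9266` used in the cell above.

```python
import re

def clean_phone_text(text, prefix='9266'):
    # 1. Expand "X 6104" style extensions using the known prefix
    #    (prefix is assumed to contain digits only).
    text = re.sub(r'[xX]\s*([0-9]{4})', prefix + r' \1', text)
    # 2. Join the two halves of each number with a single space.
    text = re.sub(r'([0-9]{4})\s*-*\s*([0-9]{4})', r'\1 \2', text)
    # 3. Replace 'or' surrounded by whitespace with a comma.
    text = re.sub(r'\s+or\s+', ', ', text)
    return text

cleaned = clean_phone_text("Telephone: 9266 - 7278 or       X 6104")
print(cleaned)
```

The order matters here: extensions are expanded first so that the second substitution can tidy the spacing of every number, including the newly expanded one.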