# Regular expressions

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.

To a large extent, writing regular expressions (REs) is language independent: the pattern syntax is almost a language of its own.

Regular expressions are widely used in language processing.

# Simple patterns

- To start with we will use <a href="http://regexr.com/" target="_blank">http://regexr.com/</a> to test our regular expressions


- most characters will simply match themselves. The regular expression ``test`` will match the string ``test`` (by default the match is case sensitive)


- the meta-character ``.`` (dot) will match any **single** character except the new line character

- the meta-characters ``[`` and ``]`` will indicate a set of characters to match
     - can either enumerate the characters individually ``[abcd]`` 
     - can indicate a range ``[a-d]``
     - most meta-characters listed inside ``[`` and ``]`` lose their special nature and are treated as simple characters, e.g. ``[ab.]`` matches ``a``, ``b`` or ``.`` (the exceptions are ``\``, ``]``, ``-`` between two characters, and ``^`` in first position)
     - ``^`` will indicate which characters not to match if it appears first after ``[`` e.g. ``[^a-c]`` will match anything but ``a``, ``b`` or ``c``. 
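
The rules above can be tried out directly in code. This is a minimal sketch that assumes Python's ``re`` module (introduced later in these notes); the sample strings are made up for illustration.

```python
import re

# '.' matches any single character except a newline,
# so "t\nst" is not matched here
dot_matches = re.findall(r"t.st", "test tost t\nst")
print(dot_matches)

# A set can enumerate characters or give a range; both forms are equivalent here
set_matches = re.findall(r"[abcd]", "a big red cat")
range_matches = re.findall(r"[a-d]", "a big red cat")
print(set_matches, range_matches)

# Inside a set, '.' is just a literal dot
literal_dot = re.findall(r"[ab.]", "a.b.c")
print(literal_dot)

# '^' in first position negates the set
negated = re.findall(r"[^a-c]", "abcdef")
print(negated)
```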

# Special sequences

- ``\d`` Matches any decimal digit; this is equivalent to the class ``[0-9]``.

- ``\D`` Matches any non-digit character; this is equivalent to the class ``[^0-9]``.

- ``\s`` Matches any whitespace character; this is equivalent to the class ``[ \t\n\r\f\v]``.

- ``\S`` Matches any non-whitespace character; this is equivalent to the class ``[^ \t\n\r\f\v]``.

- ``\w`` Matches any alphanumeric character; this is equivalent to the class ``[a-zA-Z0-9_]``.

- ``\W`` Matches any non-alphanumeric character; this is equivalent to the class ``[^a-zA-Z0-9_]``.

These sequences can be included inside a character class. For example, ``[\s,.]`` will match any whitespace character, or ``,`` or ``.``.
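
A short sketch of the special sequences in action, assuming Python's ``re`` module and an invented sample string. Note that for Python 3 text strings ``\d``, ``\s`` and ``\w`` also match non-ASCII digits, spaces and letters unless the ``re.ASCII`` flag is given, so the equivalences above hold exactly only for ASCII input.

```python
import re

text = "Room 42, floor 3\tbuilding B"

digits = re.findall(r"\d", text)          # single decimal digits
non_space_runs = re.findall(r"\S+", text) # runs of non-whitespace characters
words = re.findall(r"\w+", text)          # runs of alphanumeric characters or '_'
print(digits)
print(non_space_runs)
print(words)

# A special sequence used inside a character class:
# split on any run of whitespace, commas or dots
parts = re.split(r"[\s,.]+", "a, b.c  d")
print(parts)
```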

# Repeating things

- ``*`` repeats an expression 0 or more times e.g. ``a*`` matches a sequence of 0 or more letters a

- ``+`` repeats an expression 1 or more times e.g. ``a+`` matches a sequence of 1 or many letters a

- ``?`` repeats an expression 0 or 1 times. Indicates something optional. e.g. ``home-?brew`` matches either ``homebrew`` or ``home-brew``.

- ``{m,n}`` where ``m`` and ``n`` are integers repeats an expression at least ``m`` times and at most ``n`` times. If ``m`` is missing it is considered 0. If ``n`` is missing it is considered unlimited.

- ``|`` is the *or* operator. It has very low priority. ``tea|coffee`` will match *tea* or *coffee*, but not *te*, followed by *a* or *c*, followed by *offee*. 
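
The repetition operators and ``|`` can be checked with a few calls. This is a sketch assuming Python's ``re`` module; ``re.fullmatch`` (which requires the whole string to match) and the test strings are choices made here for illustration.

```python
import re

# '*' accepts zero repetitions, '+' needs at least one
star_empty = re.fullmatch(r"a*", "") is not None
plus_empty = re.fullmatch(r"a+", "") is not None
print(star_empty, plus_empty)

# '?' makes the hyphen optional (but does not allow two hyphens)
brews = re.findall(r"home-?brew", "homebrew home-brew home--brew")
print(brews)

# {m,n}: between 2 and 4 digits, delimited by word boundaries
numbers = re.findall(r"\b\d{2,4}\b", "1 22 333 4444 55555")
print(numbers)

# Missing n means "no upper limit"; missing m means 0
at_least_two = re.fullmatch(r"a{2,}", "aaaaa") is not None
at_most_two = re.fullmatch(r"a{,2}", "aaa") is not None
print(at_least_two, at_most_two)

# '|' has very low priority: the alternatives are the whole words
drinks = re.findall(r"tea|coffee", "tea, coffee, teacoffee")
print(drinks)

# Parentheses restrict the alternation to a single position
grouped = re.fullmatch(r"te(a|c)offee", "teaoffee") is not None
print(grouped)
```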

# Boundaries

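- ``^`` matches the beginning of the line (in Python, the beginning of the string unless ``re.MULTILINE`` is set)
- ``$`` matches the end of the line (in Python, the end of the string unless ``re.MULTILINE`` is set)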
- ``\b`` word boundary, where words are defined as a sequence of alphanumeric characters. It is a zero-width assertion (i.e. no actual character is matched)
- ``\B`` negation of ``\b``: the current position is not a word boundary
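
A small sketch of the boundary assertions, assuming Python's ``re`` module and an invented two-line string. In Python ``^`` and ``$`` anchor to the whole string by default, so the example uses the ``re.MULTILINE`` flag to get per-line behaviour.

```python
import re

text = "cat concat catalog\ncat"

# \b on both sides: only the standalone word "cat"
whole_words = re.findall(r"\bcat\b", text)
print(whole_words)

# \B on the left: "cat" must continue a word, as in "concat"
word_endings = re.findall(r"\Bcat\b", text)
print(word_endings)

# Without MULTILINE, '^' only matches at the start of the string
string_start = re.findall(r"^cat", text)
# With MULTILINE, '^' also matches right after each newline
line_starts = re.findall(r"^cat", text, re.MULTILINE)
print(string_start, line_starts)
```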

# Groups

- Groups are marked by ``(`` and ``)``. They let you set the priority of matching and retrieve specific parts of the matched string.
- It is possible to refer to a group by using ``\1``, ``\2``. **Note**: counting starts from **1** and you need to count the number of ``(`` opened. 
- It is possible to name a group by using ``(?P<name>...)``
- It is possible to refer to a named group by using ``(?P=name)``
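
The group features above can be sketched as follows, assuming Python's ``re`` module (the ``(?P<name>...)`` syntax is the one Python uses). The sample strings and the group name ``ch`` are made up for illustration.

```python
import re

# \1 refers back to whatever the first group matched: find a doubled word
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
print(doubled.group(), doubled.group(1))

# The same idea with a named group and a named back-reference
double_letter = re.search(r"(?P<ch>[a-z])(?P=ch)", "mississippi")
print(double_letter.group(), double_letter.group("ch"))

# Groups are numbered by the position of their opening parenthesis
nested = re.match(r"((a)(b))", "ab")
print(nested.groups())
```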

# Exercises with regular expressions

- write a regular expression that:
    - matches an email address from DCU
    - checks whether a string is an integer
    - identifies the domain name and the path from a URL
    - identifies a date (the format of a date is deliberately not specified at this stage. We are going to do this again later on)
    - checks whether a string is a valid IP address
    
- play with <a href="https://alf.nu/RegexGolf">https://alf.nu/RegexGolf</a>
    

# Using regular expressions in Python

In order to use regular expressions in Python you need to import the ``re`` module.

A regular expression is usually compiled first:

```python
import re
p = re.compile("ab+c")
```

The function returns a pattern object, which is used later for matching.

It is possible to control the behaviour of matching using flags, specified as a parameter of compile:
- **re.I** or **re.IGNORECASE**: performs case insensitive matching
- **re.S** or **re.DOTALL**: Makes the ``.`` special character match any character at all, including a newline; without this flag, ``.`` will match anything except a newline.
- **re.M** or **re.MULTILINE**: controls how ``^`` and ``$`` behave when the string to match contains several lines: they also match at the beginning and end of each line, not only of the whole string.
- **re.X** or **re.VERBOSE**: allows you to write more readable regular expressions. Whitespace outside ``[]`` is ignored and ``#`` can be used for comments.
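
The effect of each flag can be sketched as below. The patterns and test strings are invented for illustration; several flags can be combined with ``|``, e.g. ``re.I | re.M``.

```python
import re

# re.IGNORECASE: letter case is ignored
ignore_case = re.match(r"test", "TEST", re.IGNORECASE) is not None

# re.DOTALL: '.' also matches a newline
dot_default = re.match(r"a.b", "a\nb") is not None
dot_all = re.match(r"a.b", "a\nb", re.DOTALL) is not None

# re.MULTILINE: '^' matches at the start of every line
first_words = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

# re.VERBOSE: whitespace is ignored and '#' starts a comment
decimal = re.compile(r"""
    \d+   # integer part
    \.    # decimal point
    \d+   # fractional part
""", re.VERBOSE)
decimals = decimal.findall("pi is 3.14, e is 2.718")

print(ignore_case, dot_default, dot_all)
print(first_words)
print(decimals)
```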

# Matching in a string

- ``match()``: determines if the RE matches at the beginning of the string
- ``search()``: scans through a string looking for any location where the RE matches

Both functions return ``None`` if no match can be found. If successful a *match object* is returned. 

- ``findall()``: finds all substrings where RE matches, and returns them as a list

The methods can be applied to a pattern or they can be used as ``re.<function>(<pattern>, <string>, <flags>)``


In [1]:
import re

number_pattern = re.compile("[0-9]+")
print(number_pattern.match("abcd"))

None


In [2]:
print(number_pattern.match("1234"))

<_sre.SRE_Match object; span=(0, 4), match='1234'>


In [3]:
m = number_pattern.match("1234")
print("Match:", m.group(), " Start:", m.start(), " End:", m.end(), sep="")

Match:1234 Start:0 End:4


In [4]:
re.findall("(([a-z])\\2)", "mississippi")

[('ss', 's'), ('ss', 's'), ('pp', 'p')]
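
The cells above use ``match()`` and ``findall()``; the difference between ``match()`` and ``search()`` can be sketched as follows. The name ``digits_re`` and the sample text are hypothetical, chosen only for this example.

```python
import re

digits_re = re.compile(r"[0-9]+")
text = "abcd 1234 efgh 56"

# match() only tries at the beginning of the string, so it fails here
at_start = digits_re.match(text)
print(at_start)

# search() scans forward and returns the first match it finds
first = digits_re.search(text)
print(first.group(), first.span())

# findall() returns every non-overlapping match as a list of strings
all_numbers = digits_re.findall(text)
print(all_numbers)

# The module-level form takes the pattern as its first argument
same_first = re.search(r"[0-9]+", text)
print(same_first.group())
```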

# Match objects

- returned by ``match()`` and ``search()``
- contain information about matching:
    - ``group()`` returns the full match
    - ``group(<group no/name>)`` gives access to a specific group by number or name; an ``IndexError`` is raised if the group does not exist

In [5]:
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")

In [6]:
m.group('first_name')

'Malcom'

In [7]:
m.group('last_name')

'Reynolds'

In [8]:
m.group(1)

'Malcom'

In [9]:
m.group(2)

'Reynolds'
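
Besides ``group()``, match objects offer a few more accessors. The sketch below reuses the pattern from the cells above and shows ``groups()``, ``groupdict()`` and ``span()``, plus what happens when a group does not exist.

```python
import re

m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")

# All groups at once, as a tuple or as a dictionary keyed by group name
all_groups = m.groups()
by_name = m.groupdict()
print(all_groups)
print(by_name)

# Start and end offsets of a single group
last_name_span = m.span("last_name")
print(last_name_span)

# Asking for a group that does not exist raises IndexError
try:
    m.group(3)
    missing_group_error = False
except IndexError:
    missing_group_error = True
print(missing_group_error)
```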

# Functions to modify strings

- ``re.split(<pattern>, <string>)`` splits a string using a pattern
- ``re.sub(<pattern>, <repl>, <string>[, <count>])`` substitutes matches of ``<pattern>`` in ``<string>`` with ``<repl>``

In [10]:
# split a string into words by using non alphanumeric characters as separators
print(re.split('\\W+', 'This is a test, short and sweet, of split().'))

['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']


In [11]:
# creating a group also returns the separators
print(re.split('(\\W+)', 'This is a test, short and sweet, of split().'))

['This', ' ', 'is', ' ', 'a', ' ', 'test', ', ', 'short', ' ', 'and', ' ', 'sweet', ', ', 'of', ' ', 'split', '().', '']
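
The cells above cover ``re.split()``; a comparable sketch for ``re.sub()`` follows. The input strings are invented for illustration.

```python
import re

# Collapse every run of whitespace into a single space
collapsed = re.sub(r"\s+", " ", "too    many   spaces")
print(collapsed)

# count limits the number of substitutions (here: the first 3 digits only)
masked = re.sub(r"\d", "#", "call 555-1234", count=3)
print(masked)

# The replacement can refer to groups of the match with \1, \2, ...
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "Malcom Reynolds")
print(swapped)
```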


# Greedy match vs Non-Greedy

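By default matching is greedy: ``*``, ``+``, ``?`` and ``{m,n}`` match as much text as possible. Adding ``?`` after the quantifier (``*?``, ``+?``, ``??``, ``{m,n}?``) makes it non-greedy, so it matches as little as possible.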

In [12]:
s = '<html><head><title>Title</title>'

In [13]:
print(re.match('<.*>', s).group())

<html><head><title>Title</title>


In [14]:
# non-greedy search
print(re.match('<.*?>', s).group())

<html>
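
The difference is even more visible with ``findall()``. The sketch below uses the same string as the cells above; the last pattern, a negated character class, is an alternative added here for comparison rather than something taken from the notes.

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: '.*' runs to the last '>' so there is a single, long match
greedy = re.findall(r"<.*>", s)
print(greedy)

# Non-greedy: '.*?' stops at the first '>' each time
lazy = re.findall(r"<.*?>", s)
print(lazy)

# Alternative without '?': forbid '>' inside the tag
negated = re.findall(r"<[^>]*>", s)
print(negated)
```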


# Raw strings

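If we need to match a ``\`` in the text, our RE gets complicated: the backslash is an escape character both in Python string literals and in regular expressions, so it has to be escaped twice.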

In [15]:
s = "\section"
print(re.match("\section", s))

None


In [16]:
print(re.match("\\section", s))

None


In [17]:
print(re.match("\\\\section", s))

<_sre.SRE_Match object; span=(0, 8), match='\\section'>


In [18]:
print(re.match(r"\\section", s))

<_sre.SRE_Match object; span=(0, 8), match='\\section'>


In [19]:
# but not this form
print(re.match(r"\section", s))

None


Regular expressions are usually written in Python using raw string notation.

It makes the expressions easier to read: ``"\\d+\\.\\d+"`` vs. ``r"\d+\.\d+"``
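
A quick check that the two notations describe the same pattern, sketched with a decimal-number pattern and invented test strings:

```python
import re

# The same pattern written with doubled backslashes and as a raw string
escaped = "\\d+\\.\\d+"
raw = r"\d+\.\d+"
same_pattern = escaped == raw
print(same_pattern)

decimals = re.findall(raw, "versions 3.11 and 3.12")
print(decimals)

# Matching a literal backslash: the pattern needs two backslashes,
# which a raw string lets us write directly
text = "\\section{Intro}"  # the text itself is: \section{Intro}
m = re.match(r"\\section", text)
print(m.group())
```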

# Exercises

- Write a program that converts Roman numerals to Arabic numerals using regular expressions
- Write as many regular expressions as you can to capture temporal expressions in the given list of questions

# Further reading

- Regular Expression HOWTO: <a href="https://docs.python.org/3/howto/regex.html" target="_blank">https://docs.python.org/3/howto/regex.html</a>
- RE module documentation: <a href="https://docs.python.org/3/library/re.html" target="_blank">https://docs.python.org/3/library/re.html</a>
- RE examples: <a href="https://www.tutorialspoint.com/python/python_reg_expressions.htm" target="_blank">https://www.tutorialspoint.com/python/python_reg_expressions.htm</a>
- Regex Crossword: <a href="https://regexcrossword.com/" target="_blank">https://regexcrossword.com/</a>
- Language independent tutorial about regular expressions: <a href="https://github.com/zeeshanu/learn-regex">https://github.com/zeeshanu/learn-regex</a>