# X1. Regular Expressions

[Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html#regex-howto)

[Module Documentation](https://docs.python.org/3.7/library/re.html)

## Topics

* Introduction to Regular Expressions

* Pattern Types

    * Matching Characters

    * The Dot `.`
    
    * Set of Characters: `[…]`
    
    * Complements: `[^…]`
    
    * Repetition Qualifiers
    
    * Anchors
    
    * Alternation
    
    * Extracting Parts
    
    * Iterating Over Matches
    
    * Flags
    
* Practical Examples

# Introduction to Regular Expressions

* Highly specialized programming language

* Specify rules (pattern) for strings you want to match

    * Process only matching strings
    
    * Process parts of matching strings
    
* Examples    

    * `\d+`: a series of digits
    
    * `[a-z]\d+`: a lower case letter followed by some digits
    
    * `.+\.\w+`: one or more characters other than newline, followed by a full stop and one or more word characters (letters, digits or underscore).

## General Approach to Process Text File

```
for each line in a file:

    does the line match the pattern
    
    if it does, do something
```

## Example

* Have a file `names.txt`:

    ```
    Alice
    Bob
    Charlotte
    Derek
    Ermintrude
    Fred
    Freda
    Frederick
    Felicity
    ```
    
* Want to select all `Fred`'s, i.e. lines that contain `Fred`

* In Unix:
    
    * `grep Fred names.txt` (grep file names.txt)

In [2]:
import re                       # for regular expressions

NAMES = 'X1_Files/names.txt'    # file from which we read

pattern = r'Fred'               # define pattern
patobj = re.compile(pattern)    # setup regular expression

with open(NAMES) as names:
    for name in names:
        result = patobj.search(name)    # use pattern object
        if result:                      # see what we got
            print(name, end='')



### Pieces of the Puzzle

* `pattern = r'...'`: pattern defined as raw string

    * Some patterns are of the form `\c`
    
    * Raw string stops Python from converting `\c` into a special character (e.g. `\b` into a backspace), so the backslash reaches the regular expression engine unchanged

* `patobj = re.compile(pattern)`

    * `re.compile(pattern)`:
    
        * Function that compiles given `pattern`
        
        * Returns a pattern object

* `result = patobj.search(name)`

    * `patobj.search(string)`:
    
        * Method of pattern object: scans given string for a match to the pattern
        
        * Returns a match object, or None if no match was found
        
    * Alternative (but less general): `patobj.match(string)`
    
        * Method of pattern object: tries to apply pattern at start of given string
        
        * Returns a match object, or None if no match was found           
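
The difference between `search` and `match` can be sketched as follows. The sample string is made up for illustration and is not taken from `names.txt`.

```python
import re

patobj = re.compile(r'Fred')

# hypothetical sample line, chosen so that 'Fred' is not at the start
line = 'Manfred and Fred'

found_anywhere = patobj.search(line)   # scans the whole string
found_at_start = patobj.match(line)    # only tries position 0

print(found_anywhere is not None)      # True
print(found_at_start is None)          # True
```

`search` finds the `Fred` later in the line, while `match` gives up because the line does not start with `Fred`.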

### Alternative Code

* No compilation of pattern into pattern object

* Use function `re.search(pattern, string)` to obtain match object

    * Alternative (but less general): `re.match(pattern, string)` (similar to `patobj.match(string)`)

In [3]:
pattern = r'Fred'

with open(NAMES) as names:
    for name in names:
        result = re.search(pattern, name) # re.search
        if result:
            print(name, end='')



# Pattern Types

## Matching Characters

* most characters simply match themselves
    
    * Example: `'test'` will match string `abctestdef`
        
* some characters have special meaning
        
    * Metacharacters: `. ^ $ * + ? { } [ ] \ | ( )`
    
    * Escape them with a backslash (`\`) to match them literally
    
        * `'\[test\]'` will match `[test]`
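
A small sketch of what escaping changes, using a made-up sample string. `re.escape` is a standard `re` function that is not covered in these notes; it adds the backslashes automatically.

```python
import re

text = 'see [test] here'   # hypothetical sample string

# Escaped brackets are matched literally.
literal = re.search(r'\[test\]', text)

# Unescaped, [test] is a character set: one of t, e or s.
char_set = re.search(r'[test]', text)

# re.escape() builds a pattern that matches the given text literally.
escaped_pattern = re.escape('[test]')

print(literal[0])          # [test]
print(char_set[0])         # s
print(escaped_pattern)     # \[test\]
```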

## The Dot `.`

* A dot `.` matches any character but a newline

    * `'....'`: 4 non-newline characters in a row

        * matches `abcd`, `abcdefg`
        
        * does not match: `abc` or `abc\nd`
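
The four strings listed above can be checked directly, as a quick sketch:

```python
import re

pattern = r'....'   # four non-newline characters in a row

dot_results = {}
for string in ['abcd', 'abcdefg', 'abc', 'abc\nd']:
    result = re.search(pattern, string)
    dot_results[string] = result[0] if result else None
    print(repr(string), '->', dot_results[string])
```

For `abcdefg` only the first four characters are reported, since the pattern asks for exactly four.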

## Set of Characters: `[…]`

* Characters between `[` and `]` specify a set of characters to match

    * `[aeiou]` matches a vowel
    
    * To match literal `]`, escape it with `\` or place it as first character of the set

        * `[aeiou\]]` and `[]aeiou]` will both match a vowel or `]`.


* Can specify range of characters with `-`, e.g.

    * `[a-z]` matches a character in set `{a, ..., z}`.
    
    * To add literal `-` to the set, escape it with `\` or place it as first or last character of the set

        * `[a\-z]`, `[-az]` and `[az-]` all match `a`, `z` or `-`.
    
* Ranges and single characters can be combined

    * `[a-z_0-9]` matches a character in set `{a, ..., z, _, 0, ..., 9}`.
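
A brief sketch of the rules for literal `]` and `-` inside sets. The sample strings are made up, and `re.findall` (described further below) is used only to list every matching character.

```python
import re

# three spellings of the same set: a, z or a literal hyphen
hyphen_sets = [r'[a\-z]', r'[-az]', r'[az-]']
hyphen_results = [re.findall(p, 'a-z b') for p in hyphen_sets]
print(hyphen_results)

# ] placed first in the set is taken literally
bracket_result = re.findall(r'[]aeiou]', 'x]e')
print(bracket_result)      # [']', 'e']

# ranges combined with a single character
combined = re.findall(r'[a-z_0-9]', 'A_b-9')
print(combined)            # ['_', 'b', '9']
```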

### Example: Matching `Fred` or `fred`

In [3]:
pattern = r'[Ff]red'         # match Fred or fred
regobj = re.compile(pattern)

with open(NAMES) as names:
    for name in names:
        result = regobj.search(name)
        if result:
            print(name, end='')


Fred
Freda
Frederick
Manfred


## Complements: `[^…]`

* `^` at beginning of a character class creates complement

    * No special meaning if not at beginning

* Complement matches any character except those specified

* Example: `[^aeiou]` matches any one character but a lower case vowel.
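
As an illustration with made-up sample strings (`re.findall`, described further below, lists every match):

```python
import re

# every character that is not a lower-case vowel (the space counts too)
non_vowels = re.findall(r'[^aeiou]', 'audio book')
print(non_vowels)          # ['d', ' ', 'b', 'k']

# ^ that is not the first character of the set is an ordinary character
caret_literal = re.findall(r'[a^]', 'a^b')
print(caret_literal)       # ['a', '^']
```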

### Predefined Character Sets

* Python comes with predefined character sets

    * `\d`: any decimal digit (includes `[0-9]`; equivalent to `[0-9]` only with the `re.ASCII` flag)
    * `\D`: complement of `\d`
    * `\s`: any whitespace character (includes `[ \t\n\r\f\v]`)
    * `\S`: complement of `\s`
    * `\w`: any Unicode word character, including digits and the underscore (includes `[a-zA-Z0-9_]`)
    * `\W`: complement of `\w`

* Predefined character sets (`\d`, `\w`, …) are accepted inside sets
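
A short sketch of the predefined sets on a made-up sample string, including one used inside `[…]`:

```python
import re

text = 'Run 42\tat 9:15'   # hypothetical sample string

digits = re.findall(r'\d', text)         # decimal digits
spaces = re.findall(r'\s', text)         # whitespace, including the tab
non_word = re.findall(r'\W', text)       # everything that is not a word character
time_chars = re.findall(r'[\d:]', text)  # \d inside a set: a digit or a colon

print(digits)       # ['4', '2', '9', '1', '5']
print(spaces)       # [' ', '\t', ' ']
print(non_word)     # [' ', '\t', ' ', ':']
print(time_chars)   # ['4', '2', '9', ':', '1', '5']
```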

### Special Characters in `[…]`

* All but two metacharacters lose their special meaning inside `[…]`

* Still special are

    * `^`: to specify complement (if specified as first character)
    
    * `\`: 
        * to escape `]`, `-` or `\`
        
        * to signal predefined character sets, e.g. `\d` or `\w`

#### Example: Matching a Date String

* Match a date string of the form `DD/Mon/YYYY`

* Can use the following patterns

    * `\d\d/\w\w\w/\d\d\d\d`

    * `[0-9][0-9]/[A-Z][a-z][a-z]/[0-9][0-9][0-9][0-9]` (a little bit more specific about month)

In [4]:
good_date = "On 13/Dec/2018 at 9:15:00 we discovered planet Python."
bad_date = "On 13/Dec/18 at 9:15:00 we discovered planet Python."

pattern = r'\d\d/\w\w\w/\d\d\d\d'

In [5]:
for string in [good_date, bad_date]:
    result = re.search(pattern, string)
    if result:
        print('match:   ', string)
    else:
        print('no match:', string)

match:    On 13/Dec/2018 at 9:15:00 we discovered planet Python.
no match: On 13/Dec/18 at 9:15:00 we discovered planet Python.


## Repetition Qualifiers

* `\d\d\d\d` or `[0-9][0-9][0-9][0-9]` is unwieldy
* Use repetition qualifiers to specify repetition
    * `*`: none or more
    * `+`: one or more
    * `?`: none or one
    * `{n}`: `n` times
    * `{m,n}`: `m` to `n` times

Repetition qualifiers apply to previous character, character set or group (see below)

* `Z*`: none or more `Z`s
    * matches: `xxyy`, `xxZyy`, `xxZZZyy`

* `\d+`: one or more digits
    * matches `xx9yy`, `xx12345yy`

* `X?`: none or one `X`
    * matches: `abde`, `abXde`
    
* `A{2}`: two `A`'s in a row

    * matches: `xxAAyy`, `xxAAAAAAyy`

* `[AEIOU]{2,4}`: 2 to 4 upper case vowels in a row

    * matches: `xxAIyy`, `xxAUIyy`, `xxAUIOyy`, `xxAUIEEOyy`
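
To see what is actually matched, here is a small sketch that runs `re.search` on some of the sample strings from the bullets above:

```python
import re

# (pattern, sample string) pairs taken from the bullets above
examples = [
    (r'Z*', 'xxZZZyy'),
    (r'\d+', 'xx12345yy'),
    (r'A{2}', 'xxAAAAAAyy'),
    (r'[AEIOU]{2,4}', 'xxAUIEEOyy'),
]

matched = {}
for pattern, string in examples:
    matched[pattern] = re.search(pattern, string)[0]
    print(f'{pattern!r} in {string!r}: {matched[pattern]!r}')
```

Note that `Z*` reports an empty match at the very start of `xxZZZyy`: zero repetitions are allowed, and `search` stops at the first position where the pattern matches. `A{2}` and `[AEIOU]{2,4}` match only the first two and four characters of the longer runs.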

#### Example Revisited: Matching a Date String

* Match a date string of the form `DD/Mon/YYYY`

* Can use the following patterns

    * `\d{2}/\w{3}/\d{4}`
    
    * `[0-9]{2}/[A-Z][a-z]{2}/[0-9]{4}`
    
* Variation: match `DD/Mon/YYYY` or `DD/Mon/YY`:

    * `\d{2}/\w{3}/\d{2,4}`
    
    * `[0-9]{2}/[A-Z][a-z]{2}/[0-9]{2,4}`
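
One caveat worth checking: `{2,4}` also accepts a three-digit year. The sketch below compares it with a stricter variant that allows exactly two or four digits. The stricter pattern is a suggestion, not part of the original notes; it uses the `(?:…)` grouping described under Nested Repetitions and the standard function `re.fullmatch`, which requires the whole string to match.

```python
import re

loose = r'\d{2}/\w{3}/\d{2,4}'
strict = r'\d{2}/\w{3}/\d{2}(?:\d{2})?'   # two digits, optionally two more

year_check = {}
for date in ['13/Dec/2018', '13/Dec/18', '13/Dec/201']:
    year_check[date] = (
        re.fullmatch(loose, date) is not None,
        re.fullmatch(strict, date) is not None,
    )
    print(date, year_check[date])
```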

### Nested Repetitions

* Possible to use nested repetition qualifier

* Have to use parentheses

    * Syntax: `(?:...)*`, `(?:...)+`, `(?:...){m,n}`, …

* Examples:

    * `(?:a{6})*`: any multiple of six `a`'s in a row
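
A quick sketch of the example above. `re.fullmatch` (a standard function not otherwise used in these notes) is used so that the whole string has to consist of groups of six `a`'s.

```python
import re

pattern = r'(?:a{6})*'

multiples = {}
for n in [0, 6, 7, 12]:
    multiples[n] = re.fullmatch(pattern, 'a' * n) is not None
    print(n, multiples[n])
```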

## Anchors

* Anchor: Matching a position in a string

    * `\A`: matches at the start of the string

    * `\Z`: matches at the end of the string

    * `^`: matches at the start of the string (if not in special matching mode)

    * `$`: matches at the end of the string, or just before a newline at the end of the string (if not in special matching mode)
    
    * `\b`: matches at word boundary, i.e. at beginning or end of a word
    
        * word defined as sequence of word characters (`\w`), e.g. `[a-zA-Z0-9_]`
        
    * `\B`: opposite of `\b`
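
The difference between `^`/`$` and `\A`/`\Z` shows up with newlines. The sketch below uses a made-up two-line string; the special matching mode is assumed to be the `re.MULTILINE` flag described under Flags.

```python
import re

text = 'first\nsecond\n'   # hypothetical sample string

# $ also matches just before a newline at the end; \Z does not
dollar = re.search(r'second$', text) is not None
big_z = re.search(r'second\Z', text) is not None
print(dollar, big_z)       # True False

# ^ matches at every line start only in multiline mode; \A never does
caret_default = re.findall(r'^\w+', text)
caret_multiline = re.findall(r'^\w+', text, re.MULTILINE)
big_a_multiline = re.findall(r'\A\w+', text, re.MULTILINE)
print(caret_default)       # ['first']
print(caret_multiline)     # ['first', 'second']
print(big_a_multiline)     # ['first']
```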

### Anchor Examples

* `^\d+$`: integer-like strings
    
    * a non-empty sequence of digits
    
* `^\d+\.\d+$`: float-like strings
    
    * a non-empty sequence of digits followed by a dot and another non-empty sequence of digits
    
* `^\s*\d+\.\d+\s*$`: float-like strings that may be surrounded with white space

* `\bclass\b`: the word class

In [6]:
pattern = r'\bclass\b'

for string in [
    'class',
    'a class of',
    'super-class-power',
    'class9',               # 9 is a word character
    'super_class',          # _ is a word character
    'classy course'
]:
    result = re.search(pattern, string)
    if result:
        print('match:   ', string)
    else:
        print('no match:', string)

match:    class
match:    a class of
match:    super-class-power
no match: class9
no match: super_class
no match: classy course
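
The float-like pattern from the anchor examples can be tried the same way. The test strings are made up for illustration.

```python
import re

float_like = re.compile(r'^\s*\d+\.\d+\s*$')

float_results = {}
for string in ['3.14', '  2.0  ', '42', '1.', '.5', '1.2.3']:
    float_results[string] = float_like.search(string) is not None
    print(repr(string), float_results[string])
```

Only the first two strings match: the anchors force the whole string to fit the pattern, and both digit sequences must be non-empty.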


## Alternation

* Alternation: match one or the other regular expression

* If `A` and `B` are regular expressions, `A|B` will match any string that matches either `A` or `B`.
    
* `|` has very low precedence
    
    * `Crow|Servo` will match either `Crow` or `Servo`, not `Cro`, a `w` or an `S`, and `ervo`.
    
* Use `(?:…)` to be explicit about your alternation

    * Example: `(?:Crow|Servo)`
    
* To match literal `|` use `\|` or `[|]`
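
The low precedence of `|` matters most when anchors are involved. A minimal sketch with made-up test strings:

```python
import re

# without grouping: ^ belongs only to Crow, $ only to Servo
loose = r'^Crow|Servo$'
# with grouping: both anchors apply to either alternative
strict = r'^(?:Crow|Servo)$'

alternation = {}
for string in ['Crow', 'Servo', 'Crowbar', 'Tom Servo']:
    alternation[string] = (
        bool(re.search(loose, string)),
        bool(re.search(strict, string)),
    )
    print(string, alternation[string])
```

`Crowbar` and `Tom Servo` slip through the ungrouped pattern because each alternative carries only one of the two anchors.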

### Example: Dates again

With Alternation we can be more precise about dates

In [7]:
winter = "The date 24/Dec/2018 is in Winter."
summer = "The date 24/Jul/2018 is in Summer."

# put alternation of months into parentheses (?:...)
# otherwise we would match any one of these:
#   - \d{2}/Dec 
#   - Jan 
#   - Feb/\d{4}
pattern = r'\d{2}/(?:Dec|Jan|Feb)/\d{4}'

In [8]:
for string in [winter, summer]:
    result = re.search(pattern, string)
    if result:
        print('match:   ', string)
    else:
        print('no match:', string)

match:    The date 24/Dec/2018 is in Winter.
no match: The date 24/Jul/2018 is in Summer.


## Extracting Parts

* Often, a matching substring should be extracted

* Put parts that should be extracted between parentheses `(…)` (not `(?:…)`)

* Extract matching substrings with `group` method of match object or `[]` operator, where

    * `match_obj.group(0)` or `match_obj[0]`: is match of whole regular expression
    
    * `match_obj.group(i)` or `match_obj[i]`: is match of `i`'th group
    
* Arguments of method `group` can vary:

    * `match_obj.group()`: equal to `match_obj.group(0)`
    
    * `match_obj.group(i, j, k)`: returns tuple holding corresponding groups
    
* Method `groups()` returns tuple of all groups (group 0 not included)

In [9]:
date_string = "The date 24/Dec/2018 is in Winter."

pattern = r'(\d{2})/(Dec|Jan|Feb)/(\d{4})'

result = re.search(pattern, date_string)
if result:
    print('match:   ', date_string)
    
    # print out each group
    for i in range(4):
        print(f'result.group({i}):', result.group(i))
        print(f'result[{i}]:      ', result[i])

    # default argument of group is 0
    print('result.group():', result.group())
    
    # can specify several arguments
    # result is tuple of corresponding groups
    print('result.group(1,2,1,3):', result.group(1,2,1,3))
    
    # method groups() returns tuple of all groups but group(0)
    print('result.groups():', result.groups())

match:    The date 24/Dec/2018 is in Winter.
result.group(0): 24/Dec/2018
result[0]:       24/Dec/2018
result.group(1): 24
result[1]:       24
result.group(2): Dec
result[2]:       Dec
result.group(3): 2018
result[3]:       2018
result.group(): 24/Dec/2018
result.group(1,2,1,3): ('24', 'Dec', '24', '2018')
result.groups(): ('24', 'Dec', '2018')


### Subgroups

* Possible to have groups within another group

* Groups are numbered by counting opening parentheses of capturing groups from left to right

#### Example

In [10]:
date_string = "The date 24/Dec/2018 is in Winter."

pattern = r'^.*((\d{2})/(Dec|Jan|Feb)/(\d{4})).*$'

result = re.search(pattern, date_string)

print('match:', result.group(0))
print('date: ', result.group(1))
print('day:  ', result.group(2))
print('month:', result.group(3))
print('year: ', result.group(4))

match: The date 24/Dec/2018 is in Winter.
date:  24/Dec/2018
day:   24
month: Dec
year:  2018


### Backreferences

* Backreference: specify that contents of earlier capturing group must be found at current location in the string

* Syntax: `\i`, where `i` is the group number you backrefer to

#### Example

In [11]:
text = "The tsetse fly is dangerous."

pattern = r'(\w+)\1'

result = re.search(pattern, text)

print('match:   ', result.group(0))
print('1st part:', result.group(1))

match:    tsetse
1st part: tse


### Named Groups

* Python offers naming of groups

* Used to
    
    * backreference group by name

    * retrieve group by name

* Syntax

    * Naming group: `(?P<name>…)`
    
    * Backreference group: `(?P=name)`
    
    * Retrieve group by name: `match_obj.group('name')` or `match_obj['name']`


In [12]:
text = "The tsetse fly is dangerous."

pattern = r'(?P<part1>\w+)(?P=part1)'

result = re.search(pattern, text)

print('1st part:', result.group('part1'))

print('1st part:', result['part1'])


1st part: tse
1st part: tse
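
Named groups also make the earlier date example easier to read. This is a sketch; the group names `day`, `month` and `year` are chosen here for illustration, and `groupdict()` is a standard match object method not otherwise covered in these notes.

```python
import re

date_string = "The date 24/Dec/2018 is in Winter."

pattern = r'(?P<day>\d{2})/(?P<month>Dec|Jan|Feb)/(?P<year>\d{4})'

result = re.search(pattern, date_string)
if result:
    print(result['day'], result['month'], result['year'])
    # named groups keep their number as well
    print(result.group(2))
    # all named groups as a dictionary
    print(result.groupdict())
```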


### Repetition Qualifiers are Greedy

* Repetition qualifiers are greedy, i.e. they try to match as many characters as possible

* Non-greedy versions of repetition qualifiers available (just append `?` after qualifier)

    * `*?`: none or more, non-greedy
    * `+?`: one or more, non-greedy
    * `??`: none or one, non-greedy
    * `{m,n}?`: `m` to `n` times, non-greedy

#### Example

In [13]:
html_string = '<html><head><title>Title</title>'

# this pattern matches the whole string!
pattern = r'<.*>'

result = re.search(pattern, html_string)

print(result[0])

<html><head><title>Title</title>


In [14]:
html_string = '<html><head><title>Title</title>'

# this pattern matches only the html tag!
pattern = r'<.*?>'

result = re.search(pattern, html_string)

print(result[0])

<html>


## Iterating Over Matches

* A pattern can match more than once

* Example above: 

    ```python
    html_string = '<html><head><title>Title</title>'
    pattern = r'<.*?>'
    ```
    
    matches `<html>`, `<head>`, `<title>` and `</title>`
    
* `search()` function or method only returns first match

* Use function/method `findall` or `finditer` to obtain all non-overlapping matches

### `re.finditer(pattern, string)` / `patobj.finditer(string)`

Return an iterator over all non-overlapping matches in the string.  For each match, the iterator returns a match object.

Empty matches are included in the result.

In [15]:
html_string = '<html><head><title>Title</title>'
pattern = r'<.*?>'

matches = re.finditer(pattern, html_string)

for match in matches:
    print('matching:', match[0])

matching: <html>
matching: <head>
matching: <title>
matching: </title>


In [16]:
html_string = '<html><head><title>Title</title>'
pattern = r'<.*?>'
patobj = re.compile(pattern)

matches = patobj.finditer(html_string)

for match in matches:
    print('match:', match[0])

match: <html>
match: <head>
match: <title>
match: </title>


### `re.findall(pattern, string)` / `patobj.findall(string)`
    
Return a list of all non-overlapping matches in the string.

If one or more capturing groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

Empty matches are included in the result.

In [17]:
html_string = '<html><head><title>Title</title>'
pattern = r'<.*?>'

matching_strings = re.findall(pattern, html_string)

print(matching_strings)

['<html>', '<head>', '<title>', '</title>']


In [18]:
html_string = '<html><head><title>Title</title>'
pattern = r'<.*?>'
patobj = re.compile(pattern)

matching_strings = patobj.findall(html_string)

print(matching_strings)

['<html>', '<head>', '<title>', '</title>']
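
The effect of capturing groups on `findall` is easiest to see side by side. A sketch with a made-up sentence, reusing the date patterns from earlier:

```python
import re

text = "Winter dates: 24/Dec/2018, 01/Jan/2019 and 14/Feb/2019."

# no capturing group: list of whole matches
whole = re.findall(r'\d{2}/(?:Dec|Jan|Feb)/\d{4}', text)

# one capturing group: list of that group only
one_group = re.findall(r'\d{2}/(Dec|Jan|Feb)/\d{4}', text)

# several capturing groups: list of tuples
three_groups = re.findall(r'(\d{2})/(Dec|Jan|Feb)/(\d{4})', text)

print(whole)
print(one_group)
print(three_groups)
```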


## Flags

* Flags change matching behaviour

* Specify flags as additional argument of following functions:

    * `re.compile(pattern, flags=0)`
    * `re.match(pattern, string, flags=0)`
    * `re.search(pattern, string, flags=0)`
    * `re.finditer(pattern, string, flags=0)`
    * `re.findall(pattern, string, flags=0)`
    
* Remark: Flags can also be encoded into pattern itself (see documentation)
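
As a sketch of the remark above, the inline form `(?i)` at the start of a pattern has the same effect as passing `re.IGNORECASE`. The list of names is made up for illustration.

```python
import re

names = ['Fred', 'Manfred', 'FREDA', 'Bob']   # hypothetical sample data

inline = re.compile(r'(?i)fred')              # flag encoded in the pattern
explicit = re.compile(r'fred', re.IGNORECASE) # flag passed as argument

inline_hits = [name for name in names if inline.search(name)]
explicit_hits = [name for name in names if explicit.search(name)]

print(inline_hits)     # ['Fred', 'Manfred', 'FREDA']
print(explicit_hits)   # ['Fred', 'Manfred', 'FREDA']
```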

### Available Flags

* `re.ASCII` / `re.A`

    Make `\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s` and `\S`
    match the corresponding ASCII character categories.
    
* `re.IGNORECASE` / `re.I`

    Perform case-insensitive matching.
    
* `re.LOCALE` / `re.L`

    Make `\w`, `\W`, `\b`, `\B` and case-insensitive matching dependent on the current locale (bytes patterns only).

* `re.MULTILINE` / `re.M`

    `^` matches the beginning of lines (after a newline) as well as the beginning of the string.

    `$` matches the end of lines (before a newline) as well as the end of the string.

* `re.DOTALL` / `re.S`

    `.` matches any character at all, including the newline.
    
* `re.VERBOSE` / `re.X`

    Ignore whitespace and comments for nicer looking RE's.
    
    * Useful for complex regular expressions
    
    * Allows commenting parts of regular expressions
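
Two of these flags in action, as a sketch: `re.VERBOSE` applied to the date pattern used earlier, and `re.DOTALL` on a made-up two-line string.

```python
import re

# VERBOSE: whitespace and comments inside the pattern are ignored
date_pattern = re.compile(r'''
    (\d{2})          # day: two digits
    /
    (Dec|Jan|Feb)    # month: one of the winter months
    /
    (\d{4})          # year: four digits
''', re.VERBOSE)

result = date_pattern.search("The date 24/Dec/2018 is in Winter.")
print(result.groups())             # ('24', 'Dec', '2018')

# DOTALL: the dot also matches a newline
text = 'line one\nline two'        # hypothetical sample string
without_flag = re.search(r'one.line', text)
with_flag = re.search(r'one.line', text, re.DOTALL)
print(without_flag)                # None
print(repr(with_flag[0]))          # 'one\nline'
```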

### Example

In [19]:
pattern = r'fred'

# compile using case insensitive flag
regobj = re.compile(pattern, re.I) 

with open(NAMES) as names:
    for name in names:
        result = regobj.search(name)
        if result:
            print(name, end='')


Fred
Freda
Frederick
Manfred


### Combine Flags

* Flags can be combined

* Each flag's binary pattern has exactly one bit set

* Use bitwise "or" (`|`) to combine flags

* Example

    ```python
    regobj = re.compile(pattern, re.M | re.I)
    ```
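
A minimal sketch of what the combination changes, using a made-up multi-line string:

```python
import re

text = 'Alice\nfred\nFREDA\nBob'   # hypothetical sample string

only_i = re.findall(r'^fred', text, re.I)          # ^ only at string start
only_m = re.findall(r'^fred', text, re.M)          # case-sensitive
both = re.findall(r'^fred', text, re.M | re.I)     # both behaviours

print(only_i)   # []
print(only_m)   # ['fred']
print(both)     # ['fred', 'FRED']
```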

# Practical Examples

## Atoms

* [atoms.log](X1_Files/atoms.log): Output from different computation runs

* Runs without errors:
    
    ```
    RUN 000013 COMPLETED. OUTPUT IN FILE aluminium.dat.
    RUN 000014 COMPLETED. OUTPUT IN FILE silicon.dat.
    RUN 000015 COMPLETED. OUTPUT IN FILE phosphorus.dat.
    ```

* Runs with errors or warnings:
    
    ```
    RUN 000039 COMPLETED. OUTPUT IN FILE yttrium.dat. 1 UNDERFLOW WARNING.
    RUN 000040 COMPLETED. OUTPUT IN FILE zirconium.dat. 2 UNDERFLOW WARNINGS.
    RUN 000058 COMPLETED. OUTPUT IN FILE cerium.dat. ALGORITHM DID NOT CONVERGE AFTER 100000 ITERATIONS.
    ```

* Task: Get lines without warnings or errors

    * Skeleton: [get_successful_runs.py](X1_Files/get_successful_runs.py)
    
    * Solution: [get_successful_runs_solution.py](X1_Files/get_successful_runs_solution.py)

## Hack Attempts

* [messages](X1_Files/messages): Server log file

* Logs that indicate hack attempt:
    
    ```
    Jun 25 23:47:33 noether sshd[9277]: Invalid user account from 207.54.140.124
    Jun 25 23:47:34 noether sshd[9282]: Invalid user adam from 207.54.140.124
    Jul  1 07:41:11 noether sshd[14506]: Invalid user test from 210.51.172.168
    Jul  1 07:41:14 noether sshd[14511]: Invalid user guest from 210.51.172.168
    ```

* Task: Filter hack attempts

    * Skeletons: 
        * [get_hack_attempts.py](X1_Files/get_hack_attempts.py)
        * [get_hack_attempts_verbose.py](X1_Files/get_hack_attempts_verbose.py)
    
    * Solutions: 
        * [get_hack_attempts_solution.py](X1_Files/get_hack_attempts_solution.py)
        * [get_hack_attempts_verbose_solution.py](X1_Files/get_hack_attempts_verbose_solution.py)

## Boiling Temperatures

* [boil.txt](X1_Files/boil.txt): Text file with boiling temperatures of elements

* A line may hold more than one entry:
    
    ```
    Ar 87.3
    Re 5900.0 Ra 2010.0
    K 1032.0 Rn 211.3 Rh 3968.0
    Be 2742.0 Ba 2170.0
    ```

* Task: Extract element names and boiling temperatures

    * Skeleton: [get_boiling_temp.py](X1_Files/get_boiling_temp.py)
    
    * Solution: [get_boiling_temp_solution.py](X1_Files/get_boiling_temp_solution.py)