# Regular Expression

[Reference](https://docs.python.org/3/howto/regex.html) 

### metacharacters: `. ^ $ * + ? { } [ ] \ | ( )`

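### character class: `[ ]`
 - **most metacharacters are not active inside classes** (the exceptions are `\`, a leading `^`, a `-` between two characters, and `]`)
 - **complement the set by placing `^` as the first character inside the class**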

In [1]:
'''
[abc]       A single character of: a, b or c
[a-z]       A character in the range: a to z
[^abc]      A character except: a, b or c; complement of [abc]
[^a-z]      A character not in the range: a-z; complement of [a-z]
[a-zA-Z]    A character in the range: a-z or A-Z

[akm$]      match any of the characters 'a', 'k', 'm', or '$'; 
            '$' is usually a metacharacter, but inside a character class it’s stripped of its special nature.
'''
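
The class rules above can be checked with `re.findall()`. This is a minimal sketch; the sample string and variable names are made up for illustration.

```python
import re

text = "cab $5 Zoo_42"  # hypothetical sample string

# [abc] matches a single character that is a, b or c
in_abc = re.findall(r"[abc]", text)
print(in_abc)        # ['c', 'a', 'b']

# [^a-z ] matches a single character that is neither a lowercase letter nor a space
not_lower = re.findall(r"[^a-z ]", text)
print(not_lower)     # ['$', '5', 'Z', '_', '4', '2']

# Inside a class '$' is an ordinary character, not the end-of-string anchor
with_dollar = re.findall(r"[akm$]", text)
print(with_dollar)   # ['a', '$']
```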

### Special sequences:

In [2]:
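r'''
\w        [a-zA-Z0-9_]    : any alphanumeric character or underscore
\W        [^a-zA-Z0-9_]   : complement of \w, same as [^\w]

\d        [0-9]           : any decimal digit
\D        [^0-9]          : any non-digit character

\s        [ \t\n\r\f\v]   : any whitespace character
\S        [^ \t\n\r\f\v]  : any non-whitespace character

.         [^\n]           : dot, anything except a newline character

The bracketed classes are the ASCII equivalents; for str patterns the
sequences also match Unicode characters unless re.ASCII is set.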

'''
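
A short sketch of the special sequences in action. The sample string is an assumption chosen so that each sequence has something to match.

```python
import re

sample = "id_7 costs 42 USD\n"  # hypothetical sample string

words = re.findall(r"\w+", sample)   # runs of word characters
digits = re.findall(r"\d+", sample)  # runs of digits
spaces = re.findall(r"\s", sample)   # single whitespace characters
dots = re.findall(r".", "a\nb")      # '.' skips the newline by default

print(words)   # ['id_7', 'costs', '42', 'USD']
print(digits)  # ['7', '42']
print(spaces)  # [' ', ' ', ' ', '\n']
print(dots)    # ['a', 'b']
```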

### Repeating things:

 1. __zero or more: `*`__
 2. __one or more: `+`__
 3. __zero or one: `?`__ (marks something as optional)
 4. __exactly m times: `{m}`__
 5. __m to n times: `{m,n}`__ (no space after the comma; `{m, n}` is treated as literal text)
 6. __zero to n times: `{,n}`__
 7. __m or more times: `{m,}`__

In [3]:
'''
ca*t         ct, cat, caaat, ...
ca+t         cat, caat, caaat, ...
home-?brew   homebrew, home-brew

a/{3}b       a///b
a/{2,5}b     a//b, a///b, a////b, a/////b
a/{,2}b      ab, a/b, a//b
a/{2,}b      a//b, a///b, a////b, ...

*            {0,}
+            {1,}
?            {0,1}

'''
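
The bounded quantifiers can be compared against a fixed list of candidate strings. This is a sketch; `candidates` and the helper `full_matches` are hypothetical names introduced only for this illustration. The last call shows that a space inside the braces turns the quantifier into literal text.

```python
import re

# Hypothetical candidates with 0, 1, 2, 3 and 6 slashes
candidates = ["ab", "a/b", "a//b", "a///b", "a//////b"]

def full_matches(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(full_matches(r"a/{3}b"))     # ['a///b']
print(full_matches(r"a/{2,5}b"))   # ['a//b', 'a///b']
print(full_matches(r"a/{,2}b"))    # ['ab', 'a/b', 'a//b']
print(full_matches(r"a/{2,}b"))    # ['a//b', 'a///b', 'a//////b']

# With a space after the comma the braces are no longer a quantifier
print(full_matches(r"a/{2, 5}b"))  # []
```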

## Using Regular Expressions

### Pattern Object

A pattern object can be compiled with **various useful regular expression flags.**

In [4]:
import re

p1 = re.compile('ab*')
print(type(p1))
p1

<class 're.Pattern'>


re.compile(r'ab*', re.UNICODE)

In [5]:
p2 = re.compile('ab*', re.IGNORECASE)
p2

re.compile(r'ab*', re.IGNORECASE|re.UNICODE)
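
To see what the flag changes, the two patterns can be run against the same uppercase input. A minimal sketch; the input string `"ABBB"` is an assumption.

```python
import re

p1 = re.compile(r"ab*")                 # case-sensitive
p2 = re.compile(r"ab*", re.IGNORECASE)  # case-insensitive

print(p1.match("ABBB"))                 # None
print(p2.match("ABBB").group())         # ABBB

# The flags are stored on the pattern object as a bit mask
print(bool(p2.flags & re.IGNORECASE))   # True
```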

### The Backslash Effect

Backslash `\` is used to
 - indicate special forms, or
 - allow special characters to be used without invoking their special meaning.

**This conflicts with Python's usage of the same character for the same purpose in string literals.** \
Let's say you want to write a RE that matches the string **`\section`**, which might be found in a LaTeX file. To figure out what to write in the program code, start with the desired string to be matched. Next, escape any backslashes and other metacharacters by preceding them with a backslash, resulting in the **string `\\section`**. The string that **must be passed to `re.compile()` is `\\section`.**

**However, to express this as a Python string literal, both backslashes must be escaped again.**

| Characters    |      Stage                                |
|---------------|:------------------------------------------|
| \section      |  Text string to be matched                |
| \\section     |  Escaped backslash for `re.compile()`     |
| "\\\\section" |  Escaped backslashes for a string literal |

In short, to match a literal backslash, you have to write `\\\\` as the RE string, because the regular expression must be `\\` and each backslash must be expressed as `\\` inside a regular Python string literal. This leads to lots of repeated backslashes and makes the resulting strings difficult to understand.

__Solution: Use Python's raw string notation for regular expressions__; backslashes are not handled in any special way in a string literal prefixed with `r`. For example, `r"\n"` is a two-character string containing `\` and `n`, while `"\n"` is a one-character string containing a newline.

In addition, special escape sequences that are valid in regular expressions, but not valid as Python string literals, now result in a `DeprecationWarning` (a `SyntaxWarning` since Python 3.12) and will eventually become a `SyntaxError`, which means the sequences will be invalid if raw string notation or escaping the backslashes isn't used.

| Regular String   |  Raw string    |
|------------------|:---------------|
| "ab*"            |  r"ab*"        |
| "\\\\section"    |  r"\\section"  |
| "\\w+\\s+\\1"    |  r"\w+\s+\1"   |

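
The equivalence in the table can be verified directly. A small sketch; the variable names and the sample LaTeX text are assumptions.

```python
import re

latex = r"\section{Intro}"   # hypothetical text containing one backslash

regular = "\\\\section"      # regular literal: four backslashes typed
raw = r"\\section"           # raw literal: two backslashes typed

# Both literals produce the same 9-character pattern string
print(regular == raw)        # True
print(len(raw))              # 9

# The pattern matches a single literal backslash followed by 'section'
found = re.match(raw, latex).group()
print(found)                 # \section

# A raw "\n" is two characters, a regular "\n" is one newline character
print(len(r"\n"), len("\n")) # 2 1
```
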
### Performing Matches

| Method/Attribute |  Purpose                                                                    |
|------------------|:----------------------------------------------------------------------------|
| `match()`        |  Determine if the RE matches at the beginning of the string.                |
| `search()`       |  Scan through a string, looking for any location where this RE matches.     |
| `findall()`      |  Find all substrings where the RE matches, and return them as a list.       |
| `finditer()`     |  Find all substrings where the RE matches, and return them as an iterator.  |

In [5]:
import re

# Creating RE Object
ptrn = re.compile('[a-z]+')
print(type(ptrn))
ptrn

<class 're.Pattern'>


re.compile(r'[a-z]+', re.UNICODE)

In [7]:
# Matching against Blank String
print(ptrn.match(""))
print(ptrn.search(""))
print(ptrn.findall(""))
print()

print(ptrn.finditer(""))
print(list(ptrn.finditer("")))

None
None
[]

<callable_iterator object at 0x0000024D36B1BB20>
[]


In [8]:
# Matching against 'tempo'
print(ptrn.match('tempo'))
print(ptrn.search('tempo'))
print(ptrn.findall('tempo'))
print()

print(ptrn.finditer('tempo taxi'))
print(list(ptrn.finditer('tempo taxi')))

<re.Match object; span=(0, 5), match='tempo'>
<re.Match object; span=(0, 5), match='tempo'>
['tempo']

<callable_iterator object at 0x0000024D36B20EE0>
[<re.Match object; span=(0, 5), match='tempo'>, <re.Match object; span=(6, 10), match='taxi'>]


### re.Match object

| Method/Attribute | Purpose                                                           |
|------------------|:------------------------------------------------------------------|
| `group()`        | Return the string matched by the RE                               |
| `start()`        | Return the starting position of the match                         |
| `end()`          | Return the ending position of the match                           |
| `span()`         | Return a tuple containing the (start, end) positions of the match |

In [13]:
ptrn = re.compile('[a-z]+')
mtch = ptrn.match('maxtor desney hollywood')
print(mtch)
print()

print(mtch.group())
print(mtch.start())
print(mtch.end())
print(mtch.span())

<re.Match object; span=(0, 6), match='maxtor'>

maxtor
0
6
(0, 6)


In [14]:
mtch = ptrn.match('::: maxtor disney hollywood')
print(mtch)

None


In [15]:
mtch = ptrn.search('::: maxtor disney hollywood')
print(mtch)
print()

print(mtch.group())
print(mtch.start())
print(mtch.end())
print(mtch.span())

<re.Match object; span=(4, 10), match='maxtor'>

maxtor
4
10
(4, 10)


In [16]:
mtch = ptrn.findall('::: maxtor disney hollywood')
print(mtch)

['maxtor', 'disney', 'hollywood']


In [17]:
mtch = ptrn.finditer('::: maxtor disney hollywood')
print(mtch)
print()

for m in mtch:
    print(f'group-> {m.group()}, span->{m.span()}')

<callable_iterator object at 0x0000024D36C1F160>

group-> maxtor, span->(4, 10)
group-> disney, span->(11, 17)
group-> hollywood, span->(18, 27)


### Example: show the positions of all numbers in a given string

In [38]:
astr = 'Use digits (e.g. pages: 56–74, 115–117; years: 1864–1899, 1998–2008; streets 36–99 Spa St).'
ptrn = re.compile(r'\d+')

print(ptrn.findall(astr))
print()

for p in ptrn.finditer(astr):
    print(p.span())

['56', '74', '115', '117', '1864', '1899', '1998', '2008', '36', '99']

(24, 26)
(27, 29)
(31, 34)
(35, 38)
(47, 51)
(52, 56)
(58, 62)
(63, 67)
(77, 79)
(80, 82)


## Module-Level Functions

You don't have to create a pattern object and call its methods; the `re` module also provides top-level functions called `match()`, `search()`, `findall()`, `sub()` and so forth. These functions take the same arguments as the corresponding pattern method, with the RE string added as the first argument, and still return either `None` or a match object instance.

__Which is better:__ calling the module-level functions, or compiling the pattern and calling its methods yourself? \
If you're accessing a regex within a loop, pre-compiling it will save a few function calls.

In [41]:
print(re.match(r'From\s+', 'Fromage amk'))

None


In [42]:
re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')

<re.Match object; span=(0, 5), match='From '>
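
Both styles give the same results; the sketch below applies them to a few made-up header lines. The list `lines` and the name `from_re` are assumptions for illustration. The `re` module keeps a cache of recently compiled patterns, so the difference is mostly about avoiding repeated lookups and keeping the pattern in one named place.

```python
import re

# Hypothetical input lines
lines = [
    "From amk Thu May 14 19:12:10 1998",
    "Fromage amk",
    "From: author@example.com",
]

# Module-level function: the pattern string is passed on every call
hits_module = [ln for ln in lines if re.match(r"From\s+", ln)]

# Pre-compiled pattern: compile once, reuse inside the loop
from_re = re.compile(r"From\s+")
hits_compiled = [ln for ln in lines if from_re.match(ln)]

print(hits_module == hits_compiled)  # True
print(hits_compiled)                 # ['From amk Thu May 14 19:12:10 1998']
```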

## Compilation Flags

 - Modify some aspects of how regular expressions work.
 - Available under two names: a long name such as `IGNORECASE` and a short, one-letter form such as `I`.
 
| Flag               |  Meaning                                                                                                           |
|:-------------------|:-------------------------------------------------------------------------------------------------------------------|
| `ASCII`, `A`       |  Makes several escapes like `\w`, `\b`, `\s` and `\d` match only on ASCII characters with the respective property. |
| `DOTALL`, `S`      |  Make `.` match any character, including newlines.                                                                 |
| `IGNORECASE`, `I`  |  Do case-insensitive matches. |
| `LOCALE`, `L`      |  Do a locale-aware match. |
| `MULTILINE`, `M`   |  Multi-line matching, affecting `^` and `$`. |
| `VERBOSE`, `X`     |  Enable verbose REs, which can be organized more cleanly and understandably. |

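
A brief sketch of how three of these flags change matching. The sample strings and variable names are assumptions.

```python
import re

text = "first line\nsecond line"  # hypothetical two-line string

# MULTILINE: '^' also matches right after each newline
starts_default = re.findall(r"^\w+", text)
starts_multi = re.findall(r"^\w+", text, re.MULTILINE)
print(starts_default)  # ['first']
print(starts_multi)    # ['first', 'second']

# DOTALL: '.' also matches the newline
dot_default = re.match(r".+", text).group()
dot_all = re.match(r".+", text, re.DOTALL).group()
print(dot_default)      # first line
print(dot_all == text)  # True

# ASCII: '\w' stops matching non-ASCII letters such as U+00E9
word_unicode = re.findall(r"\w+", "caf\u00e9")
word_ascii = re.findall(r"\w+", "caf\u00e9", re.ASCII)
print(word_unicode)  # ['café']
print(word_ascii)    # ['caf']
```

Flags can be combined with `|`, for example `re.IGNORECASE | re.MULTILINE`.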

### Comments inside Regular Expression

In [44]:
# See how much easier it is to read
charref = re.compile(r"""
&[#]             # Start of a numeric entity reference
(
   0[0-7]+       # Octal form
 | [0-9]+        # Decimal form
 | x[0-9a-fA-F]+ # Hexadecimal form
)
;                # Trailing semicolon
""", re.VERBOSE)

In [45]:
# Without verbose setting, the RE would look like this:
charref = re.compile("&#(0[0-7]+"
                     "|[0-9]+"
                     "|x[0-9a-fA-F]+);")

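As a quick check that the two spellings behave the same, both can be run over one sample string. This is a sketch; the names `verbose`, `compact` and the sample text are assumptions. Note that `#` is written as `[#]` in the verbose version because a bare `#` would start a comment in `re.VERBOSE` mode.

```python
import re

verbose = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

compact = re.compile("&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+);")

# Hypothetical sample: decimal, hex and octal references plus one invalid one
text = "copy &#169; hex &#xA9; octal &#0251; broken &#xyz;"

print(verbose.findall(text))                           # ['169', 'xA9', '0251']
print(verbose.findall(text) == compact.findall(text))  # True
```
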
## Example: Stressful Subject

Sofia has had a very stressful month and decided to take a vacation for a week. To avoid any stress during her vacation she wants to forward emails with a stressful subject line to Stephen.

The function should recognise if a subject line is stressful. 

__A stressful subject line:__
 - has all letters in uppercase, and/or
 - ends with at least 3 exclamation marks, and/or
 - contains at least one of the following "red" words: "help", "asap", "urgent"
     + red words can be spelled in different ways: \
       "HELP", "help", "HeLp", "H!E!L!P!", "H-E-L-P", \
       "HHHEEEEEEEEELLP" (they just can't have any other \
       letters interspersed between them.)


In [9]:
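# Method: 1
import re
def is_stressful(subj):
    low = subj.lower()
    # Rule 1: every cased letter is uppercase
    test1 = subj.isupper()
    # Rule 2: the subject ends with three exclamation marks
    test2 = bool(re.search(r'!!!$', low))
    # Rule 3: a red word; letters may repeat, with at most one non-letter between them
    test3 = bool(re.search(r'h+[^a-z]?e+[^a-z]?l+[^a-z]?p', low)) or \
            bool(re.search(r'a+[^a-z]?s+[^a-z]?a+[^a-z]?p', low)) or \
            bool(re.search(r'u+[^a-z]?r+[^a-z]?g+[^a-z]?e+[^a-z]?n+[^a-z]?t', low))
    return test1 or test2 or test3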

In [10]:
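# Method: 2
def is_stressful(subj):
    # Rules 1 and 2: all uppercase, or ends with '!!!'
    test1 = subj.isupper() or subj.endswith('!!!')
    
    # Rule 3: one pattern per red word, e.g. 'h+[^a-z]?e+[^a-z]?l+[^a-z]?p+[^a-z]?'
    red = 'help asap urgent'.split()
    red_re = [''.join(p + '+[^a-z]?' for p in word) for word in red]
    test2 = any(re.search(r, subj.lower()) for r in red_re)
    
    return test1 or test2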

In [11]:
# Tests
print(is_stressful("Hi") == False)
print(is_stressful("I neeed HELP") == True)
print(is_stressful("I neeed help") == True)
print(is_stressful("I neeed HeLp") == True)
print(is_stressful("I neeed H!E!L!P!") == True)
print(is_stressful("I neeed H-E-L-P") == True)
print(is_stressful("I neeed HHHEEEEEEEEELLP") == True)
print(is_stressful("I neeed H-E-t-L-P") == False)
print(is_stressful("Heeeeeey!!! I’m having so much fun!”") == False)

True
True
True
True
True
True
True
True
True
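
Both methods allow at most one non-letter between the letters of a red word. If the rule "no other letters interspersed" is read as allowing any run of non-letters (an assumption, not something the task text confirms), a single compiled alternation could look like the sketch below. The names `RED_WORDS`, `word_pattern`, `RED_RE` and `is_stressful_sketch` are hypothetical.

```python
import re

RED_WORDS = ("help", "asap", "urgent")

def word_pattern(word):
    # Each letter may repeat; any run of non-letters may sit between letters
    return r"[^a-z]*".join(ch + "+" for ch in word)

# One alternation covering all red words, compiled once
RED_RE = re.compile("|".join(word_pattern(w) for w in RED_WORDS))

def is_stressful_sketch(subj):
    return (
        subj.isupper()
        or subj.endswith("!!!")
        or bool(RED_RE.search(subj.lower()))
    )

print(is_stressful_sketch("Hi"))                  # False
print(is_stressful_sketch("I neeed H--E--L--P"))  # True
print(is_stressful_sketch("I neeed H-E-t-L-P"))   # False
```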
