### Regular Expressions - regex

In [1]:
import re

numRegex = re.compile(r'\d\d-\d\d')     # find two pairs of digits separated by a hyphen
mo = numRegex.findall('numbers 65 78 98-09 567-8 77-87')
mo

['98-09', '77-87']

The `findall()` method returns a list of strings containing every match in the searched string, while the `search()` method returns a match object for the first matched text only.

Adding parentheses creates groups in the regex. The `group()` method of the match object can then be used to select the text matched by a single group.

In [2]:
numRegex = re.compile(r'(\d\d)-(\d\d)')
mo = numRegex.search('numbers 23-56')
print(mo.group())   # prints entire match
print(mo.group(1))  # prints the 1st group
print(mo.group(2))  # prints the 2nd group
print(mo.groups())  # prints all groups

23-56
23
56
('23', '56')
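
Two details are worth keeping in mind when working with groups: `search()` returns `None` when nothing matches, and `findall()` returns tuples of groups instead of whole matches once the pattern contains groups. A minimal sketch, with sample strings and variable names chosen only for illustration:

```python
import re

num_regex = re.compile(r'(\d\d)-(\d\d)')

# search() returns None when nothing matches, so test before calling group()
mo = num_regex.search('no pairs here')
result = mo.group() if mo else 'no match'
print(result)   # no match

# with groups in the pattern, findall() returns one tuple of groups per match
pairs = num_regex.findall('numbers 98-09 567-8 77-87')
print(pairs)    # [('98', '09'), ('77', '87')]
```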


Some characters, such as parentheses, have a special meaning in regex, so to match them literally they need to be escaped with `\`.  
The following characters have special meanings:  
`. ^ $ * + ? { } [ ] \ ( ) |`
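
As a rough illustration of escaping, the sketch below matches literal parentheses and a literal dot, and uses `re.escape()` to escape a whole string at once. The sample strings and variable names are made up for the example.

```python
import re

text = 'call (555) 1234 or 3.14'

# \( and \) match literal parentheses instead of opening a group
area = re.search(r'\(\d\d\d\)', text).group()
print(area)   # (555)

# an unescaped dot matches any character; \. matches only a literal dot
loose = re.findall(r'3.14', '3.14 3x14')
strict = re.findall(r'3\.14', '3.14 3x14')
print(loose, strict)   # ['3.14', '3x14'] ['3.14']

# re.escape() escapes every special character in a literal string
escaped = re.escape('1+1=2?')
found = re.search(escaped, 'is 1+1=2? yes').group()
print(found)   # 1+1=2?
```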

The `|` operator, called the pipe, matches any one of several alternative expressions.

In [3]:
# if both expressions occur in the searched text, the one that appears first is returned
orRegex = re.compile(r'apple|orange')
mo1 = orRegex.search('red apples')
mo2 = orRegex.search('oranges and apples')
print(mo1.group())
print(mo2.group())

apple
orange
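
The pipe can also be placed inside a group, so that the alternatives share a common prefix. A small sketch, using a made-up pattern and sentence:

```python
import re

# the alternatives share the prefix 'Bat'; the group holds whichever suffix matched
bat_regex = re.compile(r'Bat(man|mobile|cave)')
mo = bat_regex.search('The Batmobile lost a wheel')
whole = mo.group()
suffix = mo.group(1)
print(whole)    # Batmobile
print(suffix)   # mobile
```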


- `?` marks the preceding group or character as optional: it matches 0 or 1 instance of it
- `*` matches 0 or more instances of the preceding group or character
- `+` matches 1 or more instances of the preceding group or character

In [4]:
r = re.compile(r'\d+(\D\D\D\D\D)+')
m = r.search('76576576ABCDE876h76')
print(m.group())
print(m.groups())

76576576ABCDE
('ABCDE',)
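
To compare `?`, `*` and `+` side by side, the sketch below applies each of them to the same letter in the same text. The word list is invented purely for illustration.

```python
import re

words = 'ggle gogle google'

optional = re.findall(r'go?gle', words)   # 0 or 1 'o'
star = re.findall(r'go*gle', words)       # 0 or more 'o'
plus = re.findall(r'go+gle', words)       # 1 or more 'o'

print(optional)   # ['ggle', 'gogle']
print(star)       # ['ggle', 'gogle', 'google']
print(plus)       # ['gogle', 'google']
```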


Curly brackets `{}` specify how many times the preceding group or character is repeated: an exact count (`{3}`), a range (`{1,3}`), or an open-ended minimum (`{4,}`).

In [5]:
print( re.search(r'A{3}', 'abcd xyz AA AAA').group())   # AA will not be matched, only AAA
print( re.findall(r'A{1,3}', 'abcd xyz AA AAA'))
print( re.search(r'A{4,}', 'abcd xyz AAAAA').group())   # matches 4 or more

AAA
['AA', 'AAA']
AAAAA


#### **Greedy and Non-Greedy matching**  
Regular expressions are greedy by default: in ambiguous situations they match the longest string possible.  
The non-greedy (lazy) version, written as **a repetition followed by a question mark** (for example `{2,10}?` or `\d+?`), matches the shortest string possible.

In [6]:
print(re.search(r'A{2,10}?', 'AAAAAA').group())

AA
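
The difference tends to matter most when a pattern contains a closing delimiter. The following sketch contrasts the greedy and lazy forms on a run of digits and on a made-up HTML-like string (the dot here matches any character except a newline):

```python
import re

# greedy takes every digit it can; lazy stops after the minimum of one
greedy_digits = re.search(r'\d+', 'id 12345').group()
lazy_digits = re.search(r'\d+?', 'id 12345').group()
print(greedy_digits, lazy_digits)   # 12345 1

html = '<b>bold</b> and <i>italic</i>'

# greedy: runs from the first '<' to the last '>'
greedy_tags = re.findall(r'<.+>', html)
# lazy: stops at the first '>' after each '<'
lazy_tags = re.findall(r'<.+?>', html)
print(greedy_tags)   # ['<b>bold</b> and <i>italic</i>']
print(lazy_tags)     # ['<b>', '</b>', '<i>', '</i>']
```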


#### Character classes
| character class | represents |
| --------------- | ---------- |
| \d | Any numeric digit from 0 to 9 |
| \D | Any character that is not a numeric digit |
| \w | Any letter, numeric digit or the underscore character |
| \W | Any character that is not a letter, digit or underscore |
| \s | Any whitespace character, such as a space, tab or newline |
| \S | Any character that is not whitespace |
  
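`[0-4]` will match the digits 0 to 4  
`[a-zA-Z]` will match any lowercase or uppercase letter  
`[aeiou]` will match lowercase vowels  
> Inside square brackets, most special characters such as `?` and `*` lose their special meaning and need not be escaped. The exceptions are `\`, `]`, `^` (at the start) and `-` (between two characters).  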

Placing the caret character `(^)` at the beginning of a character class makes it a negative character class, which matches every character except the ones in the class.  
`[^AEIOUaeiou]` will match all characters other than vowels
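
A short sketch of these classes in action, run against an invented sample sentence:

```python
import re

text = 'Order #42: 3 apples, 7 Kiwis!'

# only the digits 0 to 4 are in the class, so 7 is skipped
digits = re.findall(r'[0-4]', text)
# lowercase vowels only; the capital 'O' in 'Order' is not matched
vowels = re.findall(r'[aeiou]', text)
# negative class: anything that is neither a word character nor whitespace
punctuation = re.findall(r'[^\w\s]', text)

print(digits)        # ['4', '2', '3']
print(vowels)        # ['e', 'a', 'e', 'i', 'i']
print(punctuation)   # ['#', ':', ',', '!']
```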

#### **caret and dollar sign**  
- A caret `(^)` at the beginning of the regex means that the match must occur at the beginning of the string.  
- A dollar sign `($)` at the end of the regex means that the match must occur at the end of the string.  
- A caret and a dollar sign together mean that the entire string must match the regex.

In [7]:
re.search(r'^hello', 'hello world').group()

'hello'

In [8]:
re.search(r'^hello$', 'hello world') is None

True
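
A common use of the two anchors together is checking that a whole string has a given shape. The sketch below tests for digits-only strings; the name `whole_number` and the sample inputs are illustrative. It also shows that `$` still matches just before a trailing newline, whereas `re.fullmatch()` does not allow one.

```python
import re

whole_number = re.compile(r'^\d+$')

all_digits = bool(whole_number.search('1234567890'))
trailing_text = bool(whole_number.search('12345xyz'))
inner_space = bool(whole_number.search('12 34'))
print(all_digits, trailing_text, inner_space)   # True False False

# '$' also matches just before a newline at the very end of the string
with_newline = bool(whole_number.search('123\n'))
# fullmatch() requires the entire string, newline included, to match
strict = re.fullmatch(r'\d+', '123\n') is not None
print(with_newline, strict)   # True False
```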

#### **dot character** - wildcard  
The dot character `(.)` matches any single character except a newline.  
Each dot matches exactly one character.

In [9]:
re.findall(r'A.+D', 'abcd ABCDEF \n A2D A D \n A_D')

['ABCD', 'A2D A D', 'A_D']

#### Regex flags
| Flag | long syntax | Meaning |
| ---- | ----------- | ------- |
| re.A | re.ASCII | Perform ASCII-only matching instead of full Unicode matching |
| re.I | re.IGNORECASE | Perform case-insensitive matching |
| re.M | re.MULTILINE | Changes the behaviour of the metacharacters `^` (caret) and `$` (dollar).<br> With this flag, `^` matches at the beginning of the string and at the beginning of each line.<br> `$` matches at the end of the string and at the end of each line |
| re.S | re.DOTALL | Make the dot (`.`) match any character at all, including a newline. Without this flag, the dot matches anything except a newline |
| re.X | re.VERBOSE | Ignore whitespace in the pattern and allow `#` comments, which makes a long regex more readable |
| re.L | re.LOCALE | Make `\w`, `\W`, `\b`, `\B` and case-insensitive matching dependent on the current locale. Use only with bytes patterns |
  

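`re.compile()` takes a single `flags` argument, so multiple flags are combined into one value with the bitwise OR operator (`|`).  
`re.compile(r'\w', re.IGNORECASE|re.DOTALL)`
  
The flags can also be added together, which gives the same value as long as each flag appears only once; the bitwise OR is the more common form.  
`re.compile(r'\w', flags = re.I+re.X)`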

In [13]:
re.search(r'''
                apple   # match apple''', '2 APPLES', re.I+re.X).group()

'APPLE'
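
To see what `re.M` and `re.S` change in practice, the sketch below runs the same patterns with and without each flag on a made-up three-line string:

```python
import re

text = 'first line\nsecond line\nthird line'

# without re.M, ^ anchors only at the very start of the string
no_flag = re.findall(r'^\w+', text)
# with re.M, ^ also anchors at the start of each line
multiline = re.findall(r'^\w+', text, re.M)
print(no_flag)     # ['first']
print(multiline)   # ['first', 'second', 'third']

# without re.S the dot stops at the newline; with re.S it runs through it
dot_default = re.search(r'first.*', text).group()
dot_all = re.search(r'first.*', text, re.S).group()
print(repr(dot_default))   # 'first line'
print(dot_all == text)     # True

# flags are combined with bitwise OR
combined = re.findall(r'^\w+ LINE$', text, re.I | re.M)
print(combined)    # ['first line', 'second line', 'third line']
```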

#### `sub()` method
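The `sub()` method substitutes new text in place of the matched patterns.  
With the module-level `re.sub(pattern, repl, string)`, the pattern comes first, then the replacement text, then the string to search. With a compiled regex, `regex.sub(repl, string)` takes the replacement text first.  
The `sub()` method returns the string with the substitutions applied.

  
In the replacement argument, `\1`, `\2`, `\3` refer to the text matched by the 1st, 2nd and 3rd group of the pattern, so that text is carried over into the result instead of being discarded.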

In [11]:
re.sub('apple', '*****', 'apple, orange, mango')

'*****, orange, mango'

In [12]:
re.sub(r'(\w)\d*', r'**\1**', 'A1, B12, K200')

'**A**, **B**, **K**'
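
Group references in the replacement can also reorder the matched text. The sketch below rewrites day-month-year dates as year-month-day and uses `re.subn()` to count replacements; the sample strings are invented for the example.

```python
import re

# \1, \2, \3 in the replacement refer to the day, month and year groups
dates = 'due 25-12-2023 and 01-02-2024'
iso = re.sub(r'(\d\d)-(\d\d)-(\d{4})', r'\3-\2-\1', dates)
print(iso)   # due 2023-12-25 and 2024-02-01

# subn() returns the new string together with the number of substitutions
masked, count = re.subn(r'\d', '#', 'pin 4821')
print(masked, count)   # pin #### 4
```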