# Regex

  * Basic regex escapes
  * Usage of raw strings
  * Available flags
  * Basic API

Great [reference document](https://docs.python.org/3/howto/regex.html).

## Regex Special Matches
The most common special sequences (the bracketed equivalents assume ASCII text) -
  * `\d`: Matches any digit, same as `[0-9]`.
  * `\D`: Matches anything but a digit, same as `[^0-9]`.
  * `\s`: Matches any whitespace character, same as `[ \t\n\r\f\v]`.
  * `\S`: Matches any non-whitespace character, same as `[^ \t\n\r\f\v]`.
  * `\w`: Matches any word character (alphanumeric or underscore), same as `[a-zA-Z0-9_]`.
  * `\W`: Matches any non-word character, same as `[^a-zA-Z0-9_]`.
  * `\b`: A boundary specifier that matches word boundaries. So `\bat\b` matches `"at"`, `"at."`, `"(at)"`, `"as at ay"`, but not `"attempt"`, or `"atlas"`.
  * `\B`: A boundary specifier that matches non-word boundaries. So `at\B` matches `"athens"`, `"attorney"`, `"atom"`, but not `"at"`, `"at."`.

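A minimal sketch of a few of these sequences in action. The sample `text` is made up for illustration; note how the underscore counts as a word character for both `\w` and `\b`.

```python
import re

text = "Room 42 at_the atlas, at 9."

# \d+ matches runs of digits.
digits = re.findall(r"\d+", text)
print(digits)

# \w+ matches runs of word characters, and the underscore is one of them.
words = re.findall(r"\w+", text)
print(words)

# \bat\b matches "at" only when it stands alone as a whole word.
whole_word = re.findall(r"\bat\b", text)
print(whole_word)

# at\B matches "at" only when another word character follows it.
prefix_starts = [m.start() for m in re.finditer(r"at\B", text)]
print(prefix_starts)
```
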
## Raw Strings
Regexes make liberal use of backslashes, but Python treats backslashes in strings as special, e.g., `"\n"` is a single newline character. This means that regex backslashes have to be escaped in regular Python strings, so a regex like `\w\-\d` has to be written as `"\\w\\-\\d"`, which can quickly become unwieldy. However, if I prefix a string with an `r`, I am telling Python not to treat backslashes as special: `r"\n"` is just two characters, a backslash followed by the letter n. Now regexes can be written as-is.

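A quick way to convince myself of this is to compare the two spellings directly. The variable names and the sample input below are only for illustration.

```python
import re

# A regular string needs every regex backslash doubled; a raw string does not.
escaped = "\\w\\-\\d"
raw = r"\w\-\d"
same = escaped == raw
print(same)

# "\n" is one newline character, r"\n" is a backslash plus the letter n.
lengths = (len("\n"), len(r"\n"))
print(lengths)

# Either spelling gives the regex engine the same pattern.
found = re.search(raw, "item a-1 in stock").group()
print(found)
```
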
## Flags
Some common flags -

| Flag | Meaning |
|------|------------|
| `DOTALL` or `S` | Makes the `.` character match newlines as well |
| `IGNORECASE` or `I` | Performs case-insensitive matching |
| `MULTILINE` or `M` | Makes `^` and `$` match at the start and end of every line, not just of the whole string |

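A small sketch of what each flag changes, using a made-up two-line `text`. Flags can be combined with the bitwise OR operator, as in the last call.

```python
import re

text = "first line\nSecond LINE"

# Without DOTALL, "." stops at the newline, so this pattern cannot match.
no_dotall = re.search(r"first.*LINE", text)
# With DOTALL, "." also matches the newline and the match spans both lines.
with_dotall = re.search(r"first.*LINE", text, re.DOTALL)
print(no_dotall, with_dotall.group() == text)

# IGNORECASE lets "line" match "LINE" as well.
lines = re.findall(r"line", text, re.IGNORECASE)
print(lines)

# MULTILINE makes ^ match at the start of every line.
starts = re.findall(r"^\w+", text, re.MULTILINE)
print(starts)

# Combining flags: lines that start with "s" in any case.
combined = re.findall(r"^s\w+", text, re.MULTILINE | re.I)
print(combined)
```
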
## APIs
  * `match`: find the pattern if it is at the beginning of the string.
  * `search`: find the first occurrence of the pattern anywhere in the string.
  * `findall`: finds all occurrences of the pattern starting anywhere in the string.
  * `finditer`: same as above, except it returns an iterator of `Match` objects.
  * `Match` object: has methods `.group`, `.span`, `.start`, and `.end`.
  * `sub`: substitutes the pattern with another given string in the input string.
  * `split`: splits the input string on the pattern.
  * `compile`: creates a regex object that can be reused.

In [1]:
import re

In [7]:
"One Ring to rule them all, One Ring to find them, One Ring to bring them all, and in the darkness bind them."

'One Ring to rule them all, One Ring to find them, One Ring to bring them all, and in the darkness bind them.'

### `match` and `search`
`match` matches only at the beginning of the string; `search` finds the first occurrence even if it is somewhere in the middle of the string. Both return a `re.Match` object, or `None` if there is no match.

In [None]:
re.match(r"one", "One ring to rule them all, One ring to find them", re.IGNORECASE)

<re.Match object; span=(0, 3), match='One'>

In [None]:
re.search(r"one", "One Ring to rule them all, One ring to find them", re.IGNORECASE)

<re.Match object; span=(0, 3), match='One'>

In [19]:
re.match(r"ring", "One ring to rule them all, One ring to find them") is None

True

In [20]:
re.search(r"ring", "One ring to rule them all, One ring to find them")

<re.Match object; span=(4, 8), match='ring'>

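Since both functions return `None` when nothing matches, I usually check the result before calling methods on it. A minimal sketch of that pattern, with `report` as an illustrative name:

```python
import re

text = "One ring to rule them all"

report = []
for func in (re.match, re.search):
    result = func(r"ring", text)
    if result is None:
        # Calling result.span() here would raise AttributeError.
        report.append(f"{func.__name__}: no match")
    else:
        report.append(f"{func.__name__}: found at {result.span()}")

print(report)
```
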
### `findall` and `finditer`
Both of these do a "global" search and find **all** occurrences of the pattern in the input string. `findall` returns a list; `finditer` returns an iterator of `re.Match` objects.

In [23]:
re.findall(r"\d+", "A dozen has 12 items, but a baker's dozen has 13 items!")

['12', '13']

In [24]:
for match in re.finditer(r"\d+", "A dozen has 12 items, but a baker's dozen has 13 items!"):
    print(match)

<re.Match object; span=(12, 14), match='12'>
<re.Match object; span=(46, 48), match='13'>


### `re.Match` object
Its most commonly used methods are -
  * `start()`: the starting index of the match.
  * `end()`: the ending index of the match.
  * `span()`: A tuple with the start and the end indices.
  * `group()`: The match itself.

In [28]:
match = re.search(r"ring", "One ring to rule them all, One ring to find them")
print(match.start())
print(match.end())
print(match.span())

4
8
(4, 8)


In [29]:
match.group()

'ring'

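The indices follow the usual Python slicing convention: the end index is exclusive. A short sketch that ties `span()` back to `group()`:

```python
import re

text = "One ring to rule them all, One ring to find them"
match = re.search(r"ring", text)

start, end = match.span()

# Slicing the input with the span gives back the matched text.
same_text = text[start:end] == match.group()
print(same_text)

# Because the end index is exclusive, the match length is end - start.
length = end - start
print(length)
```
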
### Grouping
This is just normal regex grouping. `re.Match.group(n)` retrieves the n-th group, `group()` (or `group(0)`) returns the whole match, and `groups()` returns all the groups as a tuple.

In [48]:
match = re.search(r"(\w+)@(\w+\.\w+)", "My email is avilay@gmail.com for most things.")

In [49]:
match.group()

'avilay@gmail.com'

In [50]:
match.groups()

('avilay', 'gmail.com')

In [51]:
match.group(1)

'avilay'

In [52]:
match.group(2)

'gmail.com'

In [53]:
match.group(3)

IndexError: no such group

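Grouping changes what `findall` returns: when the pattern contains groups, `findall` returns the groups instead of the whole match (a list of strings for one group, a list of tuples for several). `finditer` yields `re.Match` objects, so both the whole match and the individual groups are available.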

In [None]:
re.findall(r"(\d+)\s+items", "A dozen has 12 items, but a baker's dozen has 13 items!")

['12', '13']

In [46]:
for match in re.finditer(r"(\d+)\s+items", "A dozen has 12 items, but a baker's dozen has 13 items!"):
    print(match.group())
    print(match.group(1))

12 items
12
13 items
13

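To see how the number of groups shapes the `findall` result, here is a small sketch. The second address is a made-up sample added only so that there is more than one match.

```python
import re

# The second email address is hypothetical sample data.
text = "My emails are avilay@gmail.com and frodo@shire.org."

# No groups: each item is the whole match.
no_groups = re.findall(r"\w+@\w+\.\w+", text)
print(no_groups)

# One group: each item is just that group's text.
one_group = re.findall(r"(\w+)@\w+\.\w+", text)
print(one_group)

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(\w+)@(\w+\.\w+)", text)
print(two_groups)
```
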

### `sub`
I can replace all the occurrences of the pattern with its replacement, or I can replace only the first n occurrences, leaving the rest as-is. The API is -
```python
sub(pattern, replacement, input_string, count=0, flags=0)
```
Unless I am using regex as the pattern, it is often faster to use the `str.replace` method.
```python
input_string.replace(old, replacement, count=-1)
```

In [77]:
re.sub(
    r"one\s+ring", 
    "💍", 
    "One ring to rule them all, one Ring to find them", 
    flags=re.IGNORECASE
)

'💍 to rule them all, 💍 to find them'

In [78]:
re.sub(
    r"one\s+ring",
    "💍",
    "One ring to rule them all, one Ring to find them",
    count=1,
    flags=re.IGNORECASE,
)

'💍 to rule them all, one Ring to find them'

In [79]:
"One ring to rule them all, one ring to find them".replace("ring", "💍")

'One 💍 to rule them all, one 💍 to find them'

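`str.replace` also accepts a count. A minimal sketch comparing it with `re.sub` on a literal pattern; `re.escape` is used here only to make sure the pattern is treated literally.

```python
import re

text = "One ring to rule them all, one ring to find them"

# Replace only the first occurrence with str.replace.
first_only = text.replace("ring", "💍", 1)
print(first_only)

# The equivalent re.sub call for a literal (non-regex) pattern.
via_sub = re.sub(re.escape("ring"), "💍", text, count=1)
print(via_sub == first_only)
```
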
### `split`
The basic scenario is to split a string on a regex. Like `sub`, I can choose to stop after some number of splits. 

```python
re.split(pattern, input_string, maxsplit=0, flags=0)
```
If I wrap the pattern in a capturing group, the matched separators are included in the output along with the splits.

If I want to split on whitespace or a non-regex string, then I am better off using `str.split`.
```python
input_string.split(sep=None, maxsplit=-1)
```


In [82]:
text = "one ring to rule them all, one ring to find them. one ring to bring them all, and in the darkness bind them."
re.split(r"one\s+ring", text)

['',
 ' to rule them all, ',
 ' to find them. ',
 ' to bring them all, and in the darkness bind them.']

In [83]:
re.split(r"(one\s+ring)", text)

['',
 'one ring',
 ' to rule them all, ',
 'one ring',
 ' to find them. ',
 'one ring',
 ' to bring them all, and in the darkness bind them.']

In [84]:
"One\nring\tto   rule\r\nthem\t\rall".split()

['One', 'ring', 'to', 'rule', 'them', 'all']

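Limiting the number of splits looks like this. It is a short sketch with a shortened sample `text`; the second call shows the matching limit on `str.split`.

```python
import re

text = "one ring to rule them all, one ring to find them. one ring to bring them all"

# maxsplit=1 stops after the first split; the rest of the string stays whole.
parts = re.split(r"one\s+ring", text, maxsplit=1)
print(parts)

# str.split takes the same kind of limit for a literal separator.
words = "one ring to rule them all".split(" ", 2)
print(words)
```
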
### `compile`
If I am going to be using the same regex multiple times, it is better to create a regex object using the `re.compile` API. Everywhere I use `re.<api>(...)` I can substitute it with -
```python
pattern = re.compile(<regex pattern>)
pattern.<api>(...)
```

In [87]:
pattern = re.compile(r"ring")
pattern.search("One ring to rule them all, One ring to find them")

<re.Match object; span=(4, 8), match='ring'>
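
Flags are given once at compile time, and the compiled object then exposes the same APIs without the pattern argument. A minimal sketch reusing one pattern for several calls:

```python
import re

# The flag is passed to compile, not to the individual calls.
pattern = re.compile(r"one\s+ring", re.IGNORECASE)

text = "One ring to rule them all, one Ring to find them"

count = len(pattern.findall(text))
print(count)

first = pattern.search(text).span()
print(first)

replaced = pattern.sub("💍", text, count=1)
print(replaced)
```
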