# Regex Cheatsheet
*Quick Reference*

## Anchors

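| Anchor | Description                                                                      |
|--------|----------------------------------------------------------------------------------|
| `\A`   | match start of string                                                            |
| `\Z`   | match end of string                                                              |
| `^`    | match start of string, or start of each line with `re.MULTILINE`                 |
| `$`    | match end of string, or end of each line with `re.MULTILINE`                     |
| `\b`   | word boundary (start or end of a word)                                           |
| `\B`   | inverse of `\b` (any position that is not a word boundary)                       |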
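
The difference between the string anchors and the line anchors is easiest to see on a two-line string. The following is a minimal sketch using Python's built-in `re` module; the sample text and variable names are illustrative only.

```python
import re

text = "first line\nsecond line"

# Without re.MULTILINE, ^ and $ anchor only at the ends of the whole string.
start_default = re.findall(r"^\w+", text)
end_default = re.findall(r"\w+$", text)

# With re.MULTILINE, ^ and $ also anchor at every line break.
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)
end_multiline = re.findall(r"\w+$", text, re.MULTILINE)

# \A ignores re.MULTILINE and always refers to the start of the string.
start_string = re.findall(r"\A\w+", text, re.MULTILINE)

# \b matches at a word boundary, \B everywhere else.
whole_word = re.findall(r"\bline\b", "line inline lines")
inside_word = re.findall(r"\Bline", "line inline lines")

print(start_default, end_default)      # ['first'] ['line']
print(start_multiline, end_multiline)  # ['first', 'second'] ['line', 'line']
print(start_string)                    # ['first']
print(whole_word, inside_word)         # ['line'] ['line']
```

Note that `inside_word` finds the `line` inside `inline`, while `whole_word` finds only the standalone word.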

## Character Groups

| Pattern                  | Description                                |
|--------------------------|--------------------------------------------|
| `(?:...)`                | non-capturing group                        |
| `(?P<name>...)`          | named capture group                        |
| `.`                      | any character except `\n`                  |
| `[]`                     | character class                            |
| `*`                      | 0 or more                                  |
| `+`                      | 1 or more                                  |
| `?`                      | 0 or 1                                     |
| `{3,5}`                  | 3 to 5 times                               |
| `pat1.*pat2\|pat2.*pat1` | match both pat1 and pat2, in either order  |
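
A short sketch of how these pieces combine in Python's `re` module. The date string, the group names `year` and `month`, and the `cat`/`dog` patterns are made up for illustration.

```python
import re

# Named groups are retrievable by name; the (?:...) group is matched but not captured.
date = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})(?:-\d{2})?", "2024-03-15")
year, month = date.group("year"), date.group("month")
captured = date.groups()

# {3,5} greedily matches three to five repetitions.
runs = re.findall(r"a{3,5}", "aa aaa aaaaaaa")

# Alternation of both orderings: match only if both words appear.
both = re.compile(r"cat.*dog|dog.*cat")
any_order = bool(both.search("dog chases cat"))
only_one = bool(both.search("cat only"))

print(year, month)          # 2024 03
print(captured)             # ('2024', '03')
print(runs)                 # ['aaa', 'aaaaa']
print(any_order, only_one)  # True False
```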

## Character Class

| Class             | Description                                                              |
|-------------------|--------------------------------------------------------------------------|
| `[ABC]`           | Match any character in the set                                           |
| `[^ABC]`          | Match any character not in the set                                       |
| `[A-Z]`           | Match any character in the range A to Z                                  |
| `.`               | Match any character except a newline. Shortcut for `[^\n]`               |
| `\w`              | Match word chars. Shortcut for `[A-Za-z0-9_]` in ASCII mode (`re.ASCII`) |
| `\W`              | Negated `\w`. Shortcut for `[^A-Za-z0-9_]` in ASCII mode                 |
| `\d`              | Match digits. Shortcut for `[0-9]` in ASCII mode                         |
| `\D`              | Negated `\d`. Shortcut for `[^0-9]` in ASCII mode                        |
| `\s`              | Match whitespace                                                         |
| `[\uXXXX-\uYYYY]` | Match a character in a code point range (see below)                      |
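
Two pitfalls are worth checking by hand: a range such as `[A-z]` also covers the ASCII punctuation that sits between `Z` and `a`, and in Python 3 `\w` is Unicode-aware for `str` patterns unless `re.ASCII` is set. A minimal sketch, with an arbitrary sample string:

```python
import re

sample = "Az_[^ é9"

# [A-z] spans code points 65 to 122, which includes [ \ ] ^ _ and the backtick.
too_wide = re.findall(r"[A-z]", sample)
letters_only = re.findall(r"[A-Za-z]", sample)

# \w matches Unicode word characters by default; re.ASCII narrows it to [A-Za-z0-9_].
word_unicode = re.findall(r"\w", sample)
word_ascii = re.findall(r"\w", sample, re.ASCII)

print(too_wide)      # ['A', 'z', '_', '[', '^']
print(letters_only)  # ['A', 'z']
print(word_unicode)  # ['A', 'z', '_', 'é', '9']
print(word_ascii)    # ['A', 'z', '_', '9']
```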

## Tips

### Unicode Representation

Imagine you've normalized text. It looks "normal" until you see this:

> some_ῷstring

Rather than trying to catch such characters one by one, a more robust approach is to match them by their Unicode escape sequences.

Escape sequences are listed online, but Python can produce them (and the character names) natively.

In [1]:
import unicodedata

s = "ῷ"
# escape sequence for the character, as ASCII text
unicode_repr = s.encode("unicode_escape").decode("ascii")

print(f"{s} : {unicode_repr}")
print(unicodedata.name(s))  # official Unicode character name

ῷ : \u1ff7
GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI


The escape sequence can be used directly in a regex pattern:

In [2]:
import re
re.findall(r"(\u1ff7)", "some_ῷstring")

['ῷ']
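
To find which unexpected characters a text contains in the first place, one option is to match everything outside the ASCII range and report each hit with its escape sequence and name. This is a rough sketch; the helper name `non_ascii_report` is hypothetical and the ASCII cutoff is an assumption about what counts as "normal" text.

```python
import re
import unicodedata

def non_ascii_report(text):
    """Return (char, escape, name) for each distinct non-ASCII character in text."""
    found = sorted(set(re.findall(r"[^\x00-\x7f]", text)))
    return [
        (
            ch,
            ch.encode("unicode_escape").decode("ascii"),
            unicodedata.name(ch, "UNKNOWN"),  # fallback for unnamed code points
        )
        for ch in found
    ]

for ch, escape, name in non_ascii_report("some_ῷstring"):
    print(f"{ch} : {escape} : {name}")
```

The escape sequences it reports can then be dropped into a pattern as shown above.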

You could also use it as one end of a range. The Unicode code point for "ῷ" is 8183, as `ord` shows:
```python
>>> ord("ῷ")
8183
```

In [3]:
import re

code_point = ord("ῷ")  # 8183

# escape sequences for the code points 100 below and 100 above "ῷ"
low, high = (chr(code_point + offset).encode("unicode_escape").decode("ascii") for offset in (-100, 100))
ranged_unicode_string = f"[{low}-{high}]"
print(ranged_unicode_string)


[\u1f93-\u205b]


In [4]:
re.compile(ranged_unicode_string).findall("'ᾚ''ῷ῾")

['ᾚ', 'ῷ', '῾']
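
The same kind of range can be used to replace characters rather than find them. The sketch below assumes the goal is to mask everything in the Greek Extended block (U+1F00 through U+1FFF, which contains "ῷ"); the helper name and the `?` placeholder are illustrative choices, not part of any library.

```python
import re

# Greek Extended block: U+1F00 through U+1FFF (assumed target range for this sketch)
greek_extended = re.compile(r"[\u1f00-\u1fff]")

def replace_greek_extended(text, placeholder="?"):
    """Replace every Greek Extended character in text with placeholder."""
    return greek_extended.sub(placeholder, text)

cleaned = replace_greek_extended("some_ῷstring")
print(cleaned)  # some_?string
```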

### Import regex as re

[regex](https://pypi.org/project/regex/) is a third-party library that provides more advanced functionality than the built-in `re` module.
It's mostly a drop-in replacement, so it's common to see
```python
import regex as re
```

Using [regex](https://pypi.org/project/regex/), we can take advantage of [Unicode Categories](https://en.wikipedia.org/wiki/Unicode_character_property#General_Category)
and [Unicode Blocks](https://www.regular-expressions.info/unicode.html).


In [5]:
import regex as re


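# test string: every printable character below code point 10000
chars = "".join(chr(i) for i in range(32, 10000) if chr(i).isprintable())

# raw string, so that \p reaches the regex engine instead of being read as a string escape
result = re.findall(r"\p{InBasicLatin}", chars)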
print(result)



[' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~']
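
When installing a third-party package is not an option, the standard library's `unicodedata.category` exposes the same general categories, although not as a regex construct. A minimal sketch of filtering by category prefix follows; the helper name `chars_in_category` and the sample string are hypothetical.

```python
import unicodedata

def chars_in_category(text, prefix):
    """Return the characters of text whose general category starts with prefix."""
    return [ch for ch in text if unicodedata.category(ch).startswith(prefix)]

sample = "aῷ1_ !"

letters = chars_in_category(sample, "L")      # letters, e.g. Ll, Lu
digits = chars_in_category(sample, "N")       # numbers, e.g. Nd
punctuation = chars_in_category(sample, "P")  # punctuation, e.g. Pc, Po

print(letters)      # ['a', 'ῷ']
print(digits)       # ['1']
print(punctuation)  # ['_', '!']
```

This filters character by character, so it is a stand-in for simple cases only and not a replacement for `\p{...}` inside larger patterns.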
