# Using regular expressions with the `re` module

Regular expressions are a tiny, highly specialized programming language
embedded inside Python. They are also called REs, regexes, or regex
patterns.

Using regexes, you specify the rules for the set of possible strings
that you want to match. You can then ask: 'Does this string match the pattern?' You can also use REs to modify a string or to split it apart in various ways.
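
As a rough sketch of those three uses (matching, modifying, splitting), the snippet below relies on `re.search`, `re.sub`, and `re.split` from the standard library. The pattern and the sample text are made up purely for illustration.

```python
import re

# Hypothetical pattern and sample text, chosen only for illustration.
pattern = r"[0-9]+"          # one or more decimal digits
text = "room 12, floor 3"

# Does the string contain a match for the pattern?
has_match = re.search(pattern, text) is not None

# Modify the string: replace every match with '#'.
masked = re.sub(pattern, "#", text)

# Split the string apart wherever the pattern matches.
parts = re.split(pattern, text)

print(has_match)  # True
print(masked)     # room #, floor #
print(parts)      # ['room ', ', floor ', '']
```

The trailing empty string in `parts` appears because the last match sits at the very end of the text.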

Regexes are compiled into a series of bytecodes which are then executed by a matching engine written in C. 

The regex language is relatively small and restricted, so not all possible string processing tasks can be done with it. There are also tasks that can be done with regexes, but the resulting expressions turn out to be very complicated. In these cases you may be better off writing some Python code to process the text. Although the Python code will be slower than an elaborate regex, it will probably be more understandable.


## Simple Patterns



We'll begin with the most common task: matching characters.


## Matching Characters

Most letters and characters will simply match themselves. There are exceptions to this rule: some characters are special metacharacters and don't match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the regex by repeating them or changing their meaning.

Here's a complete list of the metacharacters:

`. ^ $ * + ? { } [ ] \ | ( )` 

`[` and `]` are used to specify a character class, which is a set of characters that you wish to match. Characters can be listed individually. To indicate a range of characters, give the two endpoint characters separated by `-`.

For example, `[abc]` will match any of the characters `a`, `b`, or `c`.
This is the same as `[a-c]`, which uses a range to express the same set of characters.

If you want to match only lowercase letters, your regex will be `[a-z]`.
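
A minimal sketch of these classes in action, using `re.findall` to list every character a class matches. The sample strings are arbitrary.

```python
import re

# re.findall returns every non-overlapping match as a list of strings.
listed = re.findall(r"[abc]", "a cab ride")   # characters listed individually
ranged = re.findall(r"[a-c]", "a cab ride")   # the same set written as a range
lowercase = re.findall(r"[a-z]", "Hi Bob")    # lowercase ASCII letters only

print(listed)     # ['a', 'c', 'a', 'b']
print(ranged)     # ['a', 'c', 'a', 'b']
print(lowercase)  # ['i', 'o', 'b']
```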

Most metacharacters (the backslash `\` is the main exception) are not active inside classes. For example, `[akm$]` will match any of the characters `a`, `k`, `m`, or `$`. `$` is usually a metacharacter, but inside a character class it's stripped of its special nature.

To match the characters not listed in the class, you should _complement_ the set. To do so, include `^` as the first character of the class. For example, `[^5]` will match any character except `5`. If `^` appears anywhere else in a character class, it has no special meaning: `[5^]` will match either `5` or `^`.
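
The following sketch checks the behaviour of `$` and `^` inside a class. The input strings are invented for the example.

```python
import re

# '$' is an ordinary character inside a class.
dollar_in_class = re.findall(r"[akm$]", "make $5")

# '^' as the first character complements the set: anything except '5'.
not_five = re.findall(r"[^5]", "1525")

# '^' anywhere else in the class is just a literal caret.
caret_literal = re.findall(r"[5^]", "5^2")

print(dollar_in_class)  # ['m', 'a', 'k', '$']
print(not_five)         # ['1', '2']
print(caret_literal)    # ['5', '^']
```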

Perhaps the most important metacharacter is the backslash, `\`. It can be followed by various characters to signal special sequences. It's also used to escape the metacharacters so that you can still match them literally in patterns. For example, if you need to match a `[` or `\`, you can precede it with a backslash to remove its special meaning: `\[` or `\\`.
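
A small sketch of escaping, written with raw string literals (`r"..."`) so that Python itself does not interpret the backslashes before the regex engine sees them. The sample text is hypothetical; `re.escape` is a standard-library helper that adds the escaping for you.

```python
import re

# Hypothetical sample text; the raw string keeps the backslash as one character.
text = r"path\to[0]"

open_brackets = re.findall(r"\[", text)  # the pattern \[ matches a literal '['
backslashes = re.findall(r"\\", text)    # the pattern \\ matches a literal backslash

# re.escape builds the escaped pattern when a literal string must be matched as-is.
escaped = re.escape("[")

print(open_brackets)  # ['[']
print(backslashes)    # ['\\']  (a single backslash, shown in repr form)
print(escaped)        # \[
```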



Some of the sequences beginning with `\` represent predefined sets of characters.

This list is not complete:

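`\d` - any decimal digit, the same as the class `[0-9]`

`\D` - any non-digit character, the same as the class `[^0-9]`

`\s` - any whitespace character, the same as the class `[ \t\n\r\f\v]`

`\S` - any non-whitespace character, the same as the class `[^ \t\n\r\f\v]`

`\w` - any alphanumeric character, the same as the class `[a-zA-Z0-9_]`

`\W` - any non-alphanumeric character, the same as the class `[^a-zA-Z0-9_]`

The equivalent classes above describe ASCII matching. For `str` patterns in Python 3, these sequences also match the corresponding Unicode characters unless the `re.ASCII` flag is used.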


These sequences can be included inside a character class. For example, `[\s,.]` will match any whitespace character, `,`, or `.`.
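
To make the predefined sets concrete, here is a short sketch that applies a few of them to arbitrary ASCII sample strings.

```python
import re

digits = re.findall(r"\d", "a1b22")          # decimal digits
non_digits = re.findall(r"\D", "a1b22")      # everything that is not a digit
word_chars = re.findall(r"\w", "x_1 !")      # letters, digits, and underscore
separators = re.findall(r"[\s,.]", "a, b.c") # whitespace, ',' or '.'

print(digits)      # ['1', '2', '2']
print(non_digits)  # ['a', 'b']
print(word_chars)  # ['x', '_', '1']
print(separators)  # [',', ' ', '.']
```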


The final metacharacter in this section is `.`. It matches anything except a newline character. It's used where you want to match 'any character'.
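
A quick sketch of the newline exception. The second call uses the `re.DOTALL` flag, which is not covered in the text above and is shown only for contrast: with it, `.` matches a newline as well.

```python
import re

text = "a\nb"  # two letters separated by a newline

default_dot = re.findall(r".", text)                  # newline is skipped
dotall_dot = re.findall(r".", text, flags=re.DOTALL)  # newline is matched too

print(default_dot)  # ['a', 'b']
print(dotall_dot)   # ['a', '\n', 'b']
```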

## Repeating

The `*` metacharacter specifies that the previous character can be matched zero or more times, instead of exactly once.

For example, `ca*t` will match `ct` (0 `a` characters), `cat` (1 `a`), `caaat` (3 `a` characters), and so forth.
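
One way to check this is with `re.fullmatch`, which succeeds only if the whole string matches the pattern. The candidate list is made up for the example.

```python
import re

candidates = ["ct", "cat", "caaat", "cot"]

# Keep only the strings that match 'ca*t' from start to end.
star_matches = [s for s in candidates if re.fullmatch(r"ca*t", s)]

print(star_matches)  # ['ct', 'cat', 'caaat']
```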

Repetitions such as `*` are greedy: the matching engine will try to repeat as many times as possible. If later portions of the pattern don't match, the matching engine will then back up and try again with fewer repetitions.
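
The backing-up behaviour can be observed with a small, illustrative pattern. In `a[bcd]*b`, the class first swallows `bcbd`, then the engine gives characters back one at a time until the final `b` of the pattern can match.

```python
import re

# [bcd]* first consumes 'bcbd', but then the trailing 'b' has nothing left to match.
# The engine backs up to 'bcb' (next char is 'd', still no match), then to 'bc',
# at which point the next character is 'b' and the overall match succeeds.
match = re.match(r"a[bcd]*b", "abcbd")
greedy_result = match.group()

print(greedy_result)  # abcb
```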

`+` is another repeating metacharacter. It matches one or more times. Mind the difference: `*` matches zero or more times, so whatever is repeated may not be present at all, while `+` requires at least one occurrence.

`ca+t` will match `cat` (1 `a`), `caaat` (3 `a`), but won't match `ct`
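
A side-by-side sketch of `*` and `+` on the same made-up candidates shows the difference.

```python
import re

candidates = ["ct", "cat", "caaat"]

star_matches = [s for s in candidates if re.fullmatch(r"ca*t", s)]
plus_matches = [s for s in candidates if re.fullmatch(r"ca+t", s)]

print(star_matches)  # ['ct', 'cat', 'caaat']
print(plus_matches)  # ['cat', 'caaat']  ('ct' has no 'a', so '+' rejects it)
```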

`?` matches either once or zero times. You can think of it as marking something as optional.
For example, `home-?brew` matches either `homebrew` or `home-brew`.
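
A brief sketch of the optional hyphen, using an invented sentence that also contains a spelling the pattern should reject.

```python
import re

text = "homebrew and home-brew, not home--brew"

# The hyphen may appear once or not at all; two hyphens do not match.
optional_matches = re.findall(r"home-?brew", text)

print(optional_matches)  # ['homebrew', 'home-brew']
```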

`{m,n}` is the most complicated repetition qualifier. _m_ and _n_ are decimal integers, written without a space after the comma. There must be at least _m_ repetitions and at most _n_.

For example, `a/{1,3}b` will match `a/b`, `a//b`, and `a///b`. It won't match `ab`, which has no slashes, or `a////b`, which has four.
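
The same example, sketched as a runnable check with `re.fullmatch`:

```python
import re

samples = ["ab", "a/b", "a//b", "a///b", "a////b"]

# Between one and three slashes are allowed between 'a' and 'b'.
bounded_matches = [s for s in samples if re.fullmatch(r"a/{1,3}b", s)]

print(bounded_matches)  # ['a/b', 'a//b', 'a///b']
```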

If you omit _m_, the lower limit is 0, while omitting _n_ results in an upper bound of infinity.

`{0,}` is the same as `*`

`{1,}` is equivalent to `+`

`{0,1}` is the same as `?`

It's better to use `*`, `+` or `?` because they're shorter and easier to read.
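
The equivalences can be spot-checked on a handful of strings. This is only a sketch: `matches` is a hypothetical helper defined here, and agreeing on a few samples illustrates the equivalence rather than proving it.

```python
import re

samples = ["ab", "a/b", "a//b", "a////b"]

def matches(pattern, strings):
    """Return the strings that match the pattern from start to end."""
    return [s for s in strings if re.fullmatch(pattern, s)]

# Each brace form should accept exactly the same samples as its shorthand.
star_same = matches(r"a/{0,}b", samples) == matches(r"a/*b", samples)
plus_same = matches(r"a/{1,}b", samples) == matches(r"a/+b", samples)
optional_same = matches(r"a/{0,1}b", samples) == matches(r"a/?b", samples)

# Omitting m gives a lower limit of 0.
up_to_two = matches(r"a/{,2}b", samples)

print(star_same, plus_same, optional_same)  # True True True
print(up_to_two)                            # ['ab', 'a/b', 'a//b']
```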

# Using regular expressions

The `re` module provides an interface to the regular expression engine. It allows you to compile regexes into objects and then perform matches with them.
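
A minimal sketch of that workflow: `re.compile` returns a pattern object whose methods (`match`, `search`, `findall`) perform the matching. The pattern `ab*` and the sample strings are chosen only for illustration.

```python
import re

# Compile once, then reuse the pattern object for several operations.
pattern = re.compile(r"ab*")

at_start = pattern.match("abbbc")    # match() only looks at the start of the string
anywhere = pattern.search("xxab")    # search() scans the string for the first match
all_found = pattern.findall("ab a abb")

print(at_start.group())  # abbb
print(anywhere.span())   # (2, 4)
print(all_found)         # ['ab', 'a', 'abb']
```

Both `match()` and `search()` return `None` when nothing matches, so real code should check the result before calling `group()` or `span()`.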