**Tutorial source:** https://docs.python.org/3.6/howto/regex.html

**A good guide:** https://www.rexegg.com/

# Using regular expressions with the `re` module

Regular expressions are essentially a tiny, highly specialized programming language embedded inside Python. They are also called REs, regexes or regex patterns.

Using regexes, you specify the rules for the set of possible strings that you want to match. You can then ask: 'Does this string match the pattern?' You can also use REs to modify a string or to split it apart in various ways.

Regexes are compiled into a series of bytecodes, which are then executed by a matching engine written in C.

The regex language is relatively small and restricted, so not all possible string processing tasks can be done with it. There are also tasks that can be done with regexes, but the resulting expressions turn out to be very complicated. In these cases you may be better off writing some Python code to process the text. Although Python code will be slower than an elaborate regex, it will probably be more understandable.


## Simple Patterns



We'll begin with the most common task: matching characters.


## Matching Characters

Most letters and characters simply match themselves. There are exceptions to this rule: some characters are special metacharacters that don't match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the regex by repeating them or changing their meaning.

Here's a complete list of the metacharacters:

`. ^ $ * + ? { } [ ] \ | ( )` 

`[ ]` are used to specify a character class, which is a set of characters that you wish to match. Characters can be listed individually. To indicate a range of characters, give two characters and separate them with `-`.

For example, `[abc]` will match any of the characters `a`, `b` or `c`. 
This is the same as `[a-c]`. 

If you want to match only lowercase letters, your regex will be `[a-z]`.

Metacharacters (except `\`) are not active inside classes. For example, `[akm$]` will match any of `a`, `k`, `m` or `$`. `$` is usually a metacharacter, but inside a character class it's stripped of its special nature.

To match the characters not listed in the class, you should _complement_ the set. To do so, include `^` as the first character of the class. A `^` anywhere else in a character class has no special meaning and simply matches the `^` character. For example, `[^5]` will match any character except `5`, while `[5^]` will match either `5` or `^`.
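
A short sketch of how these classes behave, using `re.findall()` from the standard library to list every single-character match. The sample strings and variable names are made up for illustration.

```python
import re

# [abc] and [a-c] describe the same set of characters.
listed = re.findall(r"[abc]", "aback")
ranged = re.findall(r"[a-c]", "aback")
print(listed == ranged, listed)   # True ['a', 'b', 'a', 'c']

# Inside a class, $ is an ordinary character.
dollar_hits = re.findall(r"[akm$]", "make $5")
print(dollar_hits)                # ['m', 'a', 'k', '$']

# A leading ^ complements the class: everything except 5.
not_five = re.findall(r"[^5]", "1555a")
print(not_five)                   # ['1', 'a']
```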

Perhaps the most important metacharacter is the backslash `\`. It can be followed by various characters to signal various special sequences. It's also used to escape metacharacters so that you can still match them in patterns. For example, if you need to match a `[` or `\`, you can precede them with a backslash to remove their special meaning: `\[` or `\\`.
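
As a rough illustration of escaping, the sketch below matches a literal `[` and a literal backslash. The sample text is invented; `re.escape()` is a standard-library helper that is not covered in these notes, shown here only as one convenient way to build an escaped pattern.

```python
import re

text = r"see [1] in C:\temp"   # raw string: contains one backslash

# \[ matches a literal opening bracket, \\ a literal backslash.
brackets = re.findall(r"\[", text)
backslashes = re.findall(r"\\", text)
print(brackets, len(backslashes))   # ['['] 1

# re.escape() adds the backslashes for you when the text to match
# comes from elsewhere.
literal = re.escape("[1]")
found = re.search(literal, text).group()
print(found)                        # [1]
```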



Some of the sequences beginning with `\` represent predefined sets of characters.

This list is not complete:

`\d` - any decimal digit, the same as the class `[0-9]`

`\D` - any non-digit character, the same as the class `[^0-9]`

`\s` - any whitespace character, the same as the class `[ \t\n\r\f\v]`

`\S` - any non-whitespace character, the same as the class `[^ \t\n\r\f\v]`

`\w` - any alphanumeric character, the same as the class `[a-zA-Z0-9_]`

`\W` - any non-alphanumeric character, the same as the class `[^a-zA-Z0-9_]`

The class equivalents are exact for bytes patterns or with the `re.ASCII` flag. For `str` patterns, these sequences also match the corresponding Unicode characters.


These sequences can be included inside a character class. For example, `[\s,.]` will match any whitespace character, `,` or `.`.
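
To see the predefined sets in action, here is a small sketch on an invented sample string. It uses `re.findall()` and `re.split()` from the standard library, and the `+` repetition that is explained further down.

```python
import re

sample = "Room 42, floor_3\tEast"

digits = re.findall(r"\d", sample)        # single decimal digits
words = re.findall(r"\w+", sample)        # runs of letters, digits, _
chunks = re.findall(r"\S+", sample)       # runs of non-whitespace
pieces = re.split(r"[\s,.]+", sample)     # split on whitespace, "," or "."

print(digits)   # ['4', '2', '3']
print(words)    # ['Room', '42', 'floor_3', 'East']
print(chunks)   # ['Room', '42,', 'floor_3', 'East']
print(pieces)   # ['Room', '42', 'floor_3', 'East']
```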


The final metacharacter in this section is `.`. It matches anything except a newline character, and there's an alternate mode (`re.DOTALL`) where it will match even a newline. It's used where you want to match 'any character'.
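
A minimal sketch of the dot's behaviour around newlines. The sample string is made up; `re.DOTALL` is the standard-library flag that lets `.` match a newline as well.

```python
import re

text = "abc a\nc a-c"

# By default, "." refuses to match the newline in "a\nc".
default_hits = re.findall(r"a.c", text)
print(default_hits)   # ['abc', 'a-c']

# With re.DOTALL, "." matches any character, newline included.
dotall_hits = re.findall(r"a.c", text, re.DOTALL)
print(dotall_hits)    # ['abc', 'a\nc', 'a-c']
```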

## Repeating

The `*` metacharacter says that the previous character can be matched zero or more times, instead of exactly once.

For example, `ca*t` will match `ct` (0 `a` chars), `cat` (1 `a`), `caaat` (3 `a` chars), and so forth.

Repetitions such as `*` are greedy. The matching engine will try to repeat as many times as possible. If later portions of the pattern don't match, the matching engine will then back up and try again with fewer repetitions.
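
The following sketch checks the `ca*t` examples and shows greedy matching backing off. `re.fullmatch()` (standard library) is used so that the whole candidate string has to match; the pattern `a[bcd]*b` and the candidate strings are chosen only for illustration.

```python
import re

# ca*t: zero or more "a" characters between "c" and "t".
star_results = {}
for candidate in ["ct", "cat", "caaat", "cbt"]:
    star_results[candidate] = bool(re.fullmatch(r"ca*t", candidate))
print(star_results)
# {'ct': True, 'cat': True, 'caaat': True, 'cbt': False}

# Greedy matching with backtracking: [bcd]* first grabs "bcbd",
# then gives characters back until the final "b" can match.
m = re.match(r"a[bcd]*b", "abcbd")
print(m.group())   # abcb
```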

`+` is another repeating metacharacter. It matches one or more times. Mind the difference: `*` matches zero or more times, so whatever is repeated may not be present at all, while `+` requires at least one occurrence.

`ca+t` will match `cat` (1 `a`) and `caaat` (3 `a`), but won't match `ct`.

`?` matches once or zero times, marking something as optional. 
For example, `home-?brew` matches either `homebrew` or `home-brew`.
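
A quick sketch comparing `+` and `?` on a few candidate strings. The strings and dictionary names are illustrative; `fullmatch()` requires the entire string to match the pattern.

```python
import re

plus = re.compile(r"ca+t")
optional = re.compile(r"home-?brew")

# + needs at least one "a", so "ct" is rejected.
plus_results = {s: bool(plus.fullmatch(s)) for s in ["ct", "cat", "caaat"]}
print(plus_results)
# {'ct': False, 'cat': True, 'caaat': True}

# ? allows zero or one hyphen, but not two.
optional_results = {
    s: bool(optional.fullmatch(s))
    for s in ["homebrew", "home-brew", "home--brew"]
}
print(optional_results)
# {'homebrew': True, 'home-brew': True, 'home--brew': False}
```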

`{m,n}` is the most complicated repetition. _m_ and _n_ are decimal integers. There must be at least _m_ repetitions and at most _n_. Write it without spaces: in Python, `{m, n}` with a space after the comma is not treated as a repetition.

For example, `a/{1,3}b` will match `a/b`, `a//b` and `a///b`. It won't match `ab`, which has no slashes, or `a////b`, which has four.

If you omit _m_, the lower limit is 0, while omitting _n_ results in an upper bound of infinity.

`{0,}` is the same as `*`

`{1,}` is equivalent to `+`

`{0,1}` is the same as `?`

It's better to use `*`, `+` or `?` because they're shorter and easier to read.
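
The brace form can be checked the same way. This sketch tests the `a/{1,3}b` example and then compares each brace form with its shorthand on a handful of sample strings; agreement on these samples illustrates the equivalence, it does not prove it.

```python
import re

slashes = re.compile(r"a/{1,3}b")
slash_results = {
    s: bool(slashes.fullmatch(s))
    for s in ["ab", "a/b", "a//b", "a///b", "a////b"]
}
print(slash_results)
# {'ab': False, 'a/b': True, 'a//b': True, 'a///b': True, 'a////b': False}

# Compare each brace form with its shorthand on a few samples.
samples = ["", "a", "aa", "aaa", "b"]
pairs = [(r"a{0,}", r"a*"), (r"a{1,}", r"a+"), (r"a{0,1}", r"a?")]
agreement = []
for braces, short in pairs:
    same = all(
        bool(re.fullmatch(braces, s)) == bool(re.fullmatch(short, s))
        for s in samples
    )
    agreement.append(same)
    print(braces, short, same)   # each line ends with True
```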

# Using regular expressions

The `re` module provides an interface to the regular expression engine. It allows you to compile regexes into objects and then perform matches with them.

## Compiling Regular Expressions

Regexes are compiled into pattern objects. These objects have methods for various operations, such as searching for pattern matches or performing string substitutions.

In [1]:
import re
p = re.compile('ab*')
p

re.compile(r'ab*', re.UNICODE)

```re.compile()``` also accepts an optional flags argument. These flags enable various special features and syntax variations. We'll turn to the flags later, but for now a single example will do:

In [2]:
p = re.compile('ab*', re.IGNORECASE)
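
To make the effect of the flag visible, here is a small sketch that compiles the same pattern with and without `re.IGNORECASE`. The variable names and the test string are illustrative.

```python
import re

plain = re.compile("ab*")
relaxed = re.compile("ab*", re.IGNORECASE)

# Without the flag, uppercase text does not match the lowercase pattern.
plain_result = plain.match("ABBB")
print(plain_result)             # None

# With re.IGNORECASE, letter case is ignored during matching.
relaxed_result = relaxed.match("ABBB")
print(relaxed_result.group())   # ABBB
```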

The regex is passed to ```re.compile()``` as a string. Regexes are handled as strings because they aren't part of the core Python language, and no special syntax was created for expressing them. Instead, ```re``` is simply a C extension module included with Python, like ```socket``` or ```zlib```.

Putting regexes in strings keeps the Python language simpler, but has one disadvantage, which is the topic of the next section.

## The Backslash Plague

As mentioned earlier, regexes use the backslash character ```\``` to indicate special forms or to allow special characters to be used without invoking their special meaning.  

But this conflicts with Python's usage of the same character for the same purpose in string literals.

For example, say you want to write a regex that matches the string ```\section```, which might be found in a LaTeX file. Start with the desired string to be matched. Next, escape any backslashes and other metacharacters by preceding them with a backslash: ```\\section```. However, to express this as a Python string literal, both backslashes must be escaped again: ```\\\\section```.

| Characters | Stage |
| --- | --- |
|```\section```|Text string to be matched|
|```\\section```|Escaped backslash for ```re.compile()```|
|```\\\\section```|Escaped backslashes for a string literal|

The solution is to use Python's raw string notation. In a string literal prefixed with ```r```, backslashes are not handled in any special way. 

So ```r"\n"``` has two characters, ```\``` and ```n```, while ```"\n"``` is a single newline character.

In addition, special escape sequences that are valid in regexes but not valid as Python string literals now result in a ```DeprecationWarning``` and will eventually become a ```SyntaxError```. This means the sequences will be invalid unless you use raw string notation or escape the backslashes.

|Regular String | Raw string|
|---|---|
|```"ab*"```| ```r"ab*"```|
|```"\\\\section"``` | ```r"\\section"```|
|```"\\w+\\s+\\1"``` | ```r"\w+\s+\1"``` |
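
The equivalence of the two notations can be checked directly. In this sketch the LaTeX snippet and the variable names are invented; the point is only that both literals produce the same pattern text.

```python
import re

latex = r"\section{Intro}"   # the text contains a single backslash

cooked = re.compile("\\\\section")   # regular literal: four backslashes
raw = re.compile(r"\\section")       # raw literal: two backslashes

# Both literals end up as the same two-backslash regex.
print(cooked.pattern == raw.pattern)   # True
print(raw.search(latex).group())       # \section

# A raw "\n" is two characters; a regular "\n" is one newline.
print(len("\n"), len(r"\n"))           # 1 2
```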

## Performing Matches

Once you have an object representing a compiled regex, you can use its methods and attributes. The most significant ones are:

- ```match()``` - determine if the regex matches at the beginning of the string
- ```search()``` - scan through a string, looking for any location where the regex matches
- ```findall()``` - find all substrings where the regex matches, and return them as a list
- ```finditer()``` - find all substrings where the regex matches, and return them as an iterator

```match()``` and ```search()``` return ```None``` if no match is found. If they are successful, a match object is returned. This match object contains information about the match: where it starts and ends, the substring it matched, and more.
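
As a side-by-side sketch, the four methods can be run against one string. The pattern and the sample text are illustrative only.

```python
import re

p = re.compile(r"[a-z]+")
text = "::: two words"

# match() only looks at the start of the string, which is ":" here.
at_start = p.match(text)
print(at_start)               # None

# search() scans forward and stops at the first match.
first = p.search(text)
print(first.group())          # two

# findall() returns the matched substrings as a list.
all_words = p.findall(text)
print(all_words)              # ['two', 'words']

# finditer() yields match objects one at a time.
from_iter = [m.group() for m in p.finditer(text)]
print(from_iter)              # ['two', 'words']
```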

In [1]:
import re
p = re.compile('[a-z]+')
p

re.compile(r'[a-z]+', re.UNICODE)

Now you can match various strings against the regex ```[a-z]+```. An empty string shouldn't match at all, since ```+``` means 'one or more repetitions'. ```match()``` should return ```None``` in this case. 

In [5]:
print(p.match(""))

None


Now let's try a string that will match:

In [7]:
m = p.match('tempo')
m

<_sre.SRE_Match object; span=(0, 5), match='tempo'>

Now you can query the match object for information about the matching string. The most important methods and attributes are:

- ```group()``` - return the string matched by the regex
- ```start()``` - return the starting position of the match
- ```end()``` - return the ending position of the match
- ```span()``` - return a tuple containing the (start, end) positions of the match

Try these methods to clarify their meaning :

In [8]:
m.group()

'tempo'

In [9]:
m.start(), m.end()

(0, 5)

In [10]:
m.span()

(0, 5)

```group()``` returns the substring that was matched by the regex.

```start()``` and ```end()``` return the starting and ending index of the match.

```span()``` returns both start and end in a single tuple. 

Since ```match()``` checks only if the regex matches at the start of the string, ```start()``` will always be zero. However, ```search()``` scans through the string, so the match may not start at zero in that case.

In [11]:
print(p.match('::: message'))

None


In [12]:
m = p.search('::: message'); print(m)

<_sre.SRE_Match object; span=(4, 11), match='message'>


In [13]:
m.group()

'message'

In [14]:
m.span()

(4, 11)

In actual programs, the most common style is to store the match object in a variable, and then check if it was ```None```. This usually looks like :

In [17]:
p = re.compile('[a-z]+' )
m = p.match('strings goes here')
if m:
    print('match found: ', m.group())
else:
    print('No match')

match found:  strings


```findall()``` returns all the matches in a list:

In [18]:
p = re.compile(r'\d+')
p.findall('12 drummers drumming, 11 pipers piping, 10 lords \
a-leaping')

['12', '11', '10']

The ```r``` prefix, which makes the literal a raw string literal, is needed in this example because escape sequences in a normal "cooked" string literal that are not recognized by Python, as opposed to regexes, now result in a ```DeprecationWarning``` and will eventually become a ```SyntaxError```.

```finditer()``` returns a sequence of match objects as an iterator :

In [19]:
iterator = p.finditer('12 drummes, 11 pipers, 10 lords...')
iterator

<callable_iterator at 0x1c99ec70518>

In [20]:
for match in iterator:
    print(match.span())

(0, 2)
(12, 14)
(23, 25)
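
Because ```finditer()``` yields full match objects, each item can be queried with the methods described earlier. A minimal sketch, with a sample sentence chosen for illustration, that collects the matched text together with its span:

```python
import re

p = re.compile(r"\d+")
text = "12 drummers, 11 pipers, 10 lords"

# Each item from finditer() is a match object, so group() and span()
# are both available.
found = [(m.group(), m.span()) for m in p.finditer(text)]
print(found)
# [('12', (0, 2)), ('11', (13, 15)), ('10', (24, 26))]

# The span indexes back into the original string.
slices = [text[start:end] for _, (start, end) in found]
print(slices)   # ['12', '11', '10']
```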
