# Regular expressions

[Great tool for testing regular expressions - regex101.com](https://regex101.com/)

[Regular expressions on realpython (paid)](https://realpython.com/lessons/fun-further-reading/)

Tokens we can use in regular expressions:
- `\d` - digit
- `\D` - anything that is not a digit; the uppercase form negates the other shorthand tokens as well (`\S`, `\W`)
- `\s` - whitespace character, such as a space, tab, or newline
- `\w` - word character: a-z, A-Z, 0-9, _
- `^` - start of the string
- `$` - end of the string
- `[]` - character class - match a single character from a given set of characters
- `.` - matches any character except a newline
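
A minimal sketch of the tokens above, using Python's `re.findall` (the `re` module is covered later in these notes). The sample strings are made up for illustration only.

```python
import re

# Hypothetical sample string, chosen only to exercise each token.
sample = "Room 42_b\tis open"

digits = re.findall(r"\d", sample)        # every digit
non_digits = re.findall(r"\D", "a1b2")    # everything that is not a digit
whitespace = re.findall(r"\s", sample)    # the spaces and the tab
word_chars = re.findall(r"\w", "a_1 !")   # letters, digits, underscore
vowels = re.findall(r"[aeiou]", sample)   # one character from the given set
first_char = re.findall(r"^\w", sample)   # a word character at the start
last_char = re.findall(r"\w$", sample)    # a word character at the end
dot_chars = re.findall(r".", "a\nb")      # "." skips the newline by default

print(digits, non_digits, whitespace, word_chars)
print(vowels, first_char, last_char, dot_chars)
```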

Each of the tokens above matches a single character (`^` and `$` match a position rather than a character). To match several characters in a row, we can put a multiplicity (also called a quantifier) after a token:
- `{X}` - the preceding token must be repeated exactly X times
- `{X,}` - the preceding token must be repeated at least X times
- `{X,Y}` - the preceding token must be repeated between X and Y times

We have shortcuts for the most common multiplicities:
- `*` = `{0,}`
- `+` = `{1,}`
- `?` = `{0,1}`
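
The multiplicities and their shortcuts can be tried out as below. This is only a sketch: `candidates` and the helper `matching` are hypothetical names introduced for the illustration, and `^`/`$` are used so that the whole candidate has to match.

```python
import re

# Hypothetical candidate strings: one to four repetitions of "a".
candidates = ["a", "aa", "aaa", "aaaa"]

def matching(pattern):
    """Return the candidates in which the pattern is found (illustrative helper)."""
    return [c for c in candidates if re.search(pattern, c)]

exactly_two = matching(r"^a{2}$")      # exactly 2 repetitions
at_least_two = matching(r"^a{2,}$")    # 2 or more repetitions
two_to_three = matching(r"^a{2,3}$")   # between 2 and 3 repetitions

# Shortcuts: * is {0,}, + is {1,}, ? is {0,1}
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"\d+", "7 and 42 and 1000")
optional = re.findall(r"colou?r", "color colour")

print(exactly_two, at_least_two, two_to_three)
print(star, plus, optional)
```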

If we want to match a character that has a special meaning in regular expressions, we have to escape it with `\`; for example, `\.` matches a literal dot.
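
A short sketch of why escaping matters, on a made-up sample string. `re.escape` is a standard-library helper that escapes special characters for us; it is not mentioned elsewhere in these notes.

```python
import re

# Hypothetical sample: one real "3.14" and one look-alike.
sample = "3.14 3x14"

unescaped = re.findall(r"3.14", sample)   # "." matches any character here
escaped = re.findall(r"3\.14", sample)    # "\." matches only a literal dot
auto_escaped = re.escape("3.14")          # builds the escaped pattern for us

print(unescaped)     # ['3.14', '3x14']
print(escaped)       # ['3.14']
print(auto_escaped)  # 3\.14
```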

If we want to match any one of several alternative expressions, we can join them with `|`; for example, `cat|dog` matches either `cat` or `dog`.
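
A small sketch of `|` on made-up strings. It also shows that `|` splits the whole pattern unless the alternatives are wrapped in `()`.

```python
import re

# Either alternative may match; "catfish" also contains "cat".
animals = re.findall(r"cat|dog", "cat dog bird catfish")

# Without parentheses the alternatives are "gra" and "ey".
ungrouped = re.findall(r"gra|ey", "gray grey")

# With parentheses only "a" or "e" alternates; group(0) is the whole match.
grouped = [m.group(0) for m in re.finditer(r"gr(a|e)y", "gray grey")]

print(animals)    # ['cat', 'dog', 'cat']
print(ungrouped)  # ['gra', 'ey']
print(grouped)    # ['gray', 'grey']
```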

We can use `()` for grouping, which is useful when dealing with substitutions. Each capturing group defined by `()` lets us reuse that portion of the matched text when replacing it with something else. We can refer to a capturing group with `\X`, where X is the number of the `()` pair, counted from 1 in the order of the opening parentheses.
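
As a minimal sketch of group numbering, assuming a made-up "Last, First" name string:

```python
import re

# Hypothetical input in "Last, First" form.
name = "Lovelace, Ada"

match = re.search(r"(\w+), (\w+)", name)
last, first = match.group(1), match.group(2)   # groups are numbered from 1

# \2 and \1 in the replacement refer to the second and first group.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", name)

print(last, first)  # Lovelace Ada
print(swapped)      # Ada Lovelace
```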

In Python, the `re` module handles regular expressions. [Docs](https://docs.python.org/3/library/re.html)

What is available:
- `.findall()` - returns all non-overlapping matches of the pattern in the string as a list of strings, or as a list of tuples if the pattern has more than one capturing group.
- `.finditer()` - returns an iterator that yields a Match object for each match, one at a time, in contrast to `findall`, which builds the whole list at once; better when dealing with huge texts.
- `.search()` - returns a Match object for the first occurrence anywhere in the string, or `None` if there is no match.
- `.match()` - returns a Match object if zero or more characters at the beginning of the string match the pattern, otherwise `None`.
- `.sub()` - replaces every match of the pattern with something else.
- `.split()` - splits a string at every match of the pattern; similar to the string `.split()` method, but the separator is a regular expression.
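
The cells further down cover `findall`, `finditer` and `sub`. The sketch below illustrates the remaining three functions on made-up strings.

```python
import re

# Hypothetical sample string.
sample = "Order 66 shipped on 2024-01-15"

# search: first occurrence anywhere in the string
first_number = re.search(r"\d+", sample)
print(first_number.group(0), first_number.span())  # 66 (6, 8)

# match: only succeeds at the very beginning of the string
no_match = re.match(r"\d+", sample)      # None, the string starts with "Order"
start_word = re.match(r"\w+", sample)    # matches "Order"
print(no_match, start_word.group(0))

# split: the separator is a pattern (a comma or semicolon with optional spaces)
parts = re.split(r"\s*[,;]\s*", "apple, banana;cherry ,  date")
print(parts)  # ['apple', 'banana', 'cherry', 'date']
```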

In [1]:
import re

In [2]:
text = """To be or not to be.
ab-abc
31-123
00-123
41-422
123123123-123123123123
1a
1b
1c
2a
3b
4a
13a
"""

In [4]:
matches = re.findall(r'\d{2}-\d{3}', text)
matches

['31-123', '00-123', '41-422', '23-123']

In [6]:
matches = re.findall(r'(\d{2})-(\d{3})', text)
matches

[('31', '123'), ('00', '123'), ('41', '422'), ('23', '123')]

In [9]:
for match in re.finditer(r'(\d{2})-(\d{3})', text):
    print(match, match.group(0), match.group(1))

<re.Match object; span=(27, 33), match='31-123'> 31-123 31
<re.Match object; span=(34, 40), match='00-123'> 00-123 00
<re.Match object; span=(41, 47), match='41-422'> 41-422 41
<re.Match object; span=(55, 61), match='23-123'> 23-123 23


In [12]:
new_text = re.sub(r'(\d{2})-(\d{3})', r'\2-\1', text)
print(text)
print(new_text)

To be or not to be.
ab-abc
31-123
00-123
41-422
123123123-123123123123
1a
1b
1c
2a
3b
4a
13a

To be or not to be.
ab-abc
123-31
123-00
422-41
1231231123-23123123123
1a
1b
1c
2a
3b
4a
13a

