# How to use regular expressions

Before moving forward, read the following:

- [Predefined character sets >>>](101-predefined-character-set.md)
- [Special Metacharacters >>>](102-special-metacharacter.md)


In [1]:
# importing regular expression module
import re

# Using regular expressions

The `re` module provides an interface to the regular expression engine, allowing you to compile REs into objects and then perform matches with them.


### Compiling Regular Expressions

Regular expressions are compiled into `pattern objects`, which have methods for various operations such as searching for pattern matches or performing string substitutions.

In [2]:
p = re.compile('ab*')
p

re.compile(r'ab*', re.UNICODE)

The RE is passed to `re.compile()` as a string.
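As a minimal sketch of what a compiled pattern object offers, the snippet below reuses the `ab*` pattern from above for one search and one substitution. The sample strings and the variable name `pattern` are made up for illustration.

```python
import re

# Compile once; the resulting pattern object can be reused many times.
pattern = re.compile(r"ab*")

# The original pattern string is kept on the object.
print(pattern.pattern)                     # ab*

# Searching: find the first place where the RE matches.
print(pattern.search("cabbb").group())     # abbb

# Substitution: replace every match with "X".
print(pattern.sub("X", "abb cab a"))       # X cX X
```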

### Using raw strings

In the example above, the regular expression was passed as a normal string. In practice, REs are usually written as *raw strings*. Let's see why with an example.

Suppose we want to use an escape sequence such as "\n" in an RE. In a normal string literal, Python itself processes the backslash before the regular expression engine sees the pattern, so "\n" reaches the engine as an actual newline character rather than as a backslash followed by "n". For "\n" the pattern happens to match the same thing, but for other sequences the meaning changes (for example, "\b" becomes a backspace instead of a word boundary). To avoid this, we use a raw string: a raw string does not treat the backslash specially, so "\n" stays as the two characters backslash and "n".

To create a raw string, prefix the string literal with "r": r"hello\nraw string" is a simple raw string. If you print it, the "\n" is not converted into a line break.

In [18]:
print(r"Hello\nraw string")

Hello\nraw string


But if you remove the 'r' and print it, the "\n" is interpreted as a newline.

In [19]:
print("Hello\nraw string")

Hello
raw string
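The difference matters most for sequences that mean one thing to Python and another to the RE engine. The sketch below uses `\b` as an illustration (the sample text and variable names are invented for this example): in a normal string it is a backspace character, while in a raw string it reaches the engine as a word boundary.

```python
import re

# "\b" is one character (backspace); r"\b" is two (backslash and "b").
print(len("\b"), len(r"\b"))               # 1 2

text = "cat concat cats"

# Normal string: the pattern contains backspace characters, so nothing matches.
normal_hits = re.findall("\bcat\b", text)
print(normal_hits)                         # []

# Raw string: the engine sees \b as a word boundary and finds the whole word.
raw_hits = re.findall(r"\bcat\b", text)
print(raw_hits)                            # ['cat']
```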


### Performing matches

- **match()**: Determine if the RE matches at the beginning of the string.
- **search()**: Scan through a string, looking for any location where this RE matches.
- **findall()**: Find all substrings where the RE matches, and return them as a list.
- **finditer()**: Find all substrings where the RE matches, and return them as an iterator.


`match()` and `search()` return `None` if no match can be found. If they're successful, a `match object` instance is returned, containing information about the match: where it starts and ends, the substring it matched, and more.
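To make the difference between `match()` and `search()` concrete, here is a small sketch. The input string `"123 tempo"` and the variable names are chosen only for illustration.

```python
import re

letters = re.compile(r"[a-z]+")
sample = "123 tempo"

# match() only tries at position 0, where the string starts with digits.
at_start = letters.match(sample)
print(at_start)                            # None

# search() scans forward until it finds a location that matches.
anywhere = letters.search(sample)
print(anywhere.group())                    # tempo

# Because the result may be None, test it before using it.
if at_start is None:
    print("no match at the beginning")
```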

In [3]:
p = re.compile('[a-z]+')
p 

re.compile(r'[a-z]+', re.UNICODE)

In [4]:
# trying to match empty string

p.match("")

In [5]:
# matching a string `tempo`

m = p.match('tempo')
m

<re.Match object; span=(0, 5), match='tempo'>

### Querying match object

Match object instances have several methods and attributes; the most important ones are:

- `group()`: Return the substring matched by the RE.
- `start()`: Return starting position of the match.
- `end()`: Return the ending position of the match.
- `span()`: Return a tuple containing the (start, end) positions of the match.

In [6]:
m.group()

'tempo'

In [7]:
m.start(), m.end()

(0, 5)

In [8]:
m.span()

(0, 5)
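With `match()` the start position is always 0, so the positions are easier to appreciate with `search()`. The following sketch uses an invented input string, `"::: message"`, and shows that `start()` and `end()` can be used to slice the original string.

```python
import re

source = "::: message"
found = re.compile(r"[a-z]+").search(source)

print(found.group())                       # message
print(found.start(), found.end())          # 4 11
print(found.span())                        # (4, 11)

# Slicing the original string with the span gives back the matched text.
print(source[found.start():found.end()])   # message
```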

### Two pattern matching methods

`findall()` returns a list of matching strings. It has to create the entire list before it can be returned as the result.

In [9]:
p = re.compile(r"\d+")
p.findall("12 drummers drumming, 11 pipers piping, 10 lords a leaping")

['12', '11', '10']

The `finditer()` method returns a sequence of `match object` instances as an iterator.

In [10]:
iterator = p.finditer("12 drummers drumming, 11 pipers piping, 10 lords a leaping")
iterator

<callable_iterator at 0x1ef2b5671c0>

In [11]:
for match in iterator:
    print(match.span())

(0, 2)
(22, 24)
(40, 42)
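Since each item produced by `finditer()` is a full match object, the matched text and its position can be read together. A short sketch using the same sentence as above (the names `digits`, `line`, and `pairs` are only illustrative):

```python
import re

digits = re.compile(r"\d+")
line = "12 drummers drumming, 11 pipers piping, 10 lords a leaping"

# Collect the matched text and its (start, end) span for every match.
pairs = [(m.group(), m.span()) for m in digits.finditer(line)]

for number, span in pairs:
    print(number, span)
# 12 (0, 2)
# 11 (22, 24)
# 10 (40, 42)
```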


### Module-level functions

`re` also provides top-level functions called `match()`, `search()`, `findall()`, `sub()`, and so forth. These functions take the same arguments as the corresponding pattern method, with the RE string added as the first argument, and still return either `None` or a *match object* instance.

In [21]:
print(re.match(r"From\s+", "Fromage amk"))

None


In [22]:
re.match(r"From\s+", "From aml Thu May 14 19:12:10 1998")

<re.Match object; span=(0, 5), match='From '>

Under the hood, these functions simply create a pattern object for you and call the appropriate method on it.
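A quick way to convince yourself of this is to run the same RE both ways and compare the results. The sketch below reuses the string from the cell above; the variable names are illustrative.

```python
import re

header = "From aml Thu May 14 19:12:10 1998"

# Module-level function: the RE string is the first argument.
via_module = re.match(r"From\s+", header)

# Equivalent two-step form: compile first, then call the method.
via_pattern = re.compile(r"From\s+").match(header)

print(via_module.span() == via_pattern.span())     # True
print(via_module.group() == via_pattern.group())   # True

# Other module-level functions follow the same calling convention.
numbers = re.findall(r"\d+", header)
print(numbers)                             # ['14', '19', '12', '10', '1998']
```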

### Compilation Flags

Compilation flags let you modify some aspects of how regular expressions work. Flags are available under two names:

1. A long name, such as `IGNORECASE`.
2. A short name, such as `I`.
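As a small illustration of a single flag, the sketch below compiles the `[a-z]+` pattern used earlier with and without `re.IGNORECASE`. The test word `"Tempo"` and the variable names are made up for this example.

```python
import re

# The long and short names refer to the same flag.
same_flag = (re.I == re.IGNORECASE)
print(same_flag)                           # True

# Without the flag, the uppercase "T" is not in [a-z], so match() fails.
strict = re.compile(r"[a-z]+").match("Tempo")
print(strict)                              # None

# With the flag, letters are matched regardless of case.
relaxed = re.compile(r"[a-z]+", re.IGNORECASE).match("Tempo")
print(relaxed.group())                     # Tempo
```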

Multiple flags can be specified by bitwise 