## Understanding the Regular Expression Syntax

A regex pattern is a sequence of characters. A pattern is built from two kinds of components:

- **literals (ordinary characters)**: these characters carry no special meaning and are matched as is.

- **metacharacters (special characters)**: these characters carry a special meaning and are processed in a special way.


![](images/components.png)


This expression, `file\d+\.txt`, can be understood as follows:

- `file` is a run of literals, which are matched against the input as is.

- `\d` is a metacharacter which instructs the software to match this position with a digit (0-9).

- `+` is also a metacharacter; it instructs the software to match one or more repetitions of the preceding element (`\d` in this case).

- `\.` matches a literal dot. On its own, `.` is a metacharacter, but we want to use it as a literal in this case. Hence, we escape it using the `\` character.

- `txt` is a run of literals, which are matched against the input as is.

![](images/example1.png)
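As a rough illustration, the pattern described above can be tried against a few filenames. This is only a sketch: the sample filenames are made up for the example, the pattern is written as a raw string so that the backslashes reach the regex engine unchanged, and `fullmatch()` is used so that the whole string has to match.

```python
import re

# The pattern discussed above, written as a raw string
pattern = re.compile(r"file\d+\.txt")

# Hypothetical sample inputs, chosen only for illustration
samples = ["file1.txt", "file42.txt", "file.txt", "file1xtxt"]

# fullmatch() succeeds only if the entire string matches the pattern
results = {name: bool(pattern.fullmatch(name)) for name in samples}
print(results)
```

`file.txt` is rejected because `+` needs at least one digit, and `file1xtxt` is rejected because `\.` only accepts a literal dot.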

In [186]:
import re
from utils import highlight_regex_matches

## 1. Compiling Regular Expressions

Regular expressions are **compiled** into `Pattern` objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.


### `re.compile(pattern, flags=0)`

Compile a regular expression pattern, returning a pattern object.

- The regular expression is passed to `re.compile()` as a **string**. 

> Regular expressions are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them. 

> Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C.

In [17]:
pattern = re.compile("hello", flags=re.I)

In [18]:
pattern

re.compile(r'hello', re.IGNORECASE|re.UNICODE)

## 2. Performing Matches

So far, we have used `re.compile()` to create a `Pattern` object representing a compiled regular expression.

Here are the methods a `Pattern` object offers for performing matches:


<table style="border: 1px solid black; font-size:15px;">
<thead>
    <th>Method/Attribute</th>
    <th>Purpose</th>
</thead>
    
<tbody>
<tr>
    <td>match()</td>
    <td>Determine if the RE matches at the beginning of the string.</td>
</tr>
    
<tr>
    <td>search()</td>
    <td>Scan through a string, looking for any location where this RE matches.</td>
</tr>

<tr>
    <td>findall()</td>
    <td>Find all substrings where the RE matches, and return them as a list.</td>
</tr>

<tr>
    <td>finditer()</td>
    <td>Find all substrings where the RE matches, and return them as an iterator.</td>
</tr>
</tbody>
</table>

Let us go through them one by one:

### `match(string[, pos[, endpos]])`

- A match is attempted only at the beginning of the string (or at index `pos`, if given).

- Checking starts from the `pos` index of the string (default is 0).

- Checking stops at the `endpos` index of the string, as if the string were only `endpos` characters long. By default, the whole string is considered.

- Returns `None` if no match found.

- If a match is found, a `Match` object is returned, containing information about the match: where it starts and ends, the substring it matched, and more.

In [19]:
pattern = re.compile("hello")

In [20]:
match = pattern.match("hello world")

In [21]:
match.span()

(0, 5)

In [22]:
match.start()

0

In [23]:
match.end()

5

In [24]:
pattern.match("say hello", pos=4) is None

False

In [25]:
pattern.match("hello", endpos=4) is None

True

### `search(string[, pos[, endpos]])`

- A match is looked for throughout the string, and the first one found is returned.

- Same behaviour of `pos` and `endpos` as the `match()` function.

- Returns `None` if no match found.

- If a match is found, a `Match` object is returned.

In [26]:
pattern.search("say hello")

<re.Match object; span=(4, 9), match='hello'>

In [27]:
pattern.search("say hello hello")

<re.Match object; span=(4, 9), match='hello'>

### `findall(string[, pos[, endpos]])`

- Finds **all non-overlapping substrings** where the match is found, and returns them as a list.

- Same behaviour of `pos` and `endpos` as in the `match()` and `search()` methods.

In [28]:
pattern.findall("say hello hello")

['hello', 'hello']

### `finditer(string[, pos[, endpos]])`

- Finds **all non-overlapping substrings** where the match is found, and returns them as an iterator of the `Match` objects.

- Same behaviour of `pos` and `endpos` as in the `match()`, `search()` and `findall()` methods.

In [29]:
matches = pattern.finditer("say hello hello")
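The iterator returned by `finditer()` is usually consumed with a loop or a comprehension. A small sketch, reusing the same `hello` pattern and input as above:

```python
import re

pattern = re.compile("hello")
matches = pattern.finditer("say hello hello")

# Each item produced by the iterator is a Match object
spans = [(m.group(), m.span()) for m in matches]
print(spans)  # [('hello', (4, 9)), ('hello', (10, 15))]
```

Unlike the list returned by `findall()`, the iterator can be consumed only once.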

In [30]:
pattern = re.compile("\$15")

In [31]:
pattern.search(txt)

In regular expressions, there are twelve metacharacters that should be escaped if they are to be used with their literal meaning:

- Backslash `\`
- Caret `^`
- Dollar sign `$`
- Dot `.`
- Pipe symbol `|`
- Question mark `?`
- Asterisk `*`
- Plus sign `+`
- Opening parenthesis `(`
- Closing parenthesis `)`
- Opening square bracket `[`
- The opening curly brace `{`
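To see why escaping matters, here is a minimal sketch using the dollar sign. The sample text is an assumption made up for this example.

```python
import re

text = "The price is $15 only"  # hypothetical sample text

# Unescaped, `$` is an end-of-string anchor, so "$15" can never match here
unescaped = re.search("$15", text)

# Escaped with a backslash, `$` is matched as a literal character
escaped = re.search(r"\$15", text)

# re.escape() adds the required backslashes for us
auto = re.search(re.escape("$15"), text)

print(unescaped)                    # None
print(escaped.span(), auto.span())  # (13, 16) (13, 16)
```

`re.escape()` is convenient when the text to search for comes from user input or another variable and should be matched literally.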

# Character Classes

- A **character class** (also known as a **character set**) matches a single character, which can be any one of the characters listed in the set.


- To define a character class, we use the opening square bracket metacharacter `[`, then the accepted characters, and finally a closing square bracket `]`.

### Example 1

Consider the example below, where the spellings `license` and `licence` have been mixed up and we want to find all occurrences of either spelling in the text.

In [32]:
txt = """
Yesterday, I was driving my car without a driving licence. The traffic police stopped me and asked me for my 
license. I told them that I forgot my licence at home. 
"""

In [33]:
pattern = re.compile("licen[cs]e")

In [34]:
pattern.findall(txt)

['licence', 'license', 'licence']

![](images/example2.png)

# Character Set Range

> A character class can also contain a range of characters. A range is written with a hyphen (`-`) between its two endpoints; for example, to match any lowercase letter we can use `[a-z]`. Likewise, to match any single digit we can define the character set `[0-9]`.

Let us consider an example in which we want to retrieve all the years from the given text.

In [35]:
txt = """
The first season of Indian Premiere League (IPL) was played in 2008. 
The second season was played in 2009 in South Africa. 
Last season was played in 2018 and won by Chennai Super Kings (CSK).
CSK won the title in 2010 and 2011 as well.
Mumbai Indians (MI) has also won the title 3 times in 2013, 2015 and 2017.
"""

In [36]:
pattern = re.compile("[1-9][0-9][0-9][0-9]")

In [37]:
pattern.findall(txt)

['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017']

> There is another possibility—the negation of ranges. We can invert the meaning
of a character set by placing a caret (`^`) symbol right after the opening square
bracket metacharacter (`[`).

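For example, to find all the characters used in a text except lowercase vowels, we can use the pattern below. Uppercase vowels such as `I` and `A` still match, because only `a`, `e`, `i`, `o` and `u` are excluded: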

In [38]:
pattern = re.compile("[^aeiou]")

In [39]:
pattern.findall(txt)

['\n', 'T', 'h', ' ', 'f', 'r', 's', 't', ' ', 's', 's', 'n', ' ', 'f', ' ', 'I', 'n', 'd', 'n', ' ', ...]


# Predefined Character Classes

There exist some predefined character classes which can be used as a shortcut for some frequently used classes.


<table style="border: 1px solid black; font-size:15px;">
<thead>
    <th>Element</th>
    <th>Description</th>
</thead>
    
<tbody>
<tr>
    <td>.</td>
    <td>This element matches any character except newline</td>
</tr>

<tr>
    <td>\d</td>
    <td>This matches any decimal digit; with the re.ASCII flag, this is equivalent to the class [0-9] (by default, digits from other scripts match too)</td>
</tr>

<tr>
    <td>\D</td>
    <td>This matches any non-digit character; with the re.ASCII flag, this is equivalent to the class [^0-9]</td>
</tr>

<tr>
    <td>\s</td>
    <td>This matches any whitespace character; with the re.ASCII flag, this is equivalent to the class
[ \t\n\r\f\v]</td>
</tr>

<tr>
    <td>\S</td>
    <td>This matches any non-whitespace character; with the re.ASCII flag, this is equivalent to the class
[^ \t\n\r\f\v]</td>
</tr>

<tr>
    <td>\w</td>
    <td>This matches any alphanumeric character or the underscore; with the re.ASCII flag, this is equivalent to the class
[a-zA-Z0-9_]</td>
</tr>
    
<tr>
    <td>\W</td>
    <td>This matches any character that is neither alphanumeric nor the underscore; with the re.ASCII flag, this is equivalent to the
class [^a-zA-Z0-9_]</td>
</tr>
</tbody>
</table>
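The `[0-9]` equivalence for `\d` holds exactly only for ASCII-only matching. The sketch below illustrates the difference on a made-up string containing one ASCII digit and one Devanagari digit.

```python
import re

# Hypothetical sample: an ASCII digit and a Devanagari digit (U+0969)
text = "3 \u0969"

unicode_digits = re.findall(r"\d", text)            # default Unicode matching
ascii_digits = re.findall(r"\d", text, flags=re.A)  # ASCII-only matching

print(unicode_digits)  # ['3', '३']
print(ascii_digits)    # ['3']
```

When the input is known to be plain ASCII, the two behave identically, so the shortcut classes can be used as drop-in replacements.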


Now, we can improve our pattern to find years in a given text a bit:

In [40]:
pattern = re.compile("[1-9]\d\d\d")

In [41]:
pattern.findall(txt)

['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017']

Let us try to find out all special symbols (non-alphanumeric, non-whitespace characters) in our text now.

In [42]:
re.findall("[^\w\s]", txt)

['(', ')', '.', '.', '(', ')', '.', '.', '(', ')', ',', '.']

# The Backslash Plague

Consider a text containing some Windows-style directory paths in which we have to find the substring `C:\Windows\System32`.

In [43]:
txt = """
C:\Windows
C:\Python
C:\Windows\System32
"""

In [44]:
pattern = re.compile("C:\Windows\System32")

In [45]:
pattern.search(txt)

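### Why are no matches found for the above pattern?

The regex engine treats `\` as a metacharacter: it reads `\W` and `\S` as the predefined classes for non-alphanumeric and non-whitespace characters, whereas we intend each `\` to be a literal backslash. To match a literal backslash, the regex engine must receive `\\`. Writing patterns as raw strings (`r"..."`) avoids a second layer of escaping at the Python string level, and also avoids the warning that recent Python versions emit for unrecognized escapes such as `"\W"` in ordinary string literals.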

In [46]:
pattern = re.compile("C:\\\Windows\\\System32")

In [47]:
pattern.search(txt)

<re.Match object; span=(22, 41), match='C:\\Windows\\System32'>

In [48]:
pattern = re.compile(r"C:\\Windows\\System32")

In [49]:
pattern.search(txt)

<re.Match object; span=(22, 41), match='C:\\Windows\\System32'>

In [50]:
re.escape("C:\Windows\System32")

'C:\\\\Windows\\\\System32'

In [51]:
re.search(re.escape("C:\Windows\System32"), txt)

<re.Match object; span=(22, 41), match='C:\\Windows\\System32'>

# Alternation

Just like character classes are used to match a single character out of several possible characters, **alternation** is used to match a single regular expression out of several possible regular expressions.

This is accomplished using the pipe symbol `|`.

Consider a scenario where you want to find all occurrences of `and`, `or`, `the` in a given text.

> One way is to write and execute 3 separate regular expressions. Using alternation, it can be done in a single regular expression!

In [52]:
txt = """
the most common conjunctions are and, or and but.
"""

In [53]:
pattern = re.compile("and|or|the")

In [54]:
pattern.findall(txt)

['the', 'and', 'or', 'and']

Consider one more example now, in which we want to search for the substrings `What is` and `Who is`.

In [55]:
txt = """
What is your name?
Who is that guy?
"""

In [56]:
pattern = re.compile("What|Who is")

The regex pattern `What|Who is` actually matches the substrings `What` and `Who is`.

To get the desired result, we need to wrap the alternatives in **parentheses**.

In [57]:
pattern = re.compile("(What|Who) is")
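The difference between the two patterns can be checked with a short sketch. It uses `finditer()` with `group()` rather than `findall()`, because `findall()` returns only the group contents when a pattern contains a capturing group.

```python
import re

txt = """
What is your name?
Who is that guy?
"""

# Without grouping, the alternatives are "What" and "Who is"
ungrouped = [m.group() for m in re.finditer(r"What|Who is", txt)]

# With grouping, " is" must follow either alternative
grouped = [m.group() for m in re.finditer(r"(What|Who) is", txt)]

print(ungrouped)  # ['What', 'Who is']
print(grouped)    # ['What is', 'Who is']
```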

# Quantifiers

**Quantifiers** are the mechanisms to define how a **character**, **metacharacter**, or **character set** can be **repeated**.

Here is the list of the 4 basic quantifiers:

<table style="border: 1px solid black; font-size:15px;">
<thead>
    <th>Symbol</th>
    <th>Name</th>
    <th>Quantification of previous character</th>
</thead>
    
<tbody>
<tr>
    <td>?</td>
    <td>Question Mark</td>
    <td>Optional (0 or 1 repetitions)</td>
</tr>
    
<tr>
    <td>*</td>
    <td>Asterisk</td>
    <td>Zero or more times</td>
</tr>

<tr>
    <td>+</td>
    <td>Plus Sign</td>
    <td>One or more times</td>
</tr>

<tr>
    <td>{n,m}</td>
    <td>Curly Braces</td>
    <td>Between n and m times</td>
</tr>
</tbody>
</table>


Let us go through different examples to understand them one by one.

### Example 1

Find all the matches for `dog` and `dogs` in the given text.

In [58]:
txt = """
I have 2 dogs. One dog is 1 year old and other one is 2 years old. Both dogs are very cute! 
"""

In [59]:
pattern = re.compile("dogs?")

In [60]:
pattern.findall(txt)

['dogs', 'dog', 'dogs']

### Example 2

Find all filenames starting with `file` and ending with `.txt` in the given text.

In [61]:
txt = """
file1.txt
file_one.txt
file.txt
fil.txt
file.xml
file-1.txt
"""

In [62]:
pattern = re.compile("file[\w-]*\.txt")

In [63]:
pattern.findall(txt)

['file1.txt', 'file_one.txt', 'file.txt', 'file-1.txt']

### Example 3

Find all filenames starting with `file` followed by 1 or more digits and ending with `.txt` in the given text.

In [64]:
txt = """
file1.txt
file_one.txt
file09.txt
fil.txt
file23.xml
file.txt
"""

In [65]:
pattern = re.compile("file\d+\.txt")

In [66]:
pattern.findall(txt)

['file1.txt', 'file09.txt']

Example 3 could also be written with the curly-brace syntax (`\d{1,}` instead of `\d+`). The curly-brace syntax has the following variants:

<table style="border: 1px solid black; font-size:15px;">
<thead>
    <th>Syntax</th>
    <th>Description</th>
</thead>
    
<tbody>
<tr>
    <td>{n}</td>
    <td>The previous character is repeated exactly n times.</td>
</tr>
    
<tr>
    <td>{n,}</td>
    <td>The previous character is repeated at least n times.</td>
</tr>

<tr>
    <td>{,n}</td>
    <td>The previous character is repeated at most n times.</td>
</tr>

<tr>
    <td>{n,m}</td>
    <td>The previous character is repeated between n and m times (both inclusive).</td>
</tr>
</tbody>
</table>

### Example 4

Find years in the given text.


In [67]:
txt = """
The first season of Indian Premiere League (IPL) was played in 2008. 
The second season was played in 2009 in South Africa. 
Last season was played in 2018 and won by Chennai Super Kings (CSK).
CSK won the title in 2010 and 2011 as well.
Mumbai Indians (MI) has also won the title 3 times in 2013, 2015 and 2017.
"""

In [68]:
pattern = re.compile("\d{4}")

In [69]:
pattern.findall(txt)

['2008', '2009', '2018', '2010', '2011', '2013', '2015', '2017']

### Example 5

In the given text, find all numbers that have 4 or more digits.

In [70]:
txt = """
123143
432
5657
4435
54
65111
"""

In [71]:
pattern = re.compile("\d{4,}")

In [72]:
re.findall(pattern, txt)

['123143', '5657', '4435', '65111']

### Example 6

Write a pattern to validate telephone numbers.

Telephone numbers can be of the form: `555-555-5555`, `555 555 5555`, `5555555555`

In [73]:
txt = """
555-555-5555
555 555 5555
5555555555
"""

In [74]:
pattern = re.compile("\d{3}[-\s]?\d{3}[-\s]?\d{4}")

In [75]:
pattern.findall(txt)

['555-555-5555', '555 555 5555', '5555555555']
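`findall()` finds phone numbers anywhere in a text. To validate a single string, one option is `fullmatch()`, which succeeds only if the entire string matches. The sketch below assumes that approach; `is_valid_phone` is a hypothetical helper name introduced for this example.

```python
import re

phone = re.compile(r"\d{3}[-\s]?\d{3}[-\s]?\d{4}")

def is_valid_phone(candidate):
    """Return True if the whole string is a phone number (hypothetical helper)."""
    return phone.fullmatch(candidate) is not None

candidates = ["555-555-5555", "555 555 5555", "5555555555",
              "555-5555", "call 555-555-5555"]
checks = {s: is_valid_phone(s) for s in candidates}
print(checks)
```

Note that this pattern also accepts mixed separators such as `555-555 5555`, since each separator is matched independently.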

# Greedy Behaviour

In [76]:
txt = """<html><head><title>Title</title>"""

In [77]:
pattern = re.compile("<.*>")

In [78]:
pattern.findall(txt)

['<html><head><title>Title</title>']

In the above example, one may expect to get 4 matches, i.e. `<html>`, `<head>`, `<title>` and `</title>`. Instead, we get the longest match, i.e. `<html><head><title>Title</title>`.

This particular behaviour (to find the longest match) is called **greedy** behaviour.

> Quantifiers are greedy by default. A greedy quantifier tries to match as much as possible, producing the longest match it can.

# Non-Greedy behaviour

The **non-greedy** (or **reluctant**) behaviour can be requested by adding an extra question mark to the quantifier.

For example, `??`, `*?` or `+?`. 

> A quantifier marked as reluctant behaves in the opposite way to a greedy one: it tries to match as little as possible.

In [79]:
pattern = re.compile("<.*?>")

In [80]:
pattern.findall(txt)

['<html>', '<head>', '<title>', '</title>']
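The same idea applies to the other reluctant quantifiers. A minimal sketch on a made-up string of digits:

```python
import re

digits = "12345"  # hypothetical sample input

greedy = re.match(r"\d+", digits).group()      # takes as many digits as possible
lazy_plus = re.match(r"\d+?", digits).group()  # takes as few as possible (at least one)
lazy_opt = re.match(r"\d??", digits).group()   # prefers zero repetitions

print(repr(greedy), repr(lazy_plus), repr(lazy_opt))  # '12345' '1' ''
```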

# Boundary Matchers

Consider a scenario where you want to find all occurrences of the words `and`, `or` and `the` in the given text.

In [81]:
txt = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book. 
It has survived not only five centuries, but also the leap into electronic typesetting, 
remaining essentially unchanged. 
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, 
and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
"""

In [82]:
pattern = re.compile("and|or|the")

In [83]:
pattern.findall(txt)

['or',
 'the',
 'and',
 'or',
 'the',
 'and',
 'the',
 'and',
 'the',
 'the',
 'the',
 'or',
 'and',
 'or',
 'or']

In [187]:
highlight_regex_matches(pattern, txt)



### What is the solution?

The pattern above also matches inside longer words, for example the `or` in `Lorem` and the `and` in `standard`. The solution is to use this pattern:

`\b(and|or|the)\b`

where `\b` is a metacharacter that matches at a position that is called a **word boundary**. 

Such identifiers that correspond to a particular position inside of the input are called **Boundary Matchers**.

**Note:** Since `\b` is also an escape sequence for strings in Python (the backspace character), we need to escape it using `\`, i.e. `\\b`, or write the pattern as a raw string, i.e. `r"\b"`, in order to treat it like a metacharacter for regex matching.

In [85]:
pattern = re.compile("\\b(and|or|the)\\b")

In [188]:
highlight_regex_matches(pattern, txt)



Here is a table which shows the list of all boundary matchers available in Python:

<table style="border: 1px solid black; font-size:15px;">
<thead>
    <th>Matcher</th>
    <th>Description</th>
</thead>
    
<tbody>
<tr>
    <td>^</td>
    <td>Matches at the beginning of the input (and, with re.MULTILINE, at the beginning of each line)</td>
</tr>
    
<tr>
    <td>$</td>
    <td>Matches at the end of the input or just before a trailing newline (and, with re.MULTILINE, at the end of each line)</td>
</tr>

<tr>
    <td>\b</td>
    <td>Matches a word boundary</td>
</tr>

<tr>
    <td>\B</td>
    <td>Matches the opposite of \b, i.e. any position that is not a word boundary</td>
</tr>

<tr>
    <td>\A</td>
    <td>Matches the beginning of the input</td>
</tr>

<tr>
    <td>\Z</td>
    <td>Matches the end of the input</td>
</tr>
</tbody>
</table>
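A few of these matchers can be compared side by side. The following is a small sketch on a made-up two-line string, with raw strings used so that `\b` reaches the regex engine as a word boundary.

```python
import re

text = "the other\nthe end"  # hypothetical sample text

whole_words = re.findall(r"\bthe\b", text)   # 'the' as a standalone word
inside_words = re.findall(r"\Bthe\B", text)  # 'the' strictly inside a longer word

# \A matches only at the start of the input, even with re.MULTILINE
string_start = re.findall(r"\Athe", text, flags=re.M)
# ^ matches at the start of every line when re.MULTILINE is set
line_starts = re.findall(r"^the", text, flags=re.M)

print(whole_words)                          # ['the', 'the']
print(inside_words)                         # ['the']
print(len(string_start), len(line_starts))  # 1 2
```

The single `\B` match is the `the` inside `other`.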

### Example 1

Consider a scenario where we want to find all the lines in the given text which **start** with the pattern `Name:`.

In [87]:
txt = """
Name:
Age: 0
Roll No.: 15
Grade: S

Name: Ravi
Age: -1
Roll No.: 123 Name: ABC
Grade: K

Name: Ram
Age: N/A
Roll No.: 1
Grade: G
"""

In [88]:
pattern = re.compile("^Name: \w+", flags=re.M)

In [89]:
pattern.findall(txt)

['Name: Ravi', 'Name: Ram']

> `re.M` (short for `re.MULTILINE`) is a flag that makes the begin/end matchers (`^`, `$`) match at the start and end of each line, not just of the whole input.

### Example 2

Find all the sentences which do not end with a full stop (`.`) in the given text.

In [90]:
txt = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s!
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages
More recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."""

In [91]:
pattern = re.compile("^.*[^\.]$", flags=re.M)

In [92]:
pattern.findall(txt)

["Lorem Ipsum has been the industry's standard dummy text ever since the 1500s!",
 'It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages']

In [189]:
highlight_regex_matches(pattern, txt)



# Split using RegEx

> Almost every language offers a split operation on strings. The difference with the split in the `re` module is that the delimiter can be a regex, so the string is split at each match of the pattern.

### `split(string[, maxsplit])`

- Every pattern object has a `split()` method which splits the input string at all positions where a match is found.

- `maxsplit` is an optional argument (default value 0) which specifies the max no. of splits that can take place. `0` value means there is no limit on the no. of splits.

- The matched text is not included in any of the substrings obtained after splitting (unless the pattern contains capturing groups).

#### Example 1

Let us try to split a string to get individual lines in it.

In [94]:
txt = """Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated."""

In [95]:
pattern = re.compile("\n")

In [96]:
pattern.split(txt)

['Beautiful is better than ugly.',
 'Explicit is better than implicit.',
 'Simple is better than complex.',
 'Complex is better than complicated.']

#### Example 2

Let us try one more example in which we want to get all the words in the given text.

In [97]:
pattern = re.compile("\W")

In [98]:
list(filter(lambda x: x!= '',  pattern.split(txt))) 

['Beautiful',
 'is',
 'better',
 'than',
 'ugly',
 'Explicit',
 'is',
 'better',
 'than',
 'implicit',
 'Simple',
 'is',
 'better',
 'than',
 'complex',
 'Complex',
 'is',
 'better',
 'than',
 'complicated']

#### Example 3

What if we want only the first 3 words? We need to split only 3 times in this case, which can be done by setting the value of `maxsplit` to 3. The last element of the result then holds the unsplit remainder of the text.

In [99]:
pattern.split(txt, maxsplit=3)

['Beautiful',
 'is',
 'better',
 'than ugly.\nExplicit is better than implicit.\nSimple is better than complex.\nComplex is better than complicated.']
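Two variations on splitting are worth a quick sketch: using `\W+` so that a run of separators counts as one delimiter, and using a capturing group to keep the separators in the result. The sample strings are assumptions made up for this example.

```python
import re

text = "Simple is better, than complex"  # hypothetical sample text

# \W+ treats a run of separators (", " here) as a single delimiter
words = re.split(r"\W+", text)

# A capturing group in the pattern keeps the separators in the result
with_separators = re.split(r"(,)", "a,b,c")

print(words)            # ['Simple', 'is', 'better', 'than', 'complex']
print(with_separators)  # ['a', ',', 'b', ',', 'c']
```

If the text starts or ends with a separator, `\W+` still produces an empty string at that end, so some filtering may remain necessary.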

# Substitution

Now, we are going to look at a method which will replace all the **leftmost non-overlapping occurrences** of a pattern in a given string and return the new string as result.

### `sub(repl, string[, count=0])`

- `repl` is the replacement string which gets substituted in the place of match

- `string` is the input text on which substitution takes place.

- `count` is an optional argument (default is 0) which specifies the max no. of substitutions that can take place.  0 means there is no limit on substitution count.


Let us consider a case where we want to replace all occurrences of numbers with a `-` in the given text.

In [100]:
txt = "100 cats, 23 dogs, 3 rabbits"

In [101]:
pattern = re.compile("\d+")
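With the pattern compiled, the substitution itself might look like the sketch below. It also shows the `count` argument and a function passed as `repl`; the doubling function is purely illustrative.

```python
import re

txt = "100 cats, 23 dogs, 3 rabbits"
pattern = re.compile(r"\d+")

replaced_all = pattern.sub("-", txt)           # replace every number
replaced_one = pattern.sub("-", txt, count=1)  # replace only the first number

# repl may also be a function that receives each Match object
doubled = pattern.sub(lambda m: str(int(m.group()) * 2), txt)

print(replaced_all)  # - cats, - dogs, - rabbits
print(replaced_one)  # - cats, 23 dogs, 3 rabbits
print(doubled)       # 200 cats, 46 dogs, 6 rabbits
```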

### `subn(repl, string[, count=0])`

- Returns the substituted string as well as the no. of substitutions.

- Can be thought of as a utility function over `sub()`.

In [102]:
pattern.subn("-", txt)

('- cats, - dogs, - rabbits', 3)

# Compilation Flags

- When compiling a pattern string into a pattern object, it's possible to **modify the standard behavior of the patterns** using **Compilation Flags**.

- Multiple compilation flags can be combined using the bitwise OR "|".
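As a quick sketch of combining flags, the pattern below is compiled with both `re.I` and `re.M`. The sample text is made up for the example.

```python
import re

text = "The cat\nthe dog\nA bird"  # hypothetical sample text

# re.I and re.M combined: case-insensitive, and ^ matches at every line start
pattern = re.compile(r"^the", flags=re.I | re.M)
found = pattern.findall(text)

print(found)  # ['The', 'the']
```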

Here is a list of all the compilation flags:

<table style="border: 1px solid black; font-size:15px;">
<thead>
    <th>Syntax</th>
    <th>Meaning</th>
</thead>
    
<tbody>
<tr>
    <td>re.IGNORECASE or re.I</td>
    <td>ignore case.</td>
</tr>

<tr>
    <td>re.MULTILINE or re.M</td>
    <td>make begin/end boundary matchers (^, $) consider each line.</td>
</tr>

<tr>
    <td>re.DOTALL or re.S</td>
    <td>make . match newline too.</td>
</tr>

<tr>
    <td>re.UNICODE or re.U</td>
    <td>make {\w, \W, \b, \B} follow Unicode rules.</td>
</tr>

<tr>
    <td>re.LOCALE or re.L</td>
    <td>make {\w, \W, \b, \B} follow locale.</td>
</tr>

<tr>
    <td>re.ASCII or re.A</td>
    <td>make {\w, \W, \b, \B} perform ASCII-only matching.</td>
</tr>

<tr>
    <td>re.VERBOSE or re.X</td>
    <td>allow whitespace and comments in regex.</td>
</tr>

<tr>
    <td>re.DEBUG</td>
    <td>get information about the compilation pattern.</td>
</tr>
</tbody>
</table>

Let's go through each one of them one by one.

## 1. re.IGNORECASE or re.I

This flag makes a regex pattern case-insensitive.


Let's check out an example to find all occurrences of `the` and `The` in the given text.

In [103]:
txt = """
The best thing about regex is that it makes the task of string manipulation so easy.
"""

In [104]:
pattern = re.compile("the", flags=re.I)

In [105]:
pattern

re.compile(r'the', re.IGNORECASE|re.UNICODE)

In [190]:
highlight_regex_matches(pattern, txt)



## 2. re.MULTILINE or re.M

This flag is used to make begin/end boundary matchers (`^`, `$`) consider each line of the given text.


Let's check out an example to find all lines starting with `A`. 

In [107]:
txt = """
A man was crossing the road.
Suddenly, a car passed before him in a very high speed.
He was terrified
And shocked.
"""

In [108]:
pattern = re.compile("^A.+", flags=re.M)

In [191]:
highlight_regex_matches(pattern, txt)



## 3. re.DOTALL or re.S

The `.` metacharacter matches everything except newline character. If we want to make `.` match newline too, we have to set this flag.
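The effect of the flag can be seen in a minimal sketch on a made-up two-line string:

```python
import re

text = "first line\nsecond line"  # hypothetical sample text

without_dotall = re.search(r"first.+", text).group()           # '.' stops at the newline
with_dotall = re.search(r"first.+", text, flags=re.S).group()  # '.' also matches '\n'

print(repr(without_dotall))  # 'first line'
print(repr(with_dotall))     # 'first line\nsecond line'
```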

Let's consider an example to match all the text after (and including) `car`.

In [110]:
pattern = re.compile("car.+", flags=re.S)

In [192]:
highlight_regex_matches(pattern, txt)



## 4. re.UNICODE or re.U

Using this flag, we can make the pattern characters `{\w, \W, \b, \B}` dependent on the Unicode character properties database.

> re.UNICODE is the default flag in Python 3 regex patterns.

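Let's consider an example where we try to work on Bengali text. With the built-in `re` module, `\w` does not match the combining vowel signs, so the words get broken apart; the third-party `regex` module keeps them together.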

In [112]:
txt = "লাইজ ইজ বিউটিফুল।"

In [113]:
pattern = re.compile("\w+")

In [114]:
pattern.findall(txt)

['ল', 'ইজ', 'ইজ', 'ব', 'উট', 'ফ', 'ল']

In [115]:
import regex
pattern = regex.compile("\w+")

In [116]:
pattern.findall(txt)

['লাইজ', 'ইজ', 'বিউটিফুল']

## 5. re.ASCII or re.A

This flag will make the word pattern `{\w, \W}` and boundary pattern `{\b, \B}` perform ASCII-only matching, i.e. only A-Z, a-z, 0-9 and `_` will be considered word characters. 

Let us see an example below:

In [117]:
chars =  ''.join(chr(i) for i in range(256))

In [118]:
print(chars)

 	
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ


In [119]:
pattern = re.compile("\w")

In [193]:
highlight_regex_matches(pattern, chars)

 	
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ


In [194]:
pattern = re.compile("\w", flags=re.A)

In [195]:
highlight_regex_matches(pattern, chars)

 	
 !"#$%&'()*+,-./[43m[1m0[0m[43m[1m1[0m[43m[1m2[0m[43m[1m3[0m[43m[1m4[0m[43m[1m5[0m[43m[1m6[0m[43m[1m7[0m[43m[1m8[0m[43m[1m9[0m:;<=>?@[43m[1mA[0m[43m[1mB[0m[43m[1mC[0m[43m[1mD[0m[43m[1mE[0m[43m[1mF[0m[43m[1mG[0m[43m[1mH[0m[43m[1mI[0m[43m[1mJ[0m[43m[1mK[0m[43m[1mL[0m[43m[1mM[0m[43m[1mN[0m[43m[1mO[0m[43m[1mP[0m[43m[1mQ[0m[43m[1mR[0m[43m[1mS[0m[43m[1mT[0m[43m[1mU[0m[43m[1mV[0m[43m[1mW[0m[43m[1mX[0m[43m[1mY[0m[43m[1mZ[0m[\]^[43m[1m_[0m`[43m[1ma[0m[43m[1mb[0m[43m[1mc[0m[43m[1md[0m[43m[1me[0m[43m[1mf[0m[43m[1mg[0m[43m[1mh[0m[43m[1mi[0m[43m[1mj[0m[43m[1mk[0m[43m[1ml[0m[43m[1mm[0m[43m[1mn[0m[43m[1mo[0m[43m[1mp[0m[43m[1mq[0m[43m[1mr[0m[43m[1ms[0m[43m[1mt[0m[43m[1mu[0m[43m[1mv[0m[43m[1mw[0m[43m[1mx[0m[43m[1my[0m[43m[1mz[0m{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´

## 6. re.VERBOSE or re.X

This flag changes the regex syntax, to allow you to add annotations in regex. 

- Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash.

- When a line contains a `#` that is neither in a character class nor preceded by an unescaped backslash, all characters from the leftmost such `#` through the end of the line are ignored.

In [123]:
txt = """
This is a sample text123
"""

In [124]:
pattern = re.compile("\w +")

In [125]:
pattern.findall(txt)

['s ', 's ', 'a ', 'e ']

In [126]:
pattern = re.compile("\w +  # find all words", flags=re.X)

In [127]:
pattern.findall(txt)

['This', 'is', 'a', 'sample', 'text123']
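The flag is most useful with multi-line patterns, where each part can carry its own comment. The date pattern and sample text in this sketch are illustrative assumptions.

```python
import re

# Whitespace and comments inside the pattern are ignored because of re.X
date_pattern = re.compile(r"""
    (\d{2})   # day
    /         # separator
    (\d{2})   # month
    /         # separator
    (\d{4})   # year
""", flags=re.X)

parts = date_pattern.search("due on 12/02/2019").groups()
print(parts)  # ('12', '02', '2019')
```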

## 7. re.DEBUG

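When set, this flag prints debugging information about the compiled pattern. Note that the pattern below is written as an ordinary (non-raw) string, so `"\b"` is the backspace character (code 8) rather than a word boundary, which is why the output shows `LITERAL 8`. Writing the pattern as a raw string, `r"\b[a-e7-9]+\b"`, would make `\b` a word boundary.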

In [128]:
pattern = re.compile("\b[a-e7-9]+\b", flags=re.DEBUG)

LITERAL 8
MAX_REPEAT 1 MAXREPEAT
  IN
    RANGE (97, 101)
    RANGE (55, 57)
LITERAL 8

 0. INFO 8 0b1 3 MAXREPEAT (to 9)
      prefix_skip 1
      prefix [0x8] ('\x08')
      overlap [0]
 9: LITERAL 0x8 ('\x08')
11. REPEAT_ONE 13 1 MAXREPEAT (to 25)
15.   IN 8 (to 24)
17.     RANGE 0x61 0x65 ('a'-'e')
20.     RANGE 0x37 0x39 ('7'-'9')
23.     FAILURE
24:   SUCCESS
25: LITERAL 0x8 ('\x08')
27. SUCCESS


# Grouping

> Frequently you need to obtain more information than just whether the regex pattern matched or not.

By placing part of a regular expression inside round brackets or parentheses `(`, `)`, you can **group that part** of the regex pattern together.

### Applications of grouping:

#### 1. apply a quantifier to the entire group.

For example, `(ab)+` will match one or more repetitions of `ab`.

In [129]:
txt = "abbbbbabbbb"

In [130]:
pattern1 = re.compile("ab+")
pattern2 = re.compile("(ab)+")

In [196]:
highlight_regex_matches(pattern1, txt)



In [197]:
highlight_regex_matches(pattern2, txt)



#### 2. restrict alternation to part of the regex.

For example, `my name is ram|sam` will match `my name is ram` and `sam` whereas `my name is (ram|sam)` will match `my name is ram` and `my name is sam`.

In [133]:
txt = """
my name is ram
my name is sam
"""

In [134]:
pattern1 = re.compile("my name is ram|sam")
pattern2 = re.compile("my name is (ram|sam)")

In [198]:
highlight_regex_matches(pattern1, txt)



In [199]:
highlight_regex_matches(pattern2, txt)



#### 3. capture the text matched by group.

- Groups indicated with `(`, `)` also capture the **starting** and **ending** index of the text that they match.

- Groups can be retrieved by passing an argument to `group()`, `start()`, `end()`, and `span()` of the `Match` object. 

- Groups are numbered starting with `0`. 

- Group `0` is always present; it captures the whole match, so all `Match` object methods have group `0` as their default argument.

Consider an example where we want to parse a date and determine day, month and year.

In [137]:
txt = "12/02/2019" 

In [138]:
pattern = re.compile("(\d{2})\/(\d{2})\/(\d{4})")

In [139]:
match = pattern.match(txt)

In [140]:
# group 0: matches entire regex pattern
match.group(0)

'12/02/2019'

In [141]:
day, month, year = match.groups()

In [142]:
day, month, year

('12', '02', '2019')
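Individual groups can also be queried by number. A short sketch on the same date string:

```python
import re

match = re.match(r"(\d{2})/(\d{2})/(\d{4})", "12/02/2019")

day = match.group(1)            # first group only
month_year = match.group(2, 3)  # several groups at once, as a tuple
month_start = match.start(2)    # index where group 2 starts
year_span = match.span(3)       # (start, end) of group 3

print(day, month_year, month_start, year_span)
# 12 ('02', '2019') 3 (6, 10)
```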

Let's try one more example of group capturing. 

In the given text, find all the patterns with `Name: <some-name>` and extract `<some-name>`. 

In [143]:
txt = """
Name: Nikhil
Age: 0
Roll No.: 15
Grade: S

Name: Ravi
Age: -1
Roll No.: 123
Grade: K

Name: Ram
Age: N/A
Roll No.: 1
Grade: G
"""

In [144]:
pattern = re.compile("Name: (.+)\n")

In [145]:
pattern.findall(txt)

['Nikhil', 'Ravi', 'Ram']

> Parentheses cannot be used inside character classes, at least not as metacharacters. When you put a parenthesis in a character class, it is treated as a literal character. So the regex `[(a)b]` matches `a`, `b`, `(`, and `)`.

# Backreferencing

**Backreferences** in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. 

> For example, `\1` will succeed if the exact contents of group `1` can be found at the current position, and fails otherwise.

### Example 1

Consider a scenario where we want to find all the duplicated words in the given text.

In [146]:
txt = """
hello hello
how are you
bye bye
"""

In [147]:
pattern = re.compile(r"(\w+) \1")

In [148]:
pattern.findall(txt)

['hello', 'bye']
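
Note that `findall()` returns the contents of group `1`, not the whole duplicated phrase. The sketch below uses `finditer()` to get the full matches, and also shows one possible refinement (an assumption, not part of the original example): adding `\b` word boundaries so that a repeated fragment inside longer words is not reported.

```python
import re

txt = """
hello hello
how are you
bye bye
"""

pattern = re.compile(r"(\w+) \1")

# group(0) is the complete match, group(1) is the repeated word.
full_matches = [m.group(0) for m in pattern.finditer(txt)]
print(full_matches)  # ['hello hello', 'bye bye']

# Without boundaries, the "is" at the end of "this" pairs up with "is".
loose = re.findall(r"(\w+) \1", "this is fine")
# With \b on both sides, only whole words can repeat.
strict = re.findall(r"\b(\w+) \1\b", "this is fine")

print(loose)   # ['is']
print(strict)  # []
```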

### Example 2

Consider a scenario where we want to find all dates with the format `dd/mm/yyyy` and change them to `yyyy-mm-dd` format. 

In [149]:
txt = """
today is 23/02/2019.
yesterday was 22/02/2019.
tomorrow is 24/02/2019.
"""

In [150]:
pattern = re.compile(r"(\d{2})/(\d{2})/(\d{4})")

In [151]:
newtxt = pattern.sub(r"\3-\2-\1", txt)

In [152]:
print(newtxt)


today is 2019-02-23.
yesterday was 2019-02-22.
tomorrow is 2019-02-24.



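> Backreferences, too, cannot be used inside a character class. The `\1` in a regex like `(a)[\1b]` is either an error or a needlessly escaped literal 1, depending on the regex flavor. Python's `re` module reads it as an octal escape, i.e. the character `\x01`.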
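
The following sketch checks how Python's `re` module treats `\1` inside a character class. It assumes the standard behaviour that numeric escapes inside `[...]` are read as characters (here the character `\x01`), not as backreferences.

```python
import re

pattern = re.compile(r"(a)[\1b]")

# If \1 were a backreference, "aa" would match. It does not.
no_backref = pattern.search("aa")

# "b" is in the class, so "ab" matches.
literal_b = pattern.search("ab").group()

# The class also contains the character with code 1.
control_char = pattern.search("a\x01").group()

print(no_backref)          # None
print(repr(literal_b))     # 'ab'
print(repr(control_char))  # 'a\x01'
```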

# Named Groups

> Using numbers to refer to groups can be tedious and confusing, and the worst thing is that it doesn't allow you to give meaning or context to the group. That's why we have named groups.

Instead of referring to groups by numbers, groups can be referenced by a name. Such a group is called a **named group**.

- The syntax for a named group is one of the Python-specific extensions: `(?P<name>...)`  where `name` is, obviously, the name of the group. 

- Named groups behave exactly like capturing groups, and additionally associate a name with a group.

- Here is a table which shows three different ways to refer to named groups:
    
<table style="border: 1px solid black; font-size:15px;">
<thead>
<tr>
    <th>Use</th>
    <th>Syntax</th>
</tr>
</thead>
    
<tbody>
<tr>
    <td>Inside a pattern</td>
    <td>(?P=name)</td>
</tr>
    
<tr>
    <td>In the repl string of the sub operation</td>
    <td>\g&lt;name&gt;</td>
</tr>

<tr>
    <td>In any of the operations of the Match object</td>
    <td>match.group('name')</td>
</tr>
</tbody>
</table>

### Example 1

Consider a scenario where we want to extract the first name and last name of a person.

In [153]:
txt = "Nahidul Islam"

In [154]:
pattern = re.compile(r"(?P<first>\w+) (?P<last>\w+)")

In [155]:
match = pattern.match(txt)

In [156]:
match.group('first')

'Nahidul'

In [157]:
match.group('last')

'Islam'

In [158]:
match.groupdict()

{'first': 'Nahidul', 'last': 'Islam'}
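
Since named groups are still ordinary capturing groups, they can also be reached by number. A short sketch on the same pattern, where `Pattern.groupindex` gives the name-to-number mapping:

```python
import re

pattern = re.compile(r"(?P<first>\w+) (?P<last>\w+)")
match = pattern.match("Nahidul Islam")

# The same group can be addressed by name or by position.
by_name = match.group("first")
by_number = match.group(1)

# Mapping of group names to group numbers.
name_to_number = dict(pattern.groupindex)

# Position methods accept names as well.
last_span = match.span("last")

print(by_name == by_number)  # True
print(name_to_number)        # {'first': 1, 'last': 2}
print(last_span)             # (8, 13)
```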

### Example 2

Now consider the scenario where we want to swap the first name and last name in the above example.

In [159]:
pattern.sub(r"\g<last> \g<first>", txt)

'Islam Nahidul'
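
The same `\g<name>` syntax can make the date rewrite from the backreferencing section easier to read. A sketch, assuming the group names `day`, `month`, and `year` (chosen here for illustration only):

```python
import re

txt = "today is 23/02/2019."

# Each part of the date gets a descriptive name instead of a number.
pattern = re.compile(r"(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4})")

# The replacement string refers to the groups by name.
iso = pattern.sub(r"\g<year>-\g<month>-\g<day>", txt)

print(iso)  # today is 2019-02-23.
```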

### Example 3

Consider a scenario where we want to check if a person has the same first and last name.

In [160]:
txt = "Jhonson Jhonson"

In [161]:
pattern = re.compile(r"(?P<first>\w+) (?P=first)")

In [162]:
pattern.findall(txt)

['Jhonson']

# Non-Capturing Groups

> There are cases when we want to use groups but are not interested in capturing the text matched inside the parentheses. An example is **alternation**.

Let's consider an example where we want to find the strings `i love cats` or `i love dogs` in the given text.

In [163]:
txt = """
i love cats
i love dogs
"""

In [164]:
pattern = re.compile("i love (cats|dogs)")

In [165]:
pattern.findall(txt)

['cats', 'dogs']

In [166]:
for match in pattern.finditer(txt):
    print("Complete regex match (default):", match.group(0))
    print("Match captured by 1st group:", match.group(1))

Complete regex match (default): i love cats
Match captured by 1st group: cats
Complete regex match (default): i love dogs
Match captured by 1st group: dogs


As we can see, the captured group contains only `cats` or `dogs` instead of the complete sentences, and `findall()` returns the captured group rather than the whole match.

Hence, to make a group **non-capturing**, we have to use the syntax `(?:pattern)`.

In [167]:
pattern = re.compile("i love (?:cats|dogs)")

In [168]:
pattern.findall(txt)

['i love cats', 'i love dogs']

> With the non-capturing syntax, the pattern matches the same text as before, but the engine no longer stores the group's contents, so `findall()` returns the whole match. Note that a non-capturing group cannot be referenced by number or by name.
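
The effect on group numbering can be inspected through `Pattern.groups`. The sketch below also mixes both kinds of group in one pattern; the `mixed` pattern and its sample text are made up for illustration.

```python
import re

capturing = re.compile(r"i love (cats|dogs)")
non_capturing = re.compile(r"i love (?:cats|dogs)")

# Only capturing groups are counted.
capturing_count = capturing.groups
non_capturing_count = non_capturing.groups

# (?:i|we) groups the alternation without capturing it,
# so the animal is still group 1.
mixed = re.compile(r"(?:i|we) love (cats|dogs)")
animals = mixed.findall("we love dogs, i love cats")

print(capturing_count, non_capturing_count)  # 1 0
print(animals)                               # ['dogs', 'cats']
```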

# Zero-width assertions

- Characters which indicate positions rather than actual content are called **zero-width assertions**.


- For instance, the caret (`^`) represents the beginning of a line and the dollar sign (`$`) represents the end of a line. 


- They assert something about the current position without consuming characters; they simply succeed or fail.


- A more powerful kind of **zero-width assertion** is **look around**, a mechanism that checks what comes before (**look behind**) or after (**look ahead**) the current position.
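
"Zero-width" can be seen directly in the match spans: the start and end index of every match are equal. A minimal sketch using `^`, `$` (with `re.MULTILINE`) and the word boundary `\b`; the sample strings are illustrative.

```python
import re

txt = "first line\nsecond line"

# With re.MULTILINE, ^ and $ match at the start and end of every line.
line_starts = [m.span() for m in re.finditer(r"^", txt, flags=re.MULTILINE)]
line_ends = [m.span() for m in re.finditer(r"$", txt, flags=re.MULTILINE)]

# \b matches at positions between a word character and a non-word character.
word_edges = [m.start() for m in re.finditer(r"\b", "hi there")]

print(line_starts)  # [(0, 0), (11, 11)]
print(line_ends)    # [(10, 10), (22, 22)]
print(word_edges)   # [0, 2, 3, 8]
```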


# Look around


**Look around** is a mechanism that, during matching, looks forward or backward from the current position (depending on the type of look around used) to see whether **some** pattern matches, before continuing with the actual match.

The most important thing to understand here is that the **look around** mechanism consists of 2 parts:
- **actual expression**: an expression whose match constitutes the final **result**.
- **non-consuming expression**: an expression that is only tested at the current position to see whether it can match. It is **not actually consumed** by the regex engine.
    - If the look around check **succeeds**, the regex engine discards whatever the non-consuming expression matched and continues from the position where the check started. 
    - If the look around check **fails**, the current match attempt fails, and the engine backtracks or moves to the next character of the given text and repeats the whole match process.

There are 2 main categories of **look around**  which, in turn, have 2 sub-categories each.

![](images/lookaround.png)

Let's explore each one of them one by one.

# Look ahead

**Look ahead** mechanism checks the match for a non-consuming expression **ahead** of a given pattern.


## Positive look ahead

- **Positive look ahead** will succeed if the passed non-consuming expression **does match** against the forthcoming input.

- The syntax is `A(?=B)` where `A` is the **actual expression** and `B` is the **non-consuming expression**. 


Let's check out an example to understand the concept. Let's assume that we want to find a match for `love` in the given text only if it is followed by `regex`.

In [169]:
txt = "i love python, i love regex"

In [170]:
pattern = re.compile(r"love(?=\sregex)")

In [171]:
match = pattern.search(txt)

In [200]:
highlight_regex_matches(pattern, txt)

[43m[1ml[0m[43m[1mo[0m[43m[1mv[0m[43m[1me[0m [43m[1mr[0m[43m[1me[0m[43m[1mg[0m[43m[1me[0m[43m[1mx[0m [43m[1mo[0m[43m[1mr[0m [43m[1mh[0m[43m[1ma[0m[43m[1mt[0m[43m[1me[0m [43m[1mr[0m[43m[1me[0m[43m[1mg[0m[43m[1me[0m[43m[1mx[0m, [43m[1mc[0m[43m[1ma[0m[43m[1mn[0m'[43m[1mt[0m [43m[1mi[0m[43m[1mg[0m[43m[1mn[0m[43m[1mo[0m[43m[1mr[0m[43m[1me[0m [43m[1mr[0m[43m[1me[0m[43m[1mg[0m[43m[1me[0m[43m[1mx[0m


With the **positive look ahead** mechanism, only 4 characters (index 17 to 21) are consumed for the match; the ` regex` that follows is checked but not consumed.
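
To make the "non-consuming" part concrete, this sketch compares the look ahead with an ordinary pattern that consumes the same text. The name `consuming` is illustrative.

```python
import re

txt = "i love python, i love regex"

# The look ahead only checks that " regex" follows.
lookahead = re.search(r"love(?=\sregex)", txt)

# The ordinary pattern includes " regex" in the match.
consuming = re.search(r"love\sregex", txt)

print(lookahead.group(), lookahead.span())  # love (17, 21)
print(consuming.group(), consuming.span())  # love regex (17, 27)
```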

Let us check out another example: finding all words in the given text which are followed by `.` or `,`.

In [173]:
txt = "My favorite colors are red, green, and blue."

In [174]:
pattern = re.compile(r"\w+(?=,|\.)")

In [175]:
pattern.findall(txt)

['red', 'green', 'blue']

In [201]:
highlight_regex_matches(pattern, txt)

[43m[1ml[0m[43m[1mo[0m[43m[1mv[0m[43m[1me[0m [43m[1mr[0m[43m[1me[0m[43m[1mg[0m[43m[1me[0m[43m[1mx[0m [43m[1mo[0m[43m[1mr[0m [43m[1mh[0m[43m[1ma[0m[43m[1mt[0m[43m[1me[0m [43m[1mr[0m[43m[1me[0m[43m[1mg[0m[43m[1me[0m[43m[1mx[0m, [43m[1mc[0m[43m[1ma[0m[43m[1mn[0m'[43m[1mt[0m [43m[1mi[0m[43m[1mg[0m[43m[1mn[0m[43m[1mo[0m[43m[1mr[0m[43m[1me[0m [43m[1mr[0m[43m[1me[0m[43m[1mg[0m[43m[1me[0m[43m[1mx[0m


## Negative look ahead

- **Negative look ahead** will succeed if the passed non-consuming expression **does not match** against the forthcoming input.

- The syntax is `A(?!B)` where `A` is the **actual expression** and `B` is the **non-consuming expression**. 


Let's assume that we want to find a match for `love` in the given text only if it is NOT followed by `regex`.

In [177]:
txt = "i love python, i love regex"

In [178]:
pattern = re.compile(r"love(?!\sregex)")

In [202]:
highlight_regex_matches(pattern, txt)

[43m[1ml[0m[43m[1mo[0m[43m[1mv[0m[43m[1me[0m [43m[1mr[0m[43m[1me[0m[43m[1mg[0m[43m[1me[0m[43m[1mx[0m [43m[1mo[0m[43m[1mr[0m [43m[1mh[0m[43m[1ma[0m[43m[1mt[0m[43m[1me[0m [43m[1mr[0m[43m[1me[0m[43m[1mg[0m[43m[1me[0m[43m[1mx[0m, [43m[1mc[0m[43m[1ma[0m[43m[1mn[0m'[43m[1mt[0m [43m[1mi[0m[43m[1mg[0m[43m[1mn[0m[43m[1mo[0m[43m[1mr[0m[43m[1me[0m [43m[1mr[0m[43m[1me[0m[43m[1mg[0m[43m[1me[0m[43m[1mx[0m


# Look behind


**Look behind** mechanism checks the match for a non-consuming expression **behind** a given pattern.


## Positive look behind

- **Positive look behind** will succeed if the passed non-consuming expression **does match** against the preceding input.

- The syntax is `(?<=B)A` where `A` is the **actual expression** and `B` is the **non-consuming expression**. 


Let's check out an example to understand the concept. Let's assume that we want to find a match for `regex` in the given text only if it is preceded by `love` or `hate`.

In [180]:
txt = "love regex or hate regex, can't ignore regex"

In [181]:
pattern = re.compile(r"(?<=(love|hate)\s)regex")

In [203]:
highlight_regex_matches(pattern, txt)

[43m[1ml[0m[43m[1mo[0m[43m[1mv[0m[43m[1me[0m [43m[1mr[0m[43m[1me[0m[43m[1mg[0m[43m[1me[0m[43m[1mx[0m [43m[1mo[0m[43m[1mr[0m [43m[1mh[0m[43m[1ma[0m[43m[1mt[0m[43m[1me[0m [43m[1mr[0m[43m[1me[0m[43m[1mg[0m[43m[1me[0m[43m[1mx[0m, [43m[1mc[0m[43m[1ma[0m[43m[1mn[0m'[43m[1mt[0m [43m[1mi[0m[43m[1mg[0m[43m[1mn[0m[43m[1mo[0m[43m[1mr[0m[43m[1me[0m [43m[1mr[0m[43m[1me[0m[43m[1mg[0m[43m[1me[0m[43m[1mx[0m


## Negative look behind

- **Negative look behind** will succeed if the passed non-consuming expression **does not match** against the preceding input.

- The syntax is `(?<!B)A` where `A` is the **actual expression** and `B` is the **non-consuming expression**. 


Let's assume that we want to find a match for `regex` in the given text only if it is NOT preceded by `love` or `hate`.

In [183]:
pattern = re.compile(r"(?<!(love|hate)\s)regex")

In [204]:
highlight_regex_matches(pattern, txt)

[43m[1ml[0m[43m[1mo[0m[43m[1mv[0m[43m[1me[0m [43m[1mr[0m[43m[1me[0m[43m[1mg[0m[43m[1me[0m[43m[1mx[0m [43m[1mo[0m[43m[1mr[0m [43m[1mh[0m[43m[1ma[0m[43m[1mt[0m[43m[1me[0m [43m[1mr[0m[43m[1me[0m[43m[1mg[0m[43m[1me[0m[43m[1mx[0m, [43m[1mc[0m[43m[1ma[0m[43m[1mn[0m'[43m[1mt[0m [43m[1mi[0m[43m[1mg[0m[43m[1mn[0m[43m[1mo[0m[43m[1mr[0m[43m[1me[0m [43m[1mr[0m[43m[1me[0m[43m[1mg[0m[43m[1me[0m[43m[1mx[0m
