# <center>RegEx in Python</center>

![](images/memes/meme21.jpg)

# Compilation Flags

- When compiling a pattern string into a pattern object, it's possible to **modify the standard behavior of the patterns** using **Compilation Flags**.

- Multiple compilation flags can be combined using the bitwise OR operator (`|`), for example `re.I | re.M`.
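
As a minimal sketch of combining flags (the sample text and pattern here are illustrative assumptions, not part of the examples that follow), the snippet below joins `re.I` and `re.M` with `|` so that the pattern is both case-insensitive and line-aware:

```python
import re

# Illustrative sample text (an assumption made for this sketch).
text = "first line\nTHE second line\nthe third line"

# re.I lets "the" match "THE"; re.M lets ^ match at the start of every line.
pattern = re.compile(r"^the", flags=re.I | re.M)

print(pattern.findall(text))  # ['THE', 'the']
```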

Here is a list of all the compilation flags:

<table style="border: 1px solid black; font-size:15px;">
<thead>
<tr>
    <th>Syntax</th>
    <th>Meaning</th>
</tr>
</thead>
    
<tbody>
<tr>
    <td>re.IGNORECASE or re.I</td>
    <td>ignore case.</td>
</tr>

<tr>
    <td>re.MULTILINE or re.M</td>
    <td>make the anchors ^ and $ match at the start and end of each line.</td>
</tr>

<tr>
    <td>re.DOTALL or re.S</td>
    <td>make . match newline too.</td>
</tr>

<tr>
    <td>re.UNICODE or re.U</td>
    <td>make {\w, \W, \b, \B} follow Unicode rules.</td>
</tr>

<tr>
    <td>re.LOCALE or re.L</td>
    <td>make {\w, \W, \b, \B} follow locale.</td>
</tr>

<tr>
    <td>re.ASCII or re.A</td>
    <td>make {\w, \W, \b, \B} perform ASCII-only matching.</td>
</tr>

<tr>
    <td>re.VERBOSE or re.X</td>
    <td>allow whitespace and comments in regex.</td>
</tr>

<tr>
    <td>re.DEBUG</td>
    <td>print debugging information about the compiled pattern.</td>
</tr>
</tbody>
</table>

Let's go through them one by one.

## 1. re.IGNORECASE or re.I

This flag makes a regex pattern case-insensitive.


Let's check out an example to find all occurrences of `the` and `The` in the given text.

In [1]:
import re
from utils import highlight_regex_matches

In [2]:
txt = """
The best thing about regex is that it makes the task of string manipulation so easy.
"""

In [3]:
pattern = re.compile("the", flags=re.I)

In [4]:
pattern

re.compile(r'the', re.IGNORECASE|re.UNICODE)

In [5]:
highlight_regex_matches(pattern, txt)


[43m[1mThe[0m best thing about regex is that it makes [43m[1mthe[0m task of string manipulation so easy.



## 2. re.MULTILINE or re.M

This flag makes the anchors `^` and `$` match at the start and end of each line of the given text, not only at the start and end of the whole text.


Let's check out an example to find all lines starting with `A`.

In [6]:
txt = """
A man was crossing the road.
Suddenly, a car passed before him in a very high speed.
He was terrified
And shocked.
"""

In [7]:
pattern = re.compile("^A.+", flags=re.M)

In [8]:
highlight_regex_matches(pattern, txt)


[43m[1mA man was crossing the road.[0m
Suddenly, a car passed before him in a very high speed.
He was terrified
[43m[1mAnd shocked.[0m



## 3. re.DOTALL or re.S

By default, the `.` metacharacter matches any character except a newline. Setting this flag makes `.` match newlines as well.

Let's consider an example that matches all the text after (and including) `car`.

In [9]:
pattern = re.compile("car.+", flags=re.S)

In [10]:
highlight_regex_matches(pattern, txt)


A man was crossing the road.
Suddenly, a **car passed before him in a very high speed.**
**He was terrified**
**And shocked.**
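
A short side-by-side sketch (with a shortened sample text, an assumption for brevity) shows where the match stops when the flag is not set:

```python
import re

txt = "Suddenly, a car passed by.\nHe was terrified\nAnd shocked."

# Without re.S, "." stops at the first newline.
print(re.findall(r"car.+", txt))
# ['car passed by.']

# With re.S, "." also consumes newlines, so the match runs to the end of the text.
print(re.findall(r"car.+", txt, flags=re.S))
# ['car passed by.\nHe was terrified\nAnd shocked.']
```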


## 4. re.UNICODE or re.U

Using this flag, we can make the pattern characters `{\w, \W, \b, \B}` dependent on the Unicode character properties database.

> re.UNICODE is the default flag in Python 3 regex patterns.

Let's consider an example that works with Hindi text.

In [11]:
txt = "मुझे किताबें पढ़ना बहुत पसंद है।"

In [12]:
pattern = re.compile(r"\w+")

In [13]:
pattern.findall(txt)

['म', 'झ', 'क', 'त', 'ब', 'पढ', 'न', 'बह', 'त', 'पस', 'द', 'ह']

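The built-in `re` module does not count Devanagari combining marks (vowel signs and similar characters) as word characters, so `\w+` splits each word at them. The third-party `regex` module treats these marks as part of a word ([solution on Stack Overflow](https://stackoverflow.com/questions/12746458/python-unicode-regular-expression-matching-failing-with-some-unicode-characters/12747529#12747529)):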

In [14]:
import regex

In [15]:
pattern = regex.compile(r"\w+")

In [16]:
pattern.findall(txt)

['मुझे', 'किताबें', 'पढ़ना', 'बहुत', 'पसंद', 'है']
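
To check why the built-in `re` module splits these words, the following sketch inspects one word character by character. It assumes only the standard `unicodedata` module and prints each character's Unicode category next to whether `\w` matches it. Characters in the mark categories (`Mn`, `Mc`) are the ones `re` rejects:

```python
import re
import unicodedata

word = "किताबें"  # one word taken from the sample sentence

for ch in word:
    # True when the single character counts as a word character for re.
    is_word_char = re.fullmatch(r"\w", ch) is not None
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_word_char)

# The combining marks break the word into separate matches.
print(re.findall(r"\w+", word))
```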

## 5. re.LOCALE or re.L

> A locale is a set of environmental variables that defines the language, country, and character encoding settings (or any other special variant preferences) for your applications.

This flag makes the word patterns `{\w, \W}` and the boundary patterns `{\b, \B}` dependent on the current locale.

<span style="color:red;">**The use of this flag is discouraged in Python 3: the locale mechanism is very unreliable, it handles only one “culture” at a time, and it works only with 8-bit locales. Unicode matching is already enabled by default in Python 3 for Unicode (str) patterns, and it can handle different locales and languages.**</span>
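
As a hedged illustration of how restricted this flag is, the sketch below assumes Python 3.6 or later, where `re.LOCALE` is accepted only for bytes patterns. The sample bytes string is an assumption, and what `\w` matches beyond ASCII depends on the active locale:

```python
import re

# Combining re.L with a str pattern is rejected (Python 3.6+).
try:
    re.compile(r"\w+", flags=re.L)
except ValueError as exc:
    print("str pattern:", exc)

# A bytes pattern is accepted; non-ASCII matching then follows the current locale.
pattern = re.compile(rb"\w+", flags=re.L)
print(pattern.findall(b"plain ascii words"))
```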


## 6. re.ASCII or re.A

This flag makes the word patterns `{\w, \W}` and the boundary patterns `{\b, \B}` perform ASCII-only matching, i.e. only `A-Z`, `a-z`, `0-9` and the underscore `_` are treated as word characters.
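
A compact sketch of the difference, using a short sample string chosen for illustration (the accented letters are written as escapes so the code points are unambiguous):

```python
import re

txt = "na\u00efve caf\u00e9_2"  # "naïve café_2"

# Default (Unicode) matching: accented letters are word characters.
print(re.findall(r"\w+", txt))
# ['naïve', 'café_2']

# ASCII-only matching: accented letters split the words.
print(re.findall(r"\w+", txt, flags=re.A))
# ['na', 've', 'caf', '_2']
```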

Let us see an example over the first 256 Unicode code points:

In [17]:
chars =  ''.join(chr(i) for i in range(256))

In [18]:
print(chars)

 	
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ


In [19]:
pattern = re.compile(r"\w")

In [20]:
highlight_regex_matches(pattern, chars)

 	
 !"#$%&'()*+,-./[43m[1m0[0m[43m[1m1[0m[43m[1m2[0m[43m[1m3[0m[43m[1m4[0m[43m[1m5[0m[43m[1m6[0m[43m[1m7[0m[43m[1m8[0m[43m[1m9[0m:;<=>?@[43m[1mA[0m[43m[1mB[0m[43m[1mC[0m[43m[1mD[0m[43m[1mE[0m[43m[1mF[0m[43m[1mG[0m[43m[1mH[0m[43m[1mI[0m[43m[1mJ[0m[43m[1mK[0m[43m[1mL[0m[43m[1mM[0m[43m[1mN[0m[43m[1mO[0m[43m[1mP[0m[43m[1mQ[0m[43m[1mR[0m[43m[1mS[0m[43m[1mT[0m[43m[1mU[0m[43m[1mV[0m[43m[1mW[0m[43m[1mX[0m[43m[1mY[0m[43m[1mZ[0m[\]^[43m[1m_[0m`[43m[1ma[0m[43m[1mb[0m[43m[1mc[0m[43m[1md[0m[43m[1me[0m[43m[1mf[0m[43m[1mg[0m[43m[1mh[0m[43m[1mi[0m[43m[1mj[0m[43m[1mk[0m[43m[1ml[0m[43m[1mm[0m[43m[1mn[0m[43m[1mo[0m[43m[1mp[0m[43m[1mq[0m[43m[1mr[0m[43m[1ms[0m[43m[1mt[0m[43m[1mu[0m[43m[1mv[0m[43m[1mw[0m[43m[1mx[0m[43m[1my[0m[43m[1mz[0m{|}~ ¡¢£¤¥¦§¨©[43m[1mª

In [21]:
pattern = re.compile(r"\w", flags=re.A)

In [22]:
highlight_regex_matches(pattern, chars)

 	
 !"#$%&'()*+,-./[43m[1m0[0m[43m[1m1[0m[43m[1m2[0m[43m[1m3[0m[43m[1m4[0m[43m[1m5[0m[43m[1m6[0m[43m[1m7[0m[43m[1m8[0m[43m[1m9[0m:;<=>?@[43m[1mA[0m[43m[1mB[0m[43m[1mC[0m[43m[1mD[0m[43m[1mE[0m[43m[1mF[0m[43m[1mG[0m[43m[1mH[0m[43m[1mI[0m[43m[1mJ[0m[43m[1mK[0m[43m[1mL[0m[43m[1mM[0m[43m[1mN[0m[43m[1mO[0m[43m[1mP[0m[43m[1mQ[0m[43m[1mR[0m[43m[1mS[0m[43m[1mT[0m[43m[1mU[0m[43m[1mV[0m[43m[1mW[0m[43m[1mX[0m[43m[1mY[0m[43m[1mZ[0m[\]^[43m[1m_[0m`[43m[1ma[0m[43m[1mb[0m[43m[1mc[0m[43m[1md[0m[43m[1me[0m[43m[1mf[0m[43m[1mg[0m[43m[1mh[0m[43m[1mi[0m[43m[1mj[0m[43m[1mk[0m[43m[1ml[0m[43m[1mm[0m[43m[1mn[0m[43m[1mo[0m[43m[1mp[0m[43m[1mq[0m[43m[1mr[0m[43m[1ms[0m[43m[1mt[0m[43m[1mu[0m[43m[1mv[0m[43m[1mw[0m[43m[1mx[0m[43m[1my[0m[43m[1mz[0m{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´

## 7. re.VERBOSE or re.X

This flag changes how the pattern is parsed so that you can add whitespace and comments to annotate a regex.

- Whitespace within the pattern is ignored, except when it is in a character class or preceded by an unescaped backslash.

- When a line contains a `#` that is neither in a character class nor preceded by an unescaped backslash, all characters from the leftmost such `#` through the end of the line are ignored.

In [23]:
txt = """
This is a sample text123
"""

In [24]:
pattern = re.compile(r"\w +")

In [25]:
pattern.findall(txt)

['s ', 's ', 'a ', 'e ']

In [26]:
pattern = re.compile(r"\w +  # find all words", flags=re.X)

In [27]:
pattern.findall(txt)

['This', 'is', 'a', 'sample', 'text123']
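
The flag is most useful for longer patterns spread over several lines. The sketch below uses a hypothetical date pattern (the names `date_pattern` and `space_pattern` and the sample strings are assumptions for illustration) to annotate each part, and shows one way to keep a literal space when whitespace is otherwise ignored:

```python
import re

date_pattern = re.compile(
    r"""
    (\d{4})   # year: four digits
    -         # literal separator
    (\d{2})   # month: two digits
    -         # literal separator
    (\d{2})   # day: two digits
    """,
    flags=re.X,
)

print(date_pattern.findall("Released 2021-03-15, patched 2021-04-02."))
# [('2021', '03', '15'), ('2021', '04', '02')]

# Inside a character class, whitespace is kept, so [ ] matches a literal space.
space_pattern = re.compile(r"\w+ [ ] \w+", flags=re.X)
print(space_pattern.findall("two words"))
# ['two words']
```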

## 8. re.DEBUG

When set, this flag prints debugging information about the compiled pattern, such as its parse tree.

In [28]:
pattern = re.compile(r"\b[a-e7-9]+\b", flags=re.DEBUG)

AT AT_BOUNDARY
MAX_REPEAT 1 MAXREPEAT
  IN
    RANGE (97, 101)
    RANGE (55, 57)
AT AT_BOUNDARY
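
One practical use of this flag is checking how an escape such as `\b` actually reaches the regex engine. The sketch below introduces a hypothetical helper, `debug_dump`, which captures what `re.DEBUG` prints; it assumes the debug output goes to standard output, and the exact layout may vary between Python versions. In a normal string, `"\b"` is a backspace character (code 8), while in a raw string it is a word boundary:

```python
import contextlib
import io
import re


def debug_dump(pattern: str) -> str:
    """Return the text that re.DEBUG prints while compiling `pattern`."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, flags=re.DEBUG)
    return buffer.getvalue()


# Normal string: Python turns "\b" into a backspace before re sees the pattern.
print(debug_dump("\b[a-e]+\b"))

# Raw string: the backslash survives, so re parses \b as a word boundary.
print(debug_dump(r"\b[a-e]+\b"))
```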


![](images/memes/meme22.jpg)