## Regex - Functions and Patterns

A regular expression is a sequence of characters, written in a specialized pattern syntax, that matches or finds other strings or sets of strings. Several characters have a special meaning inside a regular expression, and many of them involve backslashes. To avoid clashes with Python's own string escapes, we write patterns as raw strings: **r'expression'**.

Regex in Python<br>
<ul><li>Functions</li><li>Flags</li><li>Regex Patterns</li></ul>

### Functions
<hr>

#### re.compile
`re.compile(pattern, flags=0)` compiles the pattern into a regex object. This is most useful when the same pattern is matched many times, because the compiled object can be reused.

In [1]:
import re

text = "Hello world."
search = re.compile(text)
print(search)

re.compile('Hello world.')


In [2]:
print(type(search))

<class 're.Pattern'>
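
As a minimal sketch of why a compiled pattern is handy, the cell below compiles a pattern once and reuses it over several strings. The `lines` list and the `error_pattern` name are made up for illustration and are not part of the examples above.

```python
import re

# Hypothetical sample data, used only for this illustration.
lines = ["error: disk full", "ok", "error: timeout"]

# Compile once, then reuse the same pattern object for every line.
error_pattern = re.compile(r"^error: (\w+)")

reasons = []
for line in lines:
    found = error_pattern.search(line)
    if found:
        # group(1) is the first word after "error: "
        reasons.append(found.group(1))

print(reasons)
```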


<hr>

#### re.search
`re.search(pattern, string, flags)` - It scans through the entire string for the pattern and returns a match object. It returns `None` if there is no match.

In [4]:
text = "NLP is Natural Language Processing."
search_word1 = "NLP"
search_word2 = "sample"
search = re.search(search_word1, text)

print(f"The original text is - {text}")
print(f"Match for search word 1 - {search_word1} = {search}")
search = re.search(search_word2, text)
print(f"Match for search word 2 - {search_word2} = {search}")

The original text is - NLP is Natural Language Processing.
Match for search word 1 - NLP = <re.Match object; span=(0, 3), match='NLP'>
Match for search word 2 - sample = None


<hr>

#### re.match()
`re.match(pattern, string, flags)` - It checks if the pattern is present at the beginning of the string and returns a match object. It returns `None` if there is no match.

In [5]:
text = "NLP is Natural Language Processing."
print(f"The original text is - {text}")

search_word1 = "NLP"
search_word2 = "sample"
search_word3 = "Natural"

search = re.match(search_word1, text)
print(f"Match for search word 1 - {search_word1} = {search}")

search = re.match(search_word2, text)
print(f"Match for search word 2 - {search_word2} = {search}")

search = re.match(search_word3, text)
print(f"Match for search word 3 - {search_word3} = {search}")


The original text is - NLP is Natural Language Processing.
Match for search word 1 - NLP = <re.Match object; span=(0, 3), match='NLP'>
Match for search word 2 - sample = None
Match for search word 3 - Natural = None


<hr>

##### Using `match` when we have multiple lines in our text

In [10]:
string_with_newlines = """something
someotherthing"""

print(f"Our original string:\n{string_with_newlines}\n")

print("Matching 'some' with our text input")
print (re.match('some', string_with_newlines)) # matches
print()

print("Matching 'someother' with our text input")
print (re.match('someother', string_with_newlines)) # won't match
print()

print("Matching '^someother' with our text input and using re.MULTILINE flag")
print (re.match('^someother', string_with_newlines,re.MULTILINE)) # also won't match
print()

print("Searching 'someother' with our text input")
print (re.search('someother', string_with_newlines)) # finds something
print()

print("Searching '^someother' with our text input and using re.MULTILINE flag")
print (re.search('^someother', string_with_newlines,re.MULTILINE)) # also finds something

Our original string:
something
someotherthing

Matching 'some' with our text input
<re.Match object; span=(0, 4), match='some'>

Matching 'someother' with our text input
None

Matching '^someother' with our text input and using re.MULTILINE flag
None

Searching 'someother' with our text input
<re.Match object; span=(10, 19), match='someother'>

Searching '^someother' with our text input and using re.MULTILINE flag
<re.Match object; span=(10, 19), match='someother'>


In [None]:
m = re.compile('thing$', re.MULTILINE)

print (m.match(string_with_newlines)) # no match
print (m.match(string_with_newlines, pos=4)) # matches
print (m.search(string_with_newlines, re.MULTILINE)) # also matches, but re.MULTILINE is read as pos=8 here, not as a flag
(re.MULTILINE, re.I, re.X, re.A)

#### Notes on re.match and re.search
<ul>
    <li>re.match and re.search both take three arguments: pattern, string and flags.</li>
    <li>If the pattern passed to re.match or re.search is already compiled, flags cannot be given; a non-zero flags argument raises a ValueError.</li>
    <li>pattern.match and pattern.search take one, two or three arguments:
        <ul>
            <li>string</li>
            <li>start pos</li>
            <li>end pos</li>
        </ul>
    </li>
    <li>pattern.match and pattern.search do not take flags as an argument. Here `pattern` is not a string but the object returned by re.compile, so any flags must be given to re.compile.</li>
    <li>In the above cell, re.MULTILINE is an integer-valued flag (its value is 8), so passing it to pattern.search makes it the start pos rather than a flag.</li>
</ul>
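
The notes above can be checked with a short sketch. The variable names (`compiled`, `sample`, `flags_error` and so on) are chosen here for illustration only.

```python
import re

compiled = re.compile(r"thing$", re.MULTILINE)
sample = "something\nsomeotherthing"

# Passing flags to re.search together with a compiled pattern raises ValueError.
try:
    re.search(compiled, sample, re.IGNORECASE)
    flags_error = None
except ValueError as exc:
    flags_error = str(exc)
print(flags_error)

# re.MULTILINE behaves like the integer 8, so this call is read as pos=8.
as_flag = compiled.search(sample, re.MULTILINE)
as_pos = compiled.search(sample, 8)
print(as_flag.span(), as_pos.span())

# endpos=9 limits the search to the first line, "something".
first_line = compiled.search(sample, 0, 9)
print(first_line.span())
```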

[Differences between re.match and re.search](https://stackoverflow.com/questions/180986/what-is-the-difference-between-re-search-and-re-match)

<hr>

##### re.fullmatch()
`re.fullmatch(pattern, string, flags)` - If the entire string matches the regex pattern it returns the match object, else it returns `None`.

In [None]:
print(re.fullmatch("Hello", "Hello world"))

In [None]:
print(re.fullmatch("Hello world", "Hello world"))
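
To put `match`, `search` and `fullmatch` side by side, here is a small sketch on one string. The result names are illustrative only.

```python
import re

sample = "Hello world"

at_start = re.match("world", sample)          # None: "world" is not at the start
anywhere = re.search("world", sample)         # found later in the string
partial = re.fullmatch("Hello", sample)       # None: the pattern must cover the whole string
whole = re.fullmatch(r"Hello \w+", sample)    # covers the whole string

print(at_start, anywhere, partial, whole, sep="\n")
```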

<hr>

##### re.split()
`re.split(pattern, string, flags)` - Splits the string by the occurrences of pattern. `str.split()` is the simpler choice when splitting on a constant string.

In [None]:
text = "One:two::t h r e e:::fourth field"
re.split(":+",text)

In [None]:
text.split(":")

In [None]:
re.split(r"\W+",text)

re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)

In [None]:
re.split('([a-f]+)', '0a3B9', flags=re.IGNORECASE)

In [None]:
print(re.split(r'(\W+)', '...words, words...'))
print(re.split(r'\W+', '...words, words...'))
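
`re.split` also accepts a `maxsplit` argument that caps the number of splits. A small sketch with an illustrative `fields` string:

```python
import re

fields = "One:two::three:::four"

# Without maxsplit every run of colons is a split point.
all_parts = re.split(r":+", fields)

# With maxsplit=2 only the first two runs are used; the rest stays intact.
limited = re.split(r":+", fields, maxsplit=2)

print(all_parts)
print(limited)
```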

<hr>

##### re.findall()
`re.findall(pattern, string, flags)` - Returns all non-overlapping matches of pattern in string, as a list of strings.

In [None]:
text = """
Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the 
pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the 
remainder of the string is returned as the final element of the list.
"""

In [None]:
re.findall(r'\w*split',text, re.I)

In [None]:
print(re.findall(r'(\w*)(t) ',text, re.I)[:5])
print(re.findall(r'(\w*)t ',text, re.I)[:5])

If one or more groups are present in the pattern, `findall` returns a list of groups; this will be a list of tuples if the pattern has more than one group.

In [None]:
re.findall(r'([a-z]+)\s(\d+)', 'abcdefg123 and again test 456.')

<hr>

##### re.finditer()
`re.finditer(pattern, string, flags)` - Returns an iterator yielding match objects.

In [None]:
z = re.finditer(r'\w*split',text, re.I)
print(z)
for item in z:
    print(item)

In [None]:
for i in re.finditer(r'(\w*)(t) ',text, re.I):
    print(i.group(0),i.group(1),i.group(2))
print()
print(*re.finditer(r'(\w*)t ',text, re.I), sep="\n")
print()
for i in re.finditer(r'(\w*)t ',text, re.I):
    print(i.group(0),i.group(1))


<hr>

##### re.sub()
`re.sub(pattern, repl, string, count=0, flags=0)` - Returns the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.

In [None]:
text = "Return all non-overlapping matches of pattern in string, as a list of strings."
print(re.sub(r'string\w*', "text",text))

`repl` can be a function

In [None]:
def my_replace(match):
    match1 = match.group(1)
    match2 = match.group(2)
    match2 = match2.replace('@', '')
    return u"{0:0.{1}f}".format(float(match1), int(match2))

string = 'The first number is 14.2@1, and the second number is 50.6@4.'
result = re.sub(r'([0-9]+\.[0-9]+)(@[0-9]+)', my_replace, string)  # \. matches a literal decimal point

print(result)
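
Two further features of `re.sub` are backreferences in `repl` and the `count` argument. The sketch below uses a made-up `dates` string to show both.

```python
import re

dates = "2021-03-04 and 2022-11-30"

# \1, \2, \3 in the replacement refer to the captured groups.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", dates)

# count=1 replaces only the first occurrence.
first_only = re.sub(r"\d{4}", "YYYY", dates, count=1)

print(swapped)
print(first_only)
```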

<hr>

##### re.subn()
`re.subn(pattern, repl, string, count, flags)` - Performs the same operation as sub(), but returns a tuple (new_string, number_of_subs_made).

In [None]:
text = "Return all non-overlapping matches of pattern in string, as a list of strings."
print(re.subn(r'string\w*', "text",text))

In [None]:
string = 'The first number is 14.2@1, and the second number is 50.6@4.'
result = re.subn(r'([0-9]+\.[0-9]+)(@[0-9]+)', my_replace, string)  # \. matches a literal decimal point

print(result)

<hr>

##### re.escape()
`re.escape(pattern)` - Escape special characters in pattern.

In [None]:
print(re.escape('http://www.python.org'))

In [None]:
z= '['+re.escape(r'\ a.*$')+']'
re.findall(z, "This is a sample string$.")

In [None]:
import string
legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:/"
print('[%s]+' % re.escape(legal_chars))
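
A common use of `re.escape` is searching for text that should be taken literally. In this sketch the `needle` and `haystack` strings are invented for illustration.

```python
import re

needle = "1+1=2 (really?)"
haystack = "They say 1+1=2 (really?) is obvious."

# Escaped: +, ( ) and ? are treated as ordinary characters.
literal = re.search(re.escape(needle), haystack)

# Unescaped: the same text is read as regex syntax and no longer matches itself.
unescaped = re.search(needle, haystack)

print(literal)
print(unescaped)
```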

<hr>

##### re.purge()
`re.purge()` - Clear the regular expression cache.

[Why should we use re.purge](https://stackoverflow.com/questions/54773313/why-should-we-use-re-purge-in-python-regular-expression)

<hr>

In [None]:
try:
    print(re.search('(t', "Text"))
except Exception as e:
    print(e)

In [None]:
pattern = re.compile(r"(?P<Group>g\w+)(?P<of>\sof)", re.IGNORECASE)
my_string = """
Returns one or more subgroups of the match. 
If there is a single argument, the result is a single string; 
if there are multiple arguments, the result is a tuple with one item per argument. 
Without arguments, group1 defaults to zero (the whole match is returned). 
If a groupN argument is zero, the corresponding return value is the entire matching string; 
if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. 
If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. 
If a group is contained in a part of the pattern that did not match, the corresponding result is None. 
If a group is contained in a part of the pattern that matched multiple times, the last match is returned."""
print(pattern)

In [None]:
print(pattern.search(my_string))
print(pattern.search(my_string, 20))
print(pattern.search(my_string, 9, 15))

In [None]:
print(pattern.match(my_string))
print(pattern.match(my_string, 20))
print(pattern.match(my_string, 24))
print(pattern.match(my_string, 24, 26))

In [None]:
print(pattern.flags)
print(pattern.groups)
print(pattern.groupindex)
print(pattern.pattern)

In [None]:
match = pattern.search(my_string)
print(match.expand(r"matched term - \2"))
print(match.groups())
print(match)
print(match.group())
print(match.group(0))
print(match.group(1))
print(match.group(2))

In [None]:
match[0],match[1],match[2]

In [None]:
match.groupdict()

In [None]:
print(match.start(), match.end())
print(match.start(1), match.end(1))
print(match.start(2), match.end(2))

In [None]:
print(match.span())
print(match.span(1))
print(match.span(2))

In [None]:
match1 = pattern.match(my_string, 24)
match1.pos

In [None]:
match.lastindex

In a regex, groups are captured using `()`. In Python's `re`, `lastindex` holds the integer index of the last matched capturing group. Since two groups were matched in `match`, `lastindex` is `2`. [Source](https://stackoverflow.com/questions/22489243/re-in-python-lastindex-attribute)
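
A short sketch of how `lastindex` behaves with nested, optional and missing groups. The patterns are chosen only for illustration.

```python
import re

side_by_side = re.match(r"(a)(b)", "ab").lastindex      # two groups in a row
nested = re.match(r"((a)(b))", "ab").lastindex          # outer group closes last
optional_unused = re.match(r"(a)(b)?", "a").lastindex   # second group did not match
no_groups = re.match(r"ab", "ab").lastindex             # no capturing groups at all

print(side_by_side, nested, optional_unused, no_groups)
```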

In [None]:
match.lastgroup

<hr>

#### Flags
##### re.A / re.ASCII
Makes `\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s` and `\S` perform ASCII-only matching instead of full Unicode matching.


In [None]:
print(re.findall(r"\w+","ŵ something, some word"))
print(re.findall(r"\w+","ŵ something, some word", re.A))

##### re.I / re.IGNORECASE
Performs case-insensitive matching, so `[a-z]` also matches uppercase letters.

In [None]:
print(re.findall(r"[a-z]+","Hello world"))
print(re.findall(r"[a-z]+","Hello world", re.I))

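##### re.L / re.LOCALE
Makes `\w`, `\W`, `\b`, `\B` and case-insensitive matching depend on the current locale. It can be used only with bytes patterns, and its use is discouraged.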

In [None]:
# pattern = re.compile(r'类'.encode())
# pattern_str = r'类'.encode()
# t = "PROCESS：类型：关爱积分[NOTIFY]   交易号：2012022900000109   订单号：W12022910079166    交易金额：0.01元    交易状态：true 2012-2-29 10:13:08"

# print(*re.findall(pattern,t.encode()))
# print(*re.findall(pattern_str,t.encode(), re.L))

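##### re.M / re.MULTILINE
Makes `^` match at the start of each line and `$` at the end of each line, not only at the start and end of the string.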

In [None]:
text = """\n1@ ake \\w, \\W, \\b, \\B and case-insensitive matching dependent on the current locale. 
This flag can be used only with bytes patterns. 
The use of this flag is discouraged as the locale mechanism is very unreliable, 
it only handles one “culture” at a time, and it only works with 8-bit locales. 
Unicode matching is already enabled by default in Python 3 for Unicode 
(str) patterns, and 
it is able to handle different locales/languages. Corresponds to the inline flag (?L)."""

In [None]:
pattern = r"^[a-zA-Z]+"
print(re.search(pattern,text))
print(re.search(pattern,text,re.M))

##### re.S / re.DOTALL
Makes `.` match any character, including a newline.

In [None]:
pattern = r"."
print(re.search(pattern,text))
print(re.search(pattern,text,re.DOTALL))

##### re.X / re.VERBOSE
Allows whitespace and `#` comments inside the pattern so it can be laid out readably.

In [None]:
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)

In [None]:
print(re.search(a, "Test 123.23"))
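
Flags can be combined with `|`, or written inline at the start of the pattern. A minimal sketch on an illustrative two-line string:

```python
import re

sample = "first line\nSECOND line"

# Combine flags with the bitwise OR operator.
combined = re.findall(r"^second", sample, re.I | re.M)

# The same flags written inline at the start of the pattern.
inline = re.findall(r"(?im)^second", sample)

# Without any flag, ^ only matches at the very start and case matters.
no_flags = re.findall(r"^second", sample)

print(combined, inline, no_flags)
```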

#### Patterns
<ul>
    <li>. - Matches any character except a newline. If DOTALL is active, it matches newline characters as well.</li>
    <li>^ - Matches the start of the string. In MULTILINE mode it also matches immediately after each newline.</li>
    <li>$ - Matches the end of the string, or just before a newline at the end of the string. In MULTILINE mode it also matches before each newline.</li>
    <li>* - Matches zero or more repetitions of the preceding expression.</li>
    <li>+ - Matches one or more repetitions of the preceding expression.</li>
    <li>? - Matches zero or one repetition of the preceding expression.</li>
    <li>\d - Matches a digit.</li>
    <li>\D - Matches a non-digit.</li>
    <li>\s - Matches a whitespace character.</li>
    <li>\S - Matches a non-whitespace character.</li>
    <li>\w - Matches a word character; in ASCII mode this is [a-zA-Z0-9_].</li>
    <li>\W - Matches a non-word character; in ASCII mode this is [^a-zA-Z0-9_].</li>
    <li>| - (a|b) -> a or b</li>
    <li>{m} - matches exactly m times</li>
    <li>{m,} - matches m or more times</li>
    <li>{m,n} - matches m to n times</li>
    <li>{m,n}? - matches m to n times, but matches as few as possible.</li>
    <li>[] - matches any single character listed inside the square brackets</li>
    <li>() - matches the regex in () and indicates the start and end of a group.</li>
    <li>(?aiLmsux) - one or more letters from the set. It sets the corresponding flags for the whole pattern.</li>
    <li>(?:...) - non-capturing group; whatever it matches cannot be retrieved after the match.</li>
    <li>(?aiLmsux-imsx:...) - letters before the '-' set the corresponding flags and letters after it remove them, for the part of the pattern inside the group only.</li>
    <li>(?P&lt;name&gt;...) - names the group</li>
</ul>
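
Two entries from the list above that often surprise are greedy versus lazy quantifiers and the behaviour of `$`. The sketch below uses made-up `html` and `lines` strings.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, up to the last '>'.
greedy = re.findall(r"<.+>", html)
# Lazy: .+? stops at the first '>' it can.
lazy = re.findall(r"<.+?>", html)

lines = "one\ntwo\n"

# $ matches at the end of the string, or just before a trailing newline.
end_default = re.findall(r"\w+$", lines)
# In MULTILINE mode $ also matches before every newline.
end_multiline = re.findall(r"\w+$", lines, re.M)

print(greedy, lazy, end_default, end_multiline, sep="\n")
```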
    
        

In [None]:
text = """\n1@ ake \\w, \\W, \\b, \\B and case-insensitive matching dependent on the current locale. 
This flag can be used only with bytes patterns. 
The use of this flag is discouraged as the locale mechanism is very unreliable, 
it only handles one “culture” at a time, and it only works with 8-bit locales. 
 Unicode matching is already enabled by default in Python 3 for Unicode 
(str) patterns, and 
it is able to handle different locales/languages. Corresponds to the inline flag (?L)."""
t1 = "Hello World"

In [None]:
print(re.match('.',text))
print(re.match('.',t1))

In [None]:
print(re.findall(r"^\w",t1))

In [None]:
print(re.findall(r"\w$",t1))

In [None]:
print(re.findall(r"\w*",t1))

In [None]:
print(re.findall(r"\w+",t1))

In [None]:
print(re.findall(r"\w?",t1))

In [None]:
print(re.findall(r"\w*?",t1))

In [None]:
print(re.findall(r"\w+?",t1))

In [None]:
print(re.findall(r"\w{2}",t1))

In [None]:
print(re.findall(r"\w{2,4}",t1))

In [None]:
print(re.findall(r"\w{,3}",t1))

In [None]:
print(re.findall(r"\w{2,4}?",t1))

In [None]:
print(re.findall("[c-i]+",t1))

In [None]:
print(re.findall("e|r",t1))

In [None]:
print(re.findall("(ll)",t1))

In [None]:
print(re.findall(r"(?P<name>\w+)",t1))

In [None]:
words = ["foobar","FOObar","fooBAR"]

for word in words:
    print(re.findall("(?i:foo)bar",word))
print()
print(re.search(r"(?i:T)h\w+",text))
print(re.search(r"Th\w+",text))

In [None]:
z = re.search(r"([a-z])([0-5])","Sample string. test123")
print(z.groups())
print(z)

In [None]:
z = re.search(r"(?:[a-z])([0-5])","Sample string. test123")
print(z.groups())
print(z)

In [None]:
words = ["foobar","FOObar","fooBARèÑ"]

for word in words:
    print(re.findall("(?i:foo)bar",word))
print()
print(re.search(r"(?u:[A-Z])\w*","fooBARèÑ"))
print(re.search(r"[A-Z]\w*","fooBARèÑ"))
print(re.search(r"(?i-s:[A-Z])\w*","fooBARèÑ"))
print(re.search(r"[A-Z](?a:\w*)","fooBARèÑ"))
print(re.search(r"[A-Z](?-i:\w*)","fooBARèÑ"))

In [None]:
z = (re.search(r"(?P<name>\w+)\w+(?P=name)","test a123"))
print(z.groupdict())
print(z)

In [None]:
z = (re.search(r"(?P<name>\w+) \w+(?P=name)","Hello world"))
print(z.groupdict())
print(z)

#### (?#...)
Comment - contents are ignored

In [None]:
z = (re.search(r"(?#some comment\w+) \w+","Hello world"))
print(z.groupdict())
print(z)

#### (?=...)
Positive lookahead - matches if `...` matches next, without consuming any of the string.

In [None]:
z = (re.finditer("Hello(?= world)","Hello world. Hello"))
print(*z,sep="\t")
z = (re.finditer("Hello","Hello world. Hello"))
print(*z,sep="\t")

#### (?!...)
Negative lookahead - matches if `...` does not match next.

In [None]:
z = (re.finditer("Hello(?! world)","Hello world. Hello123"))
print(*z,sep="\t")

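#### (?<=...)
Positive lookbehind - matches if the current position is preceded by a match for `...`. The negative form `(?<!...)` matches if it is not.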

In [None]:
m = re.findall('(?<=abc)def', 'abcdef')
print(m)

In [None]:
m = re.findall(r'(?<=-)\w+', 'python-3.8')
print(m)
m = re.findall(r'(?<=-)\w+', 'p-ython-3.8')
print(m)

In [None]:
m = re.findall(r'(?<!-)\w+', 'p-ython-3.8')
print(m)
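
The four lookaround forms can be compared on one string. This is only a sketch; the `prices` string and the result names are invented here.

```python
import re

prices = "10 USD, 20 EUR, $30"

# Lookahead: digits that are followed by " USD".
followed_by_usd = re.findall(r"\d+(?= USD)", prices)
# Negative lookahead: whole numbers not followed by " USD".
not_followed_by_usd = re.findall(r"\b\d+\b(?! USD)", prices)
# Lookbehind: digits that are preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", prices)
# Negative lookbehind: whole numbers not preceded by a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)

print(followed_by_usd, not_followed_by_usd, after_dollar, not_after_dollar)
```

The `\b` anchors matter in the negative forms: without them the engine can backtrack and match part of a number instead.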

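#### (?(id/name)yes-pattern|no-pattern)
Conditional - tries `yes-pattern` if the group with the given id or name matched, and the optional `no-pattern` otherwise.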

In [None]:
p = r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|'z')"
print(re.findall(p,"<user@host.com>"))

In [None]:
p = r"(\w)?(\w+)(?(1).|$)"
s = "hello world"
print(re.findall(p,s))
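
As a sketch of the conditional in action, the pattern below (a variant of the e-mail example above that uses `$` as the no-pattern) accepts an address with both angle brackets or with none, but not with only one. The `candidates` list is made up for illustration.

```python
import re

# If group 1 ("<") matched, require ">"; otherwise require the end of the string.
email = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

candidates = ["<user@host.com>", "user@host.com", "<user@host.com", "user@host.com>"]
results = {c: bool(email.match(c)) for c in candidates}

for candidate, ok in results.items():
    print(candidate, ok)
```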

In [None]:
s1 = "aaaaaa bbb cc "
print(re.findall(r"\w{3} ",s1))
print(re.findall(r"\w{3,5} ",s1))
print(re.findall(r"\w{,3} ",s1))

In [None]:
print(re.findall(r"\w{3,5}?",s1))
print(re.findall(r"\w{3,5}",s1))

In [None]:
p = r"(.+) \1"
s = "the the"
print(re.findall(p,s))
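
A backreference such as `\1` is handy for spotting accidentally doubled words. A minimal sketch on an invented `sentence`:

```python
import re

sentence = "This is is a test of the the backreference."

# \1 must repeat exactly what group 1 captured; \b keeps matches to whole words.
doubled = re.findall(r"\b(\w+) \1\b", sentence)

# Replacing the whole match with group 1 removes the duplicate.
fixed = re.sub(r"\b(\w+) \1\b", r"\1", sentence)

print(doubled)
print(fixed)
```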

In [None]:
p = r"\w\A"
s = "the the"
print(re.search(p,s))

In [None]:
p = r"\bH"
s = "Hello world"
print(re.findall(p,s))

In [None]:
p1 = r"h\B"
p2 = r"e\B"
s = "the the"
print(re.findall(p1,s))
print(re.findall(p2,s))

For reference click [here](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean/22944075#22944075)