# Regular Expressions

It's pattern matching!

Source: [Complete Natural Language Processing (NLP) Tutorial in Python! (with examples)](https://www.youtube.com/watch?v=M7SWr5xObkA)

Cheat Sheet: [Here](https://cheatography.com/davechild/cheat-sheets/regular-expressions/)

In [80]:
import re

In [81]:
start = 'ab'; end = 'cd'
fillers = [' ', '12', '#2']
prefixes = ['1','','X']
postfixes = ['','3','']

Two ways of searching, functional and object-oriented (OOP):
1. re.search(pattern, string)
2. pat = re.compile(pattern); pat.search(string)
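
A minimal sketch of the two styles side by side, using one of the test strings that appears later in these notes. The variable names m1 and m2 are made up for illustration; both calls should return an equivalent match.

```python
import re

text = 'Xab#2cd'

# Functional style: the pattern string is passed on every call.
m1 = re.search('ab.*cd', text)

# Object style: compile once, then reuse the pattern object.
pat = re.compile('ab.*cd')
m2 = pat.search(text)

# Both calls find the same substring at the same position.
print(m1.group(), m1.span())
print(m2.group(), m2.span())
```

The compiled form is mostly a convenience when the same pattern is applied to many strings, as in the loops below.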

* '.*' - find everything between: '.' matches any character except a newline, '*' repeats it zero or more times

In [82]:
pattern = start + '.*' + end
print(f'{pattern = }')
for pre, fill, post in zip(prefixes, fillers, postfixes):
    txt = pre + start + fill + end + post
    print(f'{txt = }; search:', re.search(pattern, txt))

pattern = 'ab.*cd'
txt = '1ab cd'; search: <re.Match object; span=(1, 6), match='ab cd'>
txt = 'ab12cd3'; search: <re.Match object; span=(0, 6), match='ab12cd'>
txt = 'Xab#2cd'; search: <re.Match object; span=(1, 7), match='ab#2cd'>


* [^abc] - negated character class: matches one character that is not a, b or c. Here [^\s] requires the character right after start to be non-whitespace

In [96]:
pattern = re.compile(start + r'[^\s]' + '.*' + end)  # raw string avoids the invalid-escape warning for \s
print(f'{pattern = }')
for pre, fill, post in zip(prefixes, fillers, postfixes):
    txt = pre + start + fill + end + post
    print(f'{txt = }; search:', pattern.search(txt))

pattern = re.compile('ab[^\\s].*cd')
txt = '1ab cd'; search: None
txt = 'ab12cd3'; search: <re.Match object; span=(0, 6), match='ab12cd'>
txt = 'Xab#2cd'; search: <re.Match object; span=(1, 7), match='ab#2cd'>
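
To make the scope of the negated class explicit, here is a small sketch with two illustrative strings that are not part of the loop above. It suggests that [^\s] only constrains the single character after 'ab'; whitespace later in the string is still consumed by '.*'.

```python
import re

# [^\s] consumes exactly one character that is not whitespace (same as \S).
pat = re.compile(r'ab[^\s].*cd')

no_match = pat.search('ab cd')    # the character right after 'ab' is a space
match = pat.search('abx cd')      # 'x' satisfies [^\s]; '.*' then takes the space

print(no_match)
print(match)
```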


* '^' - the match must begin at the start of the string; '$' - the match must finish at the end of the string

In [97]:
pattern = re.compile('^' + start + '.*' + end)
print(f'{pattern = }')
for pre, fill, post in zip(prefixes, fillers, postfixes):
    txt = pre + start + fill + end + post
    print(f'{txt = }; search:', pattern.search(txt))

pattern = re.compile('^ab.*cd')
txt = '1ab cd'; search: None
txt = 'ab12cd3'; search: <re.Match object; span=(0, 6), match='ab12cd'>
txt = 'Xab#2cd'; search: None
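
The cell above only exercises '^'. A short sketch of '$', alone and combined with '^', on the same test strings; the fourth string 'ab12cd' is added here purely for illustration.

```python
import re

# The first three strings come from the loop above; the last is an added example.
texts = ['1ab cd', 'ab12cd3', 'Xab#2cd', 'ab12cd']

ends_with = re.compile(r'ab.*cd$')   # match must finish at the end of the string
whole = re.compile(r'^ab.*cd$')      # match must span the whole string

ends = [bool(ends_with.search(t)) for t in texts]
wholes = [bool(whole.search(t)) for t in texts]

print(ends)    # [True, False, True, True]
print(wholes)  # [False, False, False, True]
```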


* re.match() matches only at the beginning of the string, while re.search() scans the whole string for the first match

In [107]:
pattern = re.compile(start + r'[^\s]' + '.*' + end)  # raw string avoids the invalid-escape warning for \s
print(f'{pattern = }')
for pre, fill, post in zip(prefixes, fillers, postfixes):
    txt = pre + start + fill + end + post
    print(f'{txt = }; match:', pattern.match(txt))
    print(f'{txt = }; search:', pattern.search(txt))

pattern = re.compile('ab[^\\s].*cd')
txt = '1ab cd'; match: None
txt = '1ab cd'; search: None
txt = 'ab12cd3'; match: <re.Match object; span=(0, 6), match='ab12cd'>
txt = 'ab12cd3'; search: <re.Match object; span=(0, 6), match='ab12cd'>
txt = 'Xab#2cd'; match: None
txt = 'Xab#2cd'; search: <re.Match object; span=(1, 7), match='ab#2cd'>


* '|' - or matching (alternation): matches any one of the alternatives

In [118]:
pattern = re.compile(r"read|story|car")
sentences = ['I like to read a book', 'Story is about a little car', 'Protagonist has a long history']
ml = max([len(x) for x in sentences])
for sentence in sentences:
    print(f'{sentence:<{ml}}; match:', pattern.search(sentence))

I like to read a book         ; match: <re.Match object; span=(10, 14), match='read'>
Story is about a little car   ; match: <re.Match object; span=(24, 27), match='car'>
Protagonist has a long history; match: <re.Match object; span=(25, 30), match='story'>


Results (by sentence):
1. read matched;
2. Story did not match story (the search is case-sensitive), so the first match is car;
3. story matched inside history.

* '\b' - word boundary: wrap each alternative so that only whole words match

In [124]:
pattern = re.compile(r"\bread\b|\bstory\b|\bcar\b")
for sentence in sentences:
    print(f'{sentence:<{ml}}; match:', pattern.search(sentence))


I like to read a book         ; match: <re.Match object; span=(10, 14), match='read'>
Story is about a little car   ; match: <re.Match object; span=(24, 27), match='car'>
Protagonist has a long history; match: None
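
One detail worth checking when using '\b': the pattern above is written as a raw string (r"..."). A minimal sketch of what happens without the r prefix, using an illustrative sentence that is not part of the list above.

```python
import re

# Raw string: \b reaches the regex engine as a word-boundary assertion.
raw = re.search(r'\bstory\b', 'a short story')

# Plain string: Python first turns '\b' into a backspace character (\x08),
# so the regex looks for literal backspaces and finds none.
plain = re.search('\bstory\b', 'a short story')

print(raw)
print(plain)
```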


* add the case-insensitivity flag re.IGNORECASE and group the alternatives using '()', or add the inline pattern modifier '(?i)'

In [133]:
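# Option 1: pass the flag to re.compile
pattern = re.compile(r"\b(read|story|car)\b", re.IGNORECASE)
# Option 2: equivalent inline modifier (this assignment replaces option 1)
pattern = re.compile(r"(?i)\b(read|story|car)\b")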
for sentence in sentences:
    print(f'{sentence:<{ml}}; match:', pattern.search(sentence))

I like to read a book         ; match: <re.Match object; span=(10, 14), match='read'>
Story is about a little car   ; match: <re.Match object; span=(0, 5), match='Story'>
Protagonist has a long history; match: None
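
A short sketch showing that the two spellings behave the same on these sentences, and that the parentheses also capture the matched word, which can be read back with .group(1). The names flag_pat, inline_pat and first_word are hypothetical helpers introduced for this example.

```python
import re

sentences = ['I like to read a book',
             'Story is about a little car',
             'Protagonist has a long history']

flag_pat = re.compile(r"\b(read|story|car)\b", re.IGNORECASE)
inline_pat = re.compile(r"(?i)\b(read|story|car)\b")

def first_word(pat, sentence):
    """Return the text captured by group 1, or None when nothing matches."""
    m = pat.search(sentence)
    return m.group(1) if m else None

by_flag = [first_word(flag_pat, s) for s in sentences]
by_inline = [first_word(inline_pat, s) for s in sentences]

print(by_flag)    # ['read', 'Story', None]
print(by_inline)  # ['read', 'Story', None]
```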

