![sot.png](attachment:sot.png)

# Regex 
Regular expressions (regex) are a powerful tool for parsing many kinds of text, refactoring existing text, or generating new text in combination with list comprehensions.
They are invaluable to the aspiring NLP worker for standardizing language and working with textual data.

In [19]:
import re

# Both display helpers live in IPython.display; importing HTML from IPython.core.display is deprecated
from IPython.display import HTML, Image

# Finding matched cases of words
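Note the r in front of our regex search term. It makes the pattern a raw string, so Python leaves the backslashes alone and sequences like \b reach the regex engine as written instead of being read as string escapes (in a plain string, "\b" is a backspace character).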
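To make the effect of the r prefix concrete, here is a minimal sketch comparing a plain and a raw pattern. The variable names are only illustrative and are not used elsewhere in the notebook.

```python
import re

sentence = "The grand, grey Tabby pounced upon the still-struck mouse."

# Without r, Python turns "\b" into a single backspace character
plain_length = len("\b")
# With r, the backslash and the b are kept as two separate characters
raw_length = len(r"\b")
print(plain_length, raw_length)  # 1 2

# The plain pattern looks for literal backspaces around "grand", so it finds nothing
plain_match = re.search("\bgrand\b", sentence)
# The raw pattern passes \b to the regex engine as a word boundary
raw_match = re.search(r"\bgrand\b", sentence)

print(plain_match)        # None
print(raw_match.group())  # grand
```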

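## Finding words separated by spaces vs expressions not separated by space 
A word boundary in regex is the position between a word character (a letter, digit, or underscore) and anything else, such as a space, punctuation, or the start or end of the string.
The re library has a few switches that can help us here.  

\b[expression goes here]\b will return matches that have a word boundary on each side, i.e. whole words    

\B[expression goes here]\B will return matches that have no boundary on either side, i.e. text inside a longer word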

In [13]:
sentence = "The grand, grey Tabby pounced upon the still-struck mouse."

# "grand" stands alone as a word (the comma after it counts as a boundary), so this matches
boundary = re.search(r"\bgrand\b", sentence)

# "ran" never appears as a whole word, so re.search returns None
broken_search = re.search(r"\bran\b", sentence)

# \B requires no boundary on either side: this finds the "ran" inside "grand"
no_boundary = re.search(r"\Bran\B", sentence)

# .group() returns the matched text
print("Search for grand with spaces:",  boundary.group())
print("Search for ran within words, ie no boundary:", no_boundary.group())
print("Example of failed search return type:", broken_search)

Search for grand with spaces: grand
Search for ran within words, ie no boundary: ran
Example of failed search return type: None
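Because a failed search returns None, calling a match method on the result raises an AttributeError. One way to guard against that is a small wrapper such as the one sketched below. The helper name matched_text is hypothetical and not part of the re library.

```python
import re

def matched_text(pattern, text):
    """Return the matched substring, or None when the pattern is not found."""
    match = re.search(pattern, text)
    return match.group() if match else None

sentence = "The grand, grey Tabby pounced upon the still-struck mouse."

print(matched_text(r"\bgrand\b", sentence))  # grand
print(matched_text(r"\bran\b", sentence))    # None
print(matched_text(r"\Bran\B", sentence))    # ran
# A hyphen is not a word character, so it also counts as a boundary
print(matched_text(r"\bstill\b", sentence))  # still
```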


# Getting slightly more lazy
We might find ourselves in situations where we do not wish to transform the text to lower or uppercase, and need a way to search for words with varying capitalization. 

We are in luck! 
Regex allows the use of disjunction in its expressions.
We can specify this disjunction with square brackets [ ]: the pattern matches any one of the characters listed inside them.


In the example below, both My and my are the same word. We can find both of these with regex.

In [18]:
sentence = "My mother told my father that he was drinking far too much milk."

disjunction_search = re.findall(r"[mM]y", sentence)
print("Search for both forms of my:",  disjunction_search)

Search for both forms of my: ['My', 'my']
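One caveat is that a bracketed disjunction on its own also matches inside longer words. The sketch below combines it with the word boundaries from the previous section. The sentence is made up for illustration and is not from the original notebook.

```python
import re

# Hypothetical sentence containing "My"/"my" inside other words
tricky = "My mother told my father that Myra's army was drinking milk."

# Matches every "My" or "my", including the ones inside "Myra" and "army"
loose = re.findall(r"[mM]y", tricky)
# Adding \b on both sides keeps only the standalone word
strict = re.findall(r"\b[mM]y\b", tricky)

print(loose)   # ['My', 'my', 'My', 'my']
print(strict)  # ['My', 'my']
```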


# Get your apathetic on
We can make life even lazier by searching for ranges of characters instead of listing each one.
Inside square brackets, we specify a range of characters or digits with -, for example [a-z] or [0-9].
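As a quick illustration of ranges, the sketch below pulls digits, lowercase letters, and uppercase letters out of a short made-up string. Each range matches a single character at a time.

```python
import re

# Hypothetical sample text, chosen only to mix digits and letters
text = "Room 42B, floor 7"

digits = re.findall(r"[0-9]", text)
lowercase = re.findall(r"[a-z]", text)
uppercase = re.findall(r"[A-Z]", text)

print(digits)     # ['4', '2', '7']
print(lowercase)  # ['o', 'o', 'm', 'f', 'l', 'o', 'o', 'r']
print(uppercase)  # ['R', 'B']
```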

![Image of_soup_cat](https://i.pinimg.com/236x/89/af/62/89af6202422aac983f690a8eaf565cda.jpg)

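We can further increase our levels of being lazy by using something known as the Kleene star, or *. This essentially says 
"zero or more of the char/expression to the left". Its sibling, the plus sign, says "one or more". Both go outside the square brackets; inside them the star is only a literal character.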

In [87]:
sentence = """3429342938742384723487293749273 2394729347
293472 93472 293497 2937490 7239472942 39874 hi 2348230482034829304 23908420-38420384 230482034802834023482 283409 82034 hi"""

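# [a-z] matches one lowercase letter at a time, so each letter comes back separately.
# A quantifier placed after the closing bracket is needed to return whole words.
find_hi = re.findall(r"[a-z]", sentence)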

In [88]:
print("Return value of our specified range:", find_hi)

Return value of our specified range: ['h', 'i', 'h', 'i']
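To get whole words or whole numbers rather than single characters, the quantifier goes after the closing bracket. The following sketch uses a shortened, made-up version of the sentence above and contrasts the star with the plus sign.

```python
import re

# Shortened, hypothetical version of the digit-heavy sentence
short_sentence = "293472 93472 hi 2348230482 23908420-38420384 hi"

# One or more lowercase letters in a row
words = re.findall(r"[a-z]+", short_sentence)
# One or more digits in a row
numbers = re.findall(r"[0-9]+", short_sentence)
print(words)    # ['hi', 'hi']
print(numbers)  # ['293472', '93472', '2348230482', '23908420', '38420384']

# The star allows zero occurrences of the preceding "s", the plus needs at least one
star = re.findall(r"his*", "hi his hiss")
plus = re.findall(r"his+", "hi his hiss")
print(star)  # ['hi', 'his', 'hiss']
print(plus)  # ['his', 'hiss']
```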


# What if you have multiple written dialects in your documents or multiple verb tenses? 
Some of you might call this mush mouth, some of you may call it the South, I call it people.

You can use the period character, also known as the wildcard, to handle potential variations in spelling or misspelled words. It matches any single character except a newline.

In [90]:
dia = "begin, beg’n, begun"

re.findall(r"beg.n", dia)

['begin', 'beg’n', 'begun']
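The wildcard is deliberately permissive, which can let in more than intended. The sketch below extends the example string with a few made-up variants and shows how a bracketed set narrows the match to chosen spellings.

```python
import re

# Extended, hypothetical version of the string above
variants = "begin, beg’n, begun, began, beg n, beg9n"

# The period accepts any single character, including a space or a digit
wildcard = re.findall(r"beg.n", variants)
# A bracketed set accepts only the characters we list
narrow = re.findall(r"beg[iu’]n", variants)

print(wildcard)  # ['begin', 'beg’n', 'begun', 'began', 'beg n', 'beg9n']
print(narrow)    # ['begin', 'beg’n', 'begun']
```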

# AND OR
Our cat is ready to start a factory!

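We can use pipes, or |, to express that we want to search for cat or mouse 

We can use ( ) to group part of a pattern, so that we can search for cat and whatever follows it, such as cat followed by s, walk, or burglar. Note that findall returns a tuple of the groups for each match when the pattern contains more than one group, and that the lone cat in the sentence below is skipped because it has none of the listed endings.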

<img src="https://66.media.tumblr.com/02f985e9476f5e6afae6f65d205f0c5a/tumblr_pf246kyvK31sulisxo1_1280.png" alt="factory cat" style="width: 450px;">

In [177]:
sentence = "My cats hate the single cat that stole the spotlight on the catwalk. They say he was a catburglar"

re.findall(r"(cat(s|walk|burglar))", sentence)

[('cats', 's'), ('catwalk', 'walk'), ('catburglar', 'burglar')]
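If the tuples are not wanted, one option is a non-capturing group, written (?: ), which groups without capturing, so findall returns the full matches as plain strings. The sketch below also shows a simple pipe search and how to make the ending optional so that the lone cat is included. The short cat and mouse sentence is made up for illustration.

```python
import re

sentence = "My cats hate the single cat that stole the spotlight on the catwalk. They say he was a catburglar"

# A plain alternation: cat or mouse
either = re.findall(r"cat|mouse", "The cat chased the mouse")
print(either)  # ['cat', 'mouse']

# (?: ) groups the endings without capturing them
full_words = re.findall(r"cat(?:s|walk|burglar)", sentence)
print(full_words)  # ['cats', 'catwalk', 'catburglar']

# A trailing ? makes the ending optional, so plain "cat" matches too
with_plain_cat = re.findall(r"cat(?:s|walk|burglar)?", sentence)
print(with_plain_cat)  # ['cats', 'cat', 'catwalk', 'catburglar']
```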