# Classical NLP - regex
* Notebook by Adam Lang
* Date: 6/26/2024
* Classical NLP techniques for regex in Python.

## re module in Python
* Typical workflow:
1. import re
2. pattern creation
3. compile pattern (optional depending on workflow and goal)
4. Match the pattern using `re.<function>`
   * `findall()` --> returns a list of all matching strings
   * `finditer()` --> returns an iterator of match objects
   * `match()` --> matches only at the beginning of the string --> at most 1 match
   * `search()` --> returns the first match at ANY location in the string
   * ...etc....

5. There are also methods that operate on a string
  * `split()` - splits the string wherever the pattern matches
  * `sub()` - replaces every match of the pattern with a replacement string
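The workflow above can be sketched end to end as follows. This is a minimal illustration; the sample text and variable names are assumptions and not part of the notebook.

```python
import re

# Hypothetical sample text, used only for illustration
text = "cat bat rat"

# Steps 2-3: create the pattern and (optionally) compile it
pattern = re.compile(r"[cbr]at")

# Step 4: match the pattern
words = pattern.findall(text)                        # list of matched strings
spans = [m.span() for m in pattern.finditer(text)]   # built from match objects
first = pattern.search(text)                         # first match anywhere, or None
at_start = pattern.match("a cat")                    # None: no match at index 0

# Step 5: operate on the string
parts = re.split(r"\s", text)
replaced = pattern.sub("dog", text)

print(words)      # ['cat', 'bat', 'rat']
print(spans)      # [(0, 3), (4, 7), (8, 11)]
print(parts)      # ['cat', 'bat', 'rat']
print(replaced)   # dog dog dog
```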

In [1]:
## import re
import re

In [2]:
# test string
test_string = '123abc456789abc123ABC'

#### `finditer()` - returns an iterator of match objects

In [18]:
# search for abc pattern
pattern = re.compile(r"abc")

matches = pattern.finditer(test_string)

# iterate over
for i in matches:
  print(i)

<re.Match object; span=(3, 6), match='abc'>
<re.Match object; span=(12, 15), match='abc'>


* We got 2 matches back.
* Each match object shows the span (start and end index) of the match and the matched text.
* `re.compile` --> compiles the expression into a pattern object that can be reused later in the same program.
* `re.finditer` --> returns an iterator of match objects rather than a list of strings.
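One practical consequence of `finditer` returning an iterator is that it can only be consumed once, which is why later cells call `finditer` again before every loop. A small sketch (the variable names `first_pass` and `second_pass` are illustrative):

```python
import re

test_string = '123abc456789abc123ABC'
pattern = re.compile(r"abc")

matches = pattern.finditer(test_string)
first_pass = [m.span() for m in matches]    # consumes the iterator
second_pass = [m.span() for m in matches]   # nothing left to yield

print(first_pass)    # [(3, 6), (12, 15)]
print(second_pass)   # []
```

If the matches are needed more than once, wrapping the call in `list(...)` is one simple option.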

In [17]:
## do this without using compile
matches = re.finditer(r"abc", test_string)

for i in matches:
  print(i)

<re.Match object; span=(3, 6), match='abc'>
<re.Match object; span=(12, 15), match='abc'>


### What is "r"
* r marks a "raw string"
  * In a raw string, backslashes are kept "as is", so a \n or \t stays as two literal characters instead of becoming a newline or tab.
  * Without the 'r', Python interprets the escape sequences first, so the regex engine does not receive the exact (raw) pattern.

In [8]:
a = r"\tHello"
b = "\nHello"
c = r"\nHello \n how \n are \t"
print(a)
print(b)
print(c)

\tHello

Hello
\nHello \n how \n are \t
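Why this matters for regex can be shown with `\b`, which is a backspace character in a normal Python string but a word boundary in a regex. The sample string below is an assumption chosen for illustration.

```python
import re

len_plain = len("\n")    # 1: a single newline character
len_raw = len(r"\n")     # 2: a backslash followed by the letter n

# In "\bhey" Python turns \b into a backspace before the regex engine sees it,
# so the pattern looks for a literal backspace followed by 'hey'.
without_raw = re.findall("\bhey", "hey there")
with_raw = re.findall(r"\bhey", "hey there")

print(len_plain, len_raw)   # 1 2
print(without_raw)          # []
print(with_raw)             # ['hey']
```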


## Methods to search for matches

In [16]:
## finditer - returns an iterator of match objects
matches = pattern.finditer(test_string)

for i in matches:
  print(i)

<re.Match object; span=(3, 6), match='abc'>
<re.Match object; span=(12, 15), match='abc'>


In [15]:
## findall - returns a list of the matched strings ONLY (no match objects)
matches = pattern.findall(test_string)

for i in matches:
  print(i)

abc
abc


In [20]:
## match - ONLY beginning of string match
matches = pattern.match(test_string)
print(matches)

None


In [21]:
## now if we change the pattern to find the beginning of original string
pattern = re.compile(r"123")
matches = pattern.match(test_string)


print(matches)

<re.Match object; span=(0, 3), match='123'>


* It returned 1 match at the beginning of the original string.

In [22]:
## search --> returns the first match found at any location in the string
pattern = re.compile(r"abc")
matches = pattern.search(test_string)

print(matches)

<re.Match object; span=(3, 6), match='abc'>
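The difference between these functions can be put side by side. The sketch below also uses `fullmatch()`, which the notebook does not cover but which belongs to the same family in the standard `re` module; since all three return `None` on failure, a guard before calling match methods is usually a good idea.

```python
import re

test_string = '123abc456789abc123ABC'
pattern = re.compile(r"abc")

at_start = pattern.match(test_string)       # None: 'abc' is not at index 0
anywhere = pattern.search(test_string)      # first match only
whole = pattern.fullmatch(test_string)      # None: pattern must cover the whole string
whole_ok = pattern.fullmatch("abc")         # matches, the string is exactly 'abc'

# Guard against None before using match methods
m = pattern.search("xyz")
result = m.group() if m else "no match"

print(at_start)          # None
print(anywhere.span())   # (3, 6)
print(whole)             # None
print(whole_ok.span())   # (0, 3)
print(result)            # no match
```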


## Methods on a match object
* group
* start
* end
* span

In [23]:
import re

test_string = '123abc456789abc123ABC'

`span()` method

In [27]:
## span method
pattern = re.compile(r"abc")
matches = pattern.finditer(test_string)

# loop through this
for m in matches:
  print(m.span()) ## span() returns a (start, end) tuple for each match

(3, 6)
(12, 15)


`start()` and `end()` methods

In [28]:
## span start and end
matches = pattern.finditer(test_string)

for m in matches:
  print(m.span(), m.start(), m.end())

(3, 6) 3 6
(12, 15) 12 15


`group()` method
* Pass an integer to select a group by index; `group(0)` (the default) returns the whole match.

In [30]:
matches = pattern.finditer(test_string)

for m in matches:
  print(m.group(0))

abc
abc


## Meta characters
* These are special characters: . ^ $ * + ? { } [ ] \ | ( )

`.` any character (except newline character)

`^` Starts with ("^hello")

`$` Ends with ("world$")

`*` Zero or more occurrences ("aix*")

`+` One or more occurrences ("aix+")

`?` Zero or one occurrence ("aix?")

`{}` Exactly the specified number of occurrences ("al{2}")

`[]` A set of characters "[a-m]"

`\` Special sequence (or escape special characters) "\d"

`|` Either or "falls|stays"

`()` Capture and group
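The example patterns listed above can be tried directly. The input strings here are hypothetical and chosen only to exercise each metacharacter.

```python
import re

star = re.findall(r"aix*", "ai aix aixx")            # x may appear zero or more times
plus = re.findall(r"aix+", "ai aix aixx")            # x must appear at least once
exact = re.findall(r"al{2}", "al all alll")          # exactly two l characters
char_set = re.findall(r"[a-m]", "zebra")             # single characters in the range a-m
either = re.findall(r"falls|stays", "it falls or stays")
starts = bool(re.search(r"^hello", "hello world"))
ends = bool(re.search(r"world$", "hello world"))

print(star)          # ['ai', 'aix', 'aixx']
print(plus)          # ['aix', 'aixx']
print(exact)         # ['all', 'all']
print(char_set)      # ['e', 'b', 'a']
print(either)        # ['falls', 'stays']
print(starts, ends)  # True True
```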


In [32]:
## looking for the "."
test_string = '123abc456789abc123ABC'


pattern = re.compile(r".")
matches = pattern.finditer(test_string)

for m in matches:
  print(m)

<re.Match object; span=(0, 1), match='1'>
<re.Match object; span=(1, 2), match='2'>
<re.Match object; span=(2, 3), match='3'>
<re.Match object; span=(3, 4), match='a'>
<re.Match object; span=(4, 5), match='b'>
<re.Match object; span=(5, 6), match='c'>
<re.Match object; span=(6, 7), match='4'>
<re.Match object; span=(7, 8), match='5'>
<re.Match object; span=(8, 9), match='6'>
<re.Match object; span=(9, 10), match='7'>
<re.Match object; span=(10, 11), match='8'>
<re.Match object; span=(11, 12), match='9'>
<re.Match object; span=(12, 13), match='a'>
<re.Match object; span=(13, 14), match='b'>
<re.Match object; span=(14, 15), match='c'>
<re.Match object; span=(15, 16), match='1'>
<re.Match object; span=(16, 17), match='2'>
<re.Match object; span=(17, 18), match='3'>
<re.Match object; span=(18, 19), match='A'>
<re.Match object; span=(19, 20), match='B'>
<re.Match object; span=(20, 21), match='C'>


This is a problem! An unescaped "." matches every character except a newline, not just a literal period.

In [33]:
## let's add a literal "." to the end of the string
test_string = '123abc456789abc123ABC.'

# we need to escape the . with a backslash to match a literal period
pattern = re.compile(r"\.")
matches = pattern.finditer(test_string)

for m in matches:
  print(m)

<re.Match object; span=(21, 22), match='.'>


Aha! Now we have found the "." by itself.

In [34]:
## now what about the "^" -- this is the start of the line
pattern = re.compile(r"^123") # finds the 123 at the start of our string
matches = pattern.finditer(test_string)

for m in matches:
  print(m)

<re.Match object; span=(0, 3), match='123'>


In [40]:
## now match the end of the string using the "$" (the "." is escaped so it matches a literal period)
pattern = re.compile(r"[A-Z]+\.$")
matches = pattern.finditer(test_string)

for m in matches:
  print(m)

<re.Match object; span=(18, 22), match='ABC.'>


## More special sequences
* Note: The uppercase sequence matches the opposite of its lowercase counterpart.
* \d : Matches any decimal digit; [0-9]
* \D : Matches any non-digit character; [^0-9]
* \s : Matches any whitespace character (space " ", tab "\t", newline "\n")
* \S : Matches any non-whitespace character
* \w : Matches any alphanumeric (word) character; [a-zA-Z0-9_]
* \W : Matches any non-alphanumeric character; [^a-zA-Z0-9_]
* \b : Matches a word boundary: the position at the beginning or end of a word.
* \B : Matches any position that is NOT a word boundary, e.g. inside a word.

In [41]:
# new test string
test_string = 'hello 123_ heyho hohey'

In [42]:
# pattern
pattern = re.compile(r'\d')
matches = pattern.finditer(test_string)
for m in matches:
  print(m)

<re.Match object; span=(6, 7), match='1'>
<re.Match object; span=(7, 8), match='2'>
<re.Match object; span=(8, 9), match='3'>


In [43]:
## \D now... finds every non-digit character (here: everything except 1, 2, 3)
pattern = re.compile(r"\D")
matches = pattern.finditer(test_string)
for m in matches:
  print(m)

<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='l'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(5, 6), match=' '>
<re.Match object; span=(9, 10), match='_'>
<re.Match object; span=(10, 11), match=' '>
<re.Match object; span=(11, 12), match='h'>
<re.Match object; span=(12, 13), match='e'>
<re.Match object; span=(13, 14), match='y'>
<re.Match object; span=(14, 15), match='h'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(16, 17), match=' '>
<re.Match object; span=(17, 18), match='h'>
<re.Match object; span=(18, 19), match='o'>
<re.Match object; span=(19, 20), match='h'>
<re.Match object; span=(20, 21), match='e'>
<re.Match object; span=(21, 22), match='y'>


In [44]:
## whitespace --> \s
pattern = re.compile(r"\s")
matches = pattern.finditer(test_string)
for m in matches:
  print(m)

<re.Match object; span=(5, 6), match=' '>
<re.Match object; span=(10, 11), match=' '>
<re.Match object; span=(16, 17), match=' '>


In [45]:
## now opposite of whitespace --> \S (any non-whitespace character)
pattern = re.compile(r"\S")
matches = pattern.finditer(test_string)
for m in matches:
  print(m)

<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='l'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(6, 7), match='1'>
<re.Match object; span=(7, 8), match='2'>
<re.Match object; span=(8, 9), match='3'>
<re.Match object; span=(9, 10), match='_'>
<re.Match object; span=(11, 12), match='h'>
<re.Match object; span=(12, 13), match='e'>
<re.Match object; span=(13, 14), match='y'>
<re.Match object; span=(14, 15), match='h'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(17, 18), match='h'>
<re.Match object; span=(18, 19), match='o'>
<re.Match object; span=(19, 20), match='h'>
<re.Match object; span=(20, 21), match='e'>
<re.Match object; span=(21, 22), match='y'>


In [47]:
## \b --> word boundary; here 'hey' must start a word
pattern = re.compile(r'\bhey')
matches = pattern.findall(test_string)
for m in matches:
  print(m)

hey


In [48]:
## \B --> the opposite: NOT a word boundary; here 'hey' must not start a word
pattern = re.compile(r'\Bhey')
matches = pattern.finditer(test_string)
for m in matches:
  print(m)

<re.Match object; span=(19, 22), match='hey'>


This match is the 'hey' inside 'hohey', which does not start at a word boundary.
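Where `\b` is placed decides which side of the word is anchored. The sketch below reuses the same test string; the extra `'hey you'` input is an assumption added to show a whole-word match.

```python
import re

test_string = 'hello 123_ heyho hohey'

starts_word = [m.span() for m in re.finditer(r'\bhey', test_string)]   # 'hey' in 'heyho'
ends_word = [m.span() for m in re.finditer(r'hey\b', test_string)]     # 'hey' in 'hohey'
whole_word = re.findall(r'\bhey\b', test_string)                       # no standalone 'hey'
whole_word_found = re.findall(r'\bhey\b', 'hey you')

print(starts_word)        # [(11, 14)]
print(ends_word)          # [(19, 22)]
print(whole_word)         # []
print(whole_word_found)   # ['hey']
```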

## Sets
* `[ ]` defines a set: it matches any single character listed inside the brackets.
* Ranges are commonly used such as: `[a-z]` or `[A-Z]` or `[0-9]`

In [54]:
test_string = 'hello 123_'

pattern = re.compile(r"[helo]+")
matches = pattern.findall(test_string)

print(matches)

['hello']


In [56]:
# numbers
pattern = re.compile(r"[0-9]")
matches = pattern.findall(test_string)
print(matches)

['1', '2', '3']


In [57]:
## several ranges combined in one set
test_string = 'helloHELLO 123-_'

pattern = re.compile(r"[a-zA-Z0-9]")
matches = pattern.finditer(test_string)
for m in matches:
  print(m)

<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='l'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(5, 6), match='H'>
<re.Match object; span=(6, 7), match='E'>
<re.Match object; span=(7, 8), match='L'>
<re.Match object; span=(8, 9), match='L'>
<re.Match object; span=(9, 10), match='O'>
<re.Match object; span=(11, 12), match='1'>
<re.Match object; span=(12, 13), match='2'>
<re.Match object; span=(13, 14), match='3'>


## Quantifier
* `*` : 0 or more
* `+` : 1 or more
* `?` : 0 or 1 --> optional character
* `{4}` : exact number of repetitions
* `{4,6}` : range of repetitions (min, max)

In [58]:
test_string = 'hello_123'

# zero or more digits: this also yields an empty match at every position without a digit
pattern = re.compile(r'\d*')
matches = pattern.finditer(test_string)
for m in matches:
  print(m)

<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(1, 1), match=''>
<re.Match object; span=(2, 2), match=''>
<re.Match object; span=(3, 3), match=''>
<re.Match object; span=(4, 4), match=''>
<re.Match object; span=(5, 5), match=''>
<re.Match object; span=(6, 9), match='123'>
<re.Match object; span=(9, 9), match=''>


In [59]:
## now looking for 1 or more
pattern = re.compile(r'\d+')
matches = pattern.finditer(test_string)
for m in matches:
  print(m)

<re.Match object; span=(6, 9), match='123'>


In [61]:
## '?' makes the preceding character optional: here the underscore may or may not be present
test_string = 'hello_1_2_3'

pattern = re.compile(r'_?\d')
matches = pattern.finditer(test_string)
for m in matches:
  print(m)

<re.Match object; span=(5, 7), match='_1'>
<re.Match object; span=(7, 9), match='_2'>
<re.Match object; span=(9, 11), match='_3'>


In [64]:
## looking for runs of 1 to 3 digits
pattern = re.compile(r'\d{1,3}')
matches = pattern.finditer(test_string)
for m in matches:
  print(m)

<re.Match object; span=(6, 7), match='1'>
<re.Match object; span=(8, 9), match='2'>
<re.Match object; span=(10, 11), match='3'>
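Range quantifiers are greedy by default: they take as many repetitions as allowed. A short sketch on a hypothetical digit string shows the effect, along with the lazy form `{1,3}?`, which the notebook does not cover and which takes as few repetitions as possible.

```python
import re

digits = "1234567"   # hypothetical input

greedy = re.findall(r"\d{1,3}", digits)    # up to 3 digits per match
lazy = re.findall(r"\d{1,3}?", digits)     # as few digits as possible per match
exact = re.findall(r"\d{4}", digits)       # exactly 4 digits per match

print(greedy)   # ['123', '456', '7']
print(lazy)     # ['1', '2', '3', '4', '5', '6', '7']
print(exact)    # ['1234']
```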


### Now an example using Date strings
* Notice the different date formats.
* This can be done with regex alone, without the Python `datetime` module.

In [67]:
dates = """
hello
01.04.2020

2020.04.01

2020-04-01
2020-05-23
2020-06-11
2020-07-11
2020-08-11

2020/04/02

2020_04_04
2020_04_04
"""

In [68]:
## 4 digits, any character, 2 digits, any character, 2 digits (the "." is unescaped)
pattern = re.compile(r'\d\d\d\d.\d\d.\d\d')
matches = pattern.finditer(dates)
for m in matches:
  print(m)

<re.Match object; span=(19, 29), match='2020.04.01'>
<re.Match object; span=(31, 41), match='2020-04-01'>
<re.Match object; span=(42, 52), match='2020-05-23'>
<re.Match object; span=(53, 63), match='2020-06-11'>
<re.Match object; span=(64, 74), match='2020-07-11'>
<re.Match object; span=(75, 85), match='2020-08-11'>
<re.Match object; span=(87, 97), match='2020/04/02'>
<re.Match object; span=(99, 109), match='2020_04_04'>
<re.Match object; span=(110, 120), match='2020_04_04'>


Summary:
* The unescaped `.` matches any separator, so every year-first date is found, but the day-first date `01.04.2020` is missed.
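The effect of the unescaped `.` is easier to see on a smaller input. The `sample` string below is an assumption made up for this sketch, and the day-first pattern is just one possible way to catch the missed format.

```python
import re

sample = "2020.04.01 2020x04y01 01.04.2020"   # hypothetical input

loose = re.findall(r"\d{4}.\d{2}.\d{2}", sample)        # . matches any character
strict = re.findall(r"\d{4}\.\d{2}\.\d{2}", sample)     # \. matches a literal period
day_first = re.findall(r"\d{2}\.\d{2}\.\d{4}", sample)  # day.month.year layout

print(loose)       # ['2020.04.01', '2020x04y01']
print(strict)      # ['2020.04.01']
print(day_first)   # ['01.04.2020']
```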

In [70]:
## change the pattern to match only "-" as the separator
pattern = re.compile(r'\d\d\d\d-\d\d-\d\d')
matches = pattern.finditer(dates)
for m in matches:
  print(m)

<re.Match object; span=(31, 41), match='2020-04-01'>
<re.Match object; span=(42, 52), match='2020-05-23'>
<re.Match object; span=(53, 63), match='2020-06-11'>
<re.Match object; span=(64, 74), match='2020-07-11'>
<re.Match object; span=(75, 85), match='2020-08-11'>


In [73]:
## To make this more flexible we can use a set []
pattern = re.compile(r'[0-9]{4}[-/.][0-9]{2}[-/.][0-9]{2}')
matches = pattern.finditer(dates)
for m in matches:
  print(m)

<re.Match object; span=(19, 29), match='2020.04.01'>
<re.Match object; span=(31, 41), match='2020-04-01'>
<re.Match object; span=(42, 52), match='2020-05-23'>
<re.Match object; span=(53, 63), match='2020-06-11'>
<re.Match object; span=(64, 74), match='2020-07-11'>
<re.Match object; span=(75, 85), match='2020-08-11'>
<re.Match object; span=(87, 97), match='2020/04/02'>


In [76]:
## now let's say we only want the months May, June and July
pattern = re.compile(r'\d\d\d\d[-/]0[5-7][-/]\d\d')

matches = pattern.finditer(dates)
for m in matches:
  print(m)

<re.Match object; span=(42, 52), match='2020-05-23'>
<re.Match object; span=(53, 63), match='2020-06-11'>
<re.Match object; span=(64, 74), match='2020-07-11'>


In [77]:
## let's simplify this
pattern = re.compile(r'\d{4}[-/]0[5-7][-/]\d{2}') #finds 4 digits, may-july, 2 digits

matches = pattern.finditer(dates)
for m in matches:
  print(m)

<re.Match object; span=(42, 52), match='2020-05-23'>
<re.Match object; span=(53, 63), match='2020-06-11'>
<re.Match object; span=(64, 74), match='2020-07-11'>


## Conditions

In [85]:
## new string
my_string = """
hello world
1223
2020-05-20
Mr Simpson
Mrs Simpson
Mr. Brown
Ms Smith
Mr. T
pythonengineer@gmail.com
Python-engineer@gmx.de
python-engineer123@my-domain.org


"""

In [84]:
## title (Mr, Ms or Mrs), an optional period, one whitespace character, then a name
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s\w+')
matches = pattern.finditer(my_string)
for m in matches:
  print(m)

<re.Match object; span=(29, 39), match='Mr Simpson'>
<re.Match object; span=(40, 51), match='Mrs Simpson'>
<re.Match object; span=(52, 61), match='Mr. Brown'>
<re.Match object; span=(62, 70), match='Ms Smith'>
<re.Match object; span=(71, 76), match='Mr. T'>


### Extracting only the emails from the text `my_string` above

In [101]:
pattern = re.compile(r'([a-zA-Z0-9-]+)@([a-zA-Z-]+)\.([a-zA-Z]+)')

In [102]:
matches = pattern.finditer(my_string)
for m in matches:
  print(m)

<re.Match object; span=(77, 101), match='pythonengineer@gmail.com'>
<re.Match object; span=(102, 124), match='Python-engineer@gmx.de'>
<re.Match object; span=(125, 157), match='python-engineer123@my-domain.org'>


## Grouping
* Grouping with `( )` lets us capture parts of a match separately: `group(0)` is the whole match, and `group(1)`, `group(2)`, ... are the captured groups.

In [103]:
pattern = re.compile(r'([a-zA-Z0-9-]+)@([a-zA-Z-]+)\.([a-zA-Z]+)')

In [106]:
matches = pattern.finditer(my_string)
for m in matches:
  print(m.group(0))
  print(m.group(1))
  print(m.group(2))
  print(m.group(3))

pythonengineer@gmail.com
pythonengineer
gmail
com
Python-engineer@gmx.de
Python-engineer
gmx
de
python-engineer123@my-domain.org
python-engineer123
my-domain
org
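Numbered groups can get hard to read as patterns grow. One option in the standard `re` module is named groups, `(?P<name>...)`. The sketch below applies them to the same email pattern; the group names `user`, `domain` and `tld` are assumptions chosen for illustration.

```python
import re

email = "python-engineer123@my-domain.org"

pattern = re.compile(
    r"(?P<user>[a-zA-Z0-9-]+)@(?P<domain>[a-zA-Z-]+)\.(?P<tld>[a-zA-Z]+)"
)
m = pattern.search(email)

user = m.group("user")     # access a group by name
parts = m.groupdict()      # all named groups as a dict
by_index = m.groups()      # all groups as a tuple, in order

print(user)       # python-engineer123
print(parts)      # {'user': 'python-engineer123', 'domain': 'my-domain', 'tld': 'org'}
print(by_index)   # ('python-engineer123', 'my-domain', 'org')
```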


## Modification
* `split()` --> splits the string at each match and returns a list
* `sub()` --> finds all substrings where the regex matches and replaces them with a different string.

### Split method

In [108]:
# new test string
test_string = '123abc456789abc123ABC'

In [111]:
pattern = re.compile(r'abc')

split_parts = pattern.split(test_string)
print(split_parts) # prints a list

['123', '456789', '123ABC']
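`split()` has a few options worth knowing. The sketch below uses the module-level `re.split` with its `maxsplit` and `flags` parameters and shows what happens when the pattern contains a capturing group; the variable names are illustrative.

```python
import re

test_string = '123abc456789abc123ABC'

limited = re.split(r'abc', test_string, maxsplit=1)        # stop after the first split
keep_sep = re.split(r'(abc)', test_string)                 # captured separators are kept
any_case = re.split(r'abc', test_string, flags=re.IGNORECASE)

print(limited)    # ['123', '456789abc123ABC']
print(keep_sep)   # ['123', 'abc', '456789', 'abc', '123ABC']
print(any_case)   # ['123', '456789', '123', '']
```

The trailing empty string in the last result appears because the string ends with a match (`ABC`).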


### Sub method

In [112]:
test_string = 'hello world, you are the best world'

In [113]:
pattern = re.compile(r'world')
## replace every 'world' with 'planet'
subbed_string = pattern.sub('planet',test_string)
print(subbed_string)

hello planet, you are the best planet


In [120]:
## another sub example, this time with URLs
urls = """
hello
2020-05-20
http://python-engineer.com
https://www.python-engineer.com
http://www.pyeng.net

"""


In [125]:
pattern = re.compile(r'https?://(www\.)?([a-zA-Z-]+)\.[a-zA-Z]+')
matches = pattern.finditer(urls)
for m in matches:
  print(m)

<re.Match object; span=(18, 44), match='http://python-engineer.com'>
<re.Match object; span=(45, 76), match='https://www.python-engineer.com'>
<re.Match object; span=(77, 97), match='http://www.pyeng.net'>


In [126]:
## now replace each matched URL in the string with "hello"


subbed_urls = pattern.sub("hello", urls)
print(subbed_urls)


hello
2020-05-20
hello
hello
hello
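The replacement above swaps out the whole URL. To keep part of the match instead, the replacement string can refer back to a captured group, e.g. `\2` for the second group. A sketch under the same pattern, with the URLs placed on one line for brevity (an assumption of this example):

```python
import re

urls = "http://python-engineer.com https://www.python-engineer.com http://www.pyeng.net"
pattern = re.compile(r'https?://(www\.)?([a-zA-Z-]+)\.[a-zA-Z]+')

# \2 is replaced by whatever group 2 (the domain name) captured
domains = pattern.sub(r'\2', urls)

# count limits how many matches are replaced
first_only = pattern.sub("hello", urls, count=1)

print(domains)      # python-engineer python-engineer pyeng
print(first_only)   # hello https://www.python-engineer.com http://www.pyeng.net
```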




## Compilation flags
* ASCII, A : Makes several escapes like \w, \b, \s and \d match only on ASCII characters.
* DOTALL, S : Makes . match any character, including newlines.
* IGNORECASE, I : Do case-insensitive matches.
* LOCALE, L : Do a locale-aware match.
* MULTILINE, M : Multi-line matching, affecting ^ and $.
* VERBOSE, X (for 'extended') : Enable verbose REs, which can be organized more cleanly and understandably.
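Three of the flags listed above can be sketched briefly. The sample text and the date pattern are assumptions picked for illustration.

```python
import re

text = "first line\nsecond line"   # hypothetical two-line input

# MULTILINE: ^ matches at the start of every line, not only the start of the string
first_only = re.findall(r"^\w+", text)
every_line = re.findall(r"^\w+", text, re.MULTILINE)

# DOTALL: . also matches the newline, so a match can span lines
no_dotall = re.search(r"first.*second", text)
with_dotall = re.search(r"first.*second", text, re.DOTALL)

# VERBOSE: whitespace and # comments inside the pattern are ignored
date_pattern = re.compile(r"""
    \d{4}    # year
    [-/.]    # separator
    \d{2}    # month
    [-/.]    # separator
    \d{2}    # day
""", re.VERBOSE)
found = date_pattern.findall("released 2020-04-01")

print(first_only)    # ['first']
print(every_line)    # ['first', 'second']
print(no_dotall)     # None
print(found)         # ['2020-04-01']
```

Flags can be combined with `|`, for example `re.I | re.M`.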

In [129]:
## IGNORECASE compilation flag is very common
my_string = "Hello World"
pattern = re.compile(r'world', re.I)
matches = pattern.finditer(my_string)
for m in matches:
  print(m)

<re.Match object; span=(6, 11), match='World'>


Summary: We used `re.I` to make the match case-insensitive, so the pattern `world` matched `World`.