The following is based on:
- https://www.w3schools.com/python/python_regex.asp
- https://ieeexplore.ieee.org/document/8952499
- https://github.com/odenipinedo/Python/blob/master/datacamp/introduction%20to%20natural%20language%20processing%20in%20Python.ipynb

# 2. Introduction to regular expressions

## 2.1. Theory

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.
RegEx can be used to check if a string contains the specified search pattern. Regular expressions are a powerful mechanism for solving string-matching problems and are supported by all modern programming languages. However, regexes are hard: they are hard to read, write, validate, search for, and document. They are also hard to master.

<div align="center">
  <img src="images/regex.jpeg" alt="drawing" width="400"/>
</div>

Examples of use cases:
- Parse email addresses
- Web Scraping
- Remove/replace unwanted characters
- ...

<div align="center">
    <img src="https://www.novixys.com/blog/wp-content/uploads/2018/02/regex.png" alt="drawing" width="600"/>
</div>

Python has a built-in package called re, which can be used to work with Regular Expressions.

Import the re module:

In [2]:
import re

The re module offers a set of functions that allows us to search a string for a match:

| Function | Description                                                       |
|----------|-------------------------------------------------------------------|
| findall  | Returns a list containing all matches                             |
| search   | Returns a Match object if there is a match anywhere in the string |
| split    | Returns a list where the string has been split at each match      |
| sub      | Replaces one or many matches with a string                        |

The Match object has properties and methods used to retrieve information about the search, and the result:
- span() returns a tuple containing the start and end positions of the match.
- string returns the string passed into the function.
- group() returns the part of the string where there was a match.
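The Match object only exists when the search succeeds. The following is a minimal sketch (the sample text and variable names are illustrative) showing the three members listed above, and that `re.search` returns `None` when nothing matches, so the result should be checked before use.

```python
import re

txt = "The rain in Spain"

# re.search returns a Match object on success and None on failure,
# so check the result before calling methods on it.
match = re.search(r"ai", txt)
if match is not None:
    print(match.span())    # (5, 7): start and end of the first match
    print(match.group())   # ai: the matched text
    print(match.string)    # The rain in Spain: the original input

no_match = re.search(r"Portugal", txt)
print(no_match)            # None: there is no Match object to inspect
```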

Metacharacters are characters with a special meaning:

| Character | Description | Example |
|---|---|---|
| [] | A set of characters | "[a-m]" |
| \ | Signals a special sequence (can also be used to escape special characters) | "\d" |
| . | Any character (except newline character) | "he..o" |
| $ | Ends with | "planet$" |
| ^ | Starts with | "^hello" |
| * | Zero or more occurrences | "he.*o" |
| + | One or more occurrences | "he.+o" |
| ? | Zero or one occurrences | "he.?o" |
| {} | Exactly the specified number of occurrences | "he.{2}o" |
| \| | Either or | "falls\|stays" |
| () | Capture and group |  |
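As a rough illustration of the table above, the sketch below tries a few of the metacharacters, including the capture group `()`, which has no example in the table. The sample strings and variable names are made up for this illustration.

```python
import re

txt = "hello planet"

# Anchors: ^ ties the pattern to the start, $ to the end of the string.
starts = re.search(r"^hello", txt) is not None
ends = re.search(r"planet$", txt) is not None
print(starts, ends)    # True True

# Quantifier: {2} means exactly two of the preceding element (here, any character).
exact = re.findall(r"he.{2}o", txt)
print(exact)           # ['hello']

# Alternation: | matches either the left or the right alternative.
either = re.findall(r"falls|stays", "The rain falls and the snow stays")
print(either)          # ['falls', 'stays']

# Parentheses group part of a pattern and capture what it matched.
m = re.search(r"(\w+) (\w+)", txt)
print(m.group(0))      # hello planet (the whole match)
print(m.group(1))      # hello (first group)
print(m.group(2))      # planet (second group)
```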

A special sequence is a " \ " followed by one of the characters in the list below, and has a special meaning:

| Character | Description | Example |
|---|---|---|
| \A | Returns a match if the specified characters are at the beginning of the string | "\AThe" |
| \b | Returns a match where the specified characters are at the beginning or at the end of a word (the "r" in the beginning is making sure that the string is being treated as a "raw string") | r"\bain" r"ain\b" |
| \B | Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (the "r" in the beginning is making sure that the string is being treated as a "raw string") | r"\Bain" r"ain\B" |
| \d | Matches any digit character (numbers from 0-9) | "\d" |
| \D | Matches any character that is NOT a digit | "\D" |
| \s | Matches any white space character | "\s" |
| \S | Matches any character that is NOT a white space character | "\S" |
| \w | Matches any word character (letters from a to z and A to Z, digits from 0-9, and the underscore _ character) | "\w" |
| \W | Matches any character that is NOT a word character | "\W" |
| \Z | Returns a match if the specified characters are at the end of the string | "Spain\Z" |
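The special sequences are easier to tell apart when run side by side. This is a small sketch on an invented sample sentence, and it assumes plain ASCII input.

```python
import re

txt = "The rain in Spain costs 42 euros"

# \b is a word boundary: no word in txt starts with "ain", two end with it.
word_start = re.findall(r"\bain", txt)
word_end = re.findall(r"ain\b", txt)
print(word_start, word_end)   # [] ['ain', 'ain']

# \B is the opposite: "ain" must NOT sit at the start of a word.
inside = re.findall(r"\Bain", txt)
print(inside)                 # ['ain', 'ain']

# \d matches one digit at a time; \d+ groups consecutive digits.
single_digits = re.findall(r"\d", txt)
numbers = re.findall(r"\d+", txt)
print(single_digits, numbers) # ['4', '2'] ['42']

# \A and \Z anchor to the very start and very end of the string.
at_start = re.findall(r"\AThe", txt)
at_end = re.findall(r"euros\Z", txt)
print(at_start, at_end)       # ['The'] ['euros']

# \S+ collects runs of non-whitespace characters.
tokens = re.findall(r"\S+", txt)
print(tokens)
```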

A set is a set of characters inside a pair of square brackets [] with a special meaning:

| Set | Description |
|---|---|
| [arn] | Returns a match where one of the specified characters (a, r, or n) is present |
| [a-n] | Returns a match for any lower case character, alphabetically between a and n |
| [^arn] | Returns a match for any character EXCEPT a, r, and n |
| [0123] | Returns a match where any of the specified digits (0, 1, 2, or 3) are present |
| [0-9] | Returns a match for any digit between 0 and 9 |
| [0-5][0-9] | Returns a match for any two-digit number from 00 to 59 |
| [a-zA-Z] | Returns a match for any character alphabetically between a and z, lower case OR upper case |
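The sets in the table can be checked quickly with `re.findall`. The inputs below are arbitrary examples chosen for this sketch.

```python
import re

# [arn] matches a single character that is a, r, or n.
any_of = re.findall(r"[arn]", "rain")
print(any_of)     # ['r', 'a', 'n']

# [^arn] matches a single character that is none of a, r, n.
none_of = re.findall(r"[^arn]", "rain")
print(none_of)    # ['i']

# [a-n] is a range of lower case letters; "S" and "p" fall outside it.
in_range = re.findall(r"[a-n]", "Spain")
print(in_range)   # ['a', 'i', 'n']

# Two sets side by side match two consecutive characters: 00 to 59.
minutes = re.findall(r"[0-5][0-9]", "08:45 and 75")
print(minutes)    # ['08', '45']

# [a-zA-Z] matches any lower case or upper case letter.
letters = re.findall(r"[a-zA-Z]", "R2-D2")
print(letters)    # ['R', 'D']
```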

## 2.2. Examples

Note: It's important to prefix your regex patterns with r, which makes them raw strings, so that backslashes reach the regex engine unchanged. Otherwise Python may interpret sequences such as \b as string escape sequences before the pattern is ever compiled.
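To make the note above concrete, here is a short sketch of what goes wrong without the r prefix. It uses `\b`, which Python reads as the backspace character in a normal string; the sample text is illustrative.

```python
import re

# In a normal string "\b" is one character (backspace);
# in a raw string r"\b" is two characters (backslash + b).
print(len("\b"), len(r"\b"))   # 1 2

txt = "The rain in Spain"

# Normal string: the pattern asks for a backspace followed by "Spain".
without_raw = re.findall("\bSpain", txt)
print(without_raw)             # []

# Raw string: the regex engine sees \b and treats it as a word boundary.
with_raw = re.findall(r"\bSpain", txt)
print(with_raw)                # ['Spain']
```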

In [3]:
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

['ai', 'ai']


In [4]:
txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)

[]


In [5]:
txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 3


In [6]:
txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']


In [7]:
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain


In [8]:
txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())

(12, 17)


In [9]:
txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.string)

The rain in Spain


In [10]:
print(x.group())

Spain


## 2.3. Useful tools

The website https://regex101.com/ can be used to visualize and debug regular expressions, including those written for Python.

This website https://www.autoregex.xyz/ uses artificial intelligence to convert natural language text to regular expressions.

## 2.4. Exercises

Now you'll get a chance to write some regular expressions to match digits, strings and non-alphanumeric characters. 

Practice is the key to mastering RegEx!

In [None]:
import re
my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"__"

# Split my_string on sentence endings and print the result
print(re.__(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"__"
print(re.__(__, my_string))

# Split my_string on spaces and print the result
spaces = r"__"
print(re.__(__, my_string))

# Find all digits in my_string and print the result
digits = r"__"
print(re.__(__, my_string))

In [None]:
# SOLUTION

import re
my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
# (\w+ needs at least one character after the capital, so the one-letter word "I" is skipped)
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))