### Credits:

<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br /> For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />

Reused and modified for internal use at Università Cattolica del Sacro Cuore di Milano, by Deborah Grbac, email deborah.grbac@unicatt.it and Valentina Schiariti, email valentina.schiariti-collaboratore@unicatt.it, released under CC BY License.

This repository is based on **Constellate notebooks**. The original Jupyter notebooks repository was designed by the educators at **ITHAKA's Constellate project**. The project was sunset on July 1, 2025. This repository uses and reuses Constellate notebooks as Open Educational Resources (OER), free for re-use under a Creative Commons CC BY License.
___

# Regular Expressions

This lesson introduces the `re` module for analyzing strings with regular expressions. Students will be able to:
* Create regular expressions
* Create a Regex object with `re.compile()`
* Use the `findall()` and `finditer()` methods to return matches as a list of strings or as match objects
* Return strings of the actual matched text
___

## Introduction

Regular expressions can be used to **locate particular characters or sequences of characters in a string**. 

For example, a regular expression could be written to identify phone numbers, email addresses, or particular names. Far beyond simply matching a known string, regular expressions can be written to find complex patterns in a text. They are often useful when the documents being searched are very long. 

Regular expressions can be used in Python, but also in many other applications such as other programming languages, word processing software (Microsoft Word, Google Docs), or email. 

Crafting the right regular expression can be very difficult, but can often save hours of labor for many menial tasks. When crafting a regular expression, it can be very helpful to use a tool like **[RegExr](https://regexr.com/)** that demonstrates how expressions are being matched on a few sample texts as you type them (we will use it in the first part of the class). 

The tailored expression could then be implemented in a fuller solution with Python.

# Part One: Regular Expression Basics
On the most basic level **each character or set of characters will match itself**.
This means that:

* *x* will match all the *x* in the text;
* *9* will match all the *9* in the text;

![Basic Matching](./data/Regex/basic_match.png)

And so on.

In addition to this, there is a set of “special” characters (**metacharacters**) that have a special meaning in regex.


## Metacharacters
When executing a search pattern, regular expressions make use of special **metacharacters**, each with its own meaning and usage:

`. ^ $ * + ? { } [ ] \ | ( )`

|Expression|Matches|
|---|---|
|.|Matches any single character (except newline)|
|+| Matches one or more occurrences of a specific element |
|*| Matches zero or more occurrences of a specific element|
|?| Makes the previous element optional (0 or 1 time) |
|^| Matches the start of a line |
|$| Matches the end of a line|
|( )| Groups characters and captures them |
|[ ]| Matches one character from the set |
|{ }| Specifies exact or range of repetitions |
|\ | Escapes special characters or starts shortcuts |
| \| | Matches one pattern or another |



### The `.`

The period `.` matches any single character, including special characters and spaces, but not a newline:

![the dot match](./data/Regex/dot.png)
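
The same behaviour can be checked in Python. This is a minimal sketch, assuming a made-up `sample` string (it is not part of the lesson data); it shows that `.` stands in for a letter, a space, or a dash, but not for a newline:

```python
import re

# Hypothetical sample text: note the newline between "c" and "t" in the middle
sample = "cat c t c\nt c-t"

# "c", then any single character except a newline, then "t"
dot_matches = re.findall(r"c.t", sample)
print(dot_matches)  # ['cat', 'c t', 'c-t']
```

The sequence `c\nt` is skipped because the dot does not match the newline character.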

### Character Sets: The [...] brackets

We can define a set of potential characters to match by putting them in brackets `[]`:

|Expression|Matches|
|---|---|
|[ ]| Characters in brackets |
|[^ ]| Characters not in brackets |

We can specify exact characters to match:

* `[.,-]` Match a period, comma, or dash
* `[rs]` Match the lowercase letter r or s

![square_brackets](./data/Regex/square_brackets.png)
  
or we can specify a **range** to match, using values that are sequential, such as:

* `[A-Z]` Match any capital letter, from A to Z
* `[A-F]` Match any capital letter, from A to F
* `[a-z]` Match any lowercase letter, a to z
* `[A-Fa-f]` Match any letter from A to F, regardless of case
* `[0-3]` Match any digit from 0 to 3

![ranges_letters](./data/Regex/ranges_letters.png)

![ranges_numbers](./data/Regex/ranges_numbers.png)

We can use the `[]` brackets also to exclude certain characters: 

* `[^t]` Match any character that is not lowercase t

![not_square_brackets](./data/Regex/not_square_brackets.png)
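
As a rough illustration in Python, the sketch below applies a few sets and ranges to an invented `sample` string (an assumption for demonstration only):

```python
import re

# Hypothetical sample text
sample = "Grade B, room 204, seat f7"

# Any letter from A to F, in either case
hex_letters = re.findall(r"[A-Fa-f]", sample)
print(hex_letters)  # ['a', 'd', 'e', 'B', 'e', 'a', 'f']

# Any digit from 0 to 3
low_digits = re.findall(r"[0-3]", sample)
print(low_digits)  # ['2', '0']

# Any character that is NOT a letter and NOT a space
others = re.findall(r"[^a-zA-Z ]", sample)
print(others)  # [',', '2', '0', '4', ',', '7']
```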


### Quantifiers: * + ? and {...}

**Quantifiers** specify how many times the preceding character (or group) may repeat.

|Expression|Matches|
|---|---|
|\*| 0 or more |
|+| 1 or more |
|?| 0 or 1 |
|{4}| Exact number |
|{3,6}| Minimum to maximum range |


* **Asterisk**: We put an asterisk `*` after a character to indicate that the character may occur zero or more times:

![asterix](./data/Regex/asterix.png)

* **Plus sign**: We put a plus sign `+` after a character to indicate that the character must occur at least once, but may occur more than once:

![plus_sign](./data/Regex/plus_sign.png)


* **Question mark**: We put a question mark `?` after a character to indicate that the character is optional (it may occur zero or one time):

![question_mark](./data/Regex/question_mark.png)


#### The {...} brackets

To express an exact number of occurrences of a character, we place **curly braces `{n}`** immediately after the character. The number n indicates how many times the character must occur in a row:

![curly_brackets](./data/Regex/curly_brackets.png)

To express **at least a certain number of occurrences**, we use **`{n,}`** after the character. This means the character must occur n times or more.

To express that a character may occur **within a certain range of times**, we use curly braces **`{x,y}`**. This means the character can occur at least x times and at most y times:

![curly_brackets_1](./data/Regex/curly_brackets_1.png)
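
The quantifiers can be compared side by side in Python. The following is a minimal sketch on an invented `sample` string (an assumption, not lesson data), where each quantifier applies to the letter `e`:

```python
import re

# Hypothetical sample: "b" followed by zero to four "e" characters
sample = "b be bee beee beeee"

zero_or_more = re.findall(r"be*", sample)
print(zero_or_more)  # ['b', 'be', 'bee', 'beee', 'beeee']

one_or_more = re.findall(r"be+", sample)
print(one_or_more)  # ['be', 'bee', 'beee', 'beeee']

optional = re.findall(r"be?", sample)
print(optional)  # ['b', 'be', 'be', 'be', 'be']

exactly_two = re.findall(r"be{2}", sample)
print(exactly_two)  # ['bee', 'bee', 'bee']

at_least_three = re.findall(r"be{3,}", sample)
print(at_least_three)  # ['beee', 'beeee']

two_to_three = re.findall(r"be{2,3}", sample)
print(two_to_three)  # ['bee', 'beee', 'beee']
```

Note that quantifiers are greedy by default: `be{2,3}` takes three `e` characters whenever three are available.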

### Groups: the (...) brackets

Parentheses `(...)` are used to group parts of a regex together. Grouping lets us treat multiple characters as a **single unit** that must occur together, extract sub-patterns, or organize complex expressions.

|Expression|Matches|
|---|---|
|(A\|B\|C)| Capital A or capital B or capital C|


#### The `|` 

The pipe character `|` works like a logical OR. It lets you match one expression or another expression:

![pipe_character](./data/Regex/pipe_character.png)
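
A short Python sketch of grouping and alternation, using invented sample strings (assumptions for illustration). It shows the full match, the text captured by the group, and a quantifier applied to a whole group:

```python
import re

# Hypothetical sample text
sample = "gray grey griy"

# The group (a|e) accepts either "a" or "e" between "gr" and "y"
pattern = re.compile(r"gr(a|e)y")

# group(0) is the whole match, group(1) is what the parentheses captured
whole = [m.group(0) for m in pattern.finditer(sample)]
captured = [m.group(1) for m in pattern.finditer(sample)]
print(whole)     # ['gray', 'grey']
print(captured)  # ['a', 'e']

# A quantifier after a group repeats the whole unit "ab", not just "b"
repeated = re.search(r"(ab){2}", "ababab").group(0)
print(repeated)  # abab
```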


### Anchors 

An anchor helps search particular text areas, such as string beginnings, string endings, or word boundaries.

|Expression|Matches|
|---|---|
|^| Beginning of string |
|$| End of string |
|\b| Word boundary |
|\B|Not a word boundary |

#### The `^`

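The caret `^` matches only at the beginning of the string. When the multiline flag is enabled (`re.MULTILINE` in Python), it also matches at the beginning of each line.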

![caret_sign](./data/Regex/caret_sign.png)

#### The `$` 

The dollar sign `$` matches only at the end of the string. When the multiline flag is enabled (`re.MULTILINE` in Python), it also matches at the end of each line.

![dollar_sign](./data/Regex/dollar_sign.png)
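
In Python, whether `^` and `$` apply to the whole string or to each line depends on the `re.MULTILINE` flag. The sketch below uses an invented three-line `sample` (an assumption for illustration) to show the difference:

```python
import re

# Hypothetical sample with three lines
sample = "red apple\napple pie\ngreen apple"

# Default: ^ only matches at the very start of the string
start_default = re.findall(r"^apple", sample)
print(start_default)  # []

# With re.MULTILINE, ^ also matches at the start of each line
start_multiline = re.findall(r"^apple", sample, flags=re.MULTILINE)
print(start_multiline)  # ['apple']

# Default: $ only matches at the very end of the string
end_default = re.findall(r"apple$", sample)
print(end_default)  # ['apple']

# With re.MULTILINE, $ also matches at the end of each line
end_multiline = re.findall(r"apple$", sample, flags=re.MULTILINE)
print(end_multiline)  # ['apple', 'apple']
```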

### Word boundary (\b and \B)

In regular expressions, `\b` and `\B` are word-boundary anchors. They do not match actual characters. Instead, they match positions in a string.

`\b` matches a position where a word character (defined as [A-Za-z0-9_]) is next to a non-word character (or the start/end of the string). In other words, it matches the start of a word and the end of a word, but not when the character is inside another word: 

Examples:
* \bcat\b matches "cat"
* \bcat\b does not match "catalog"
* \bcat matches "cat" in "catfish"

![boundary_word](./data/Regex/boundary_word.png)

`\B` matches a position that is not a word boundary. This means the position is between two word characters, or between two non-word characters.

Examples: 
* \Bcat matches "cat" in "bobcat", but not in "catalog" (there, "cat" starts at a word boundary)
* \Bcat\b matches "cat" at the end of a longer word such as "bobcat", but not as a standalone word

![not_boundary_word](./data/Regex/not_boundary_word.png)
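
Because `\b` and `\B` match positions rather than characters, it helps to look at where each match starts. The following minimal sketch uses an invented `sample` string (an assumption for illustration) and records the start index of every match:

```python
import re

# Hypothetical sample: "cat" alone, at the start, at the end, and in the middle of a word
sample = "cat catalog bobcat concatenate"
# Start indexes: cat=0, catalog=4, bobcat=12, concatenate=19

def starts(regex):
    """Return the start index of every match of regex in sample."""
    return [m.start() for m in re.finditer(regex, sample)]

standalone = starts(r"\bcat\b")
print(standalone)  # [0]

word_start = starts(r"\bcat")
print(word_start)  # [0, 4]

not_word_start = starts(r"\Bcat")
print(not_word_start)  # [15, 22]

word_end_only = starts(r"\Bcat\b")
print(word_end_only)  # [15]
```

Index 15 is the "cat" in "bobcat" and index 22 is the "cat" in "concatenate".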

### The `\` 
If we want to include in the pattern one of the special metacharacters, we can use the \ escape character:

* `\.`: Matches a literal dot
* `\*`: Matches a literal asterisk
* `\$`: Matches a literal dollar sign

If you want to match a backslash itself, you need to write `\\`
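
To see why escaping matters, compare an escaped and an unescaped dot. This is a small sketch on invented strings (assumptions for illustration only):

```python
import re

# Unescaped dot: matches ANY character between "3" and "50"
loose = re.findall(r"3.50", "3.50 3x50")
print(loose)  # ['3.50', '3x50']

# Escaped dot: matches only a literal period
strict = re.findall(r"3\.50", "3.50 3x50")
print(strict)  # ['3.50']

# Hypothetical sample containing a dollar sign, an asterisk, and a backslash
sample = r"Price: $3.50 (was $4*) in C:\shop"

prices = re.findall(r"\$\d\.\d\d", sample)
print(prices)  # ['$3.50']

stars = re.findall(r"\*", sample)
print(stars)  # ['*']

# The pattern \\ matches a single literal backslash
backslashes = re.findall(r"\\", sample)
print(len(backslashes))  # 1
```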

### Character Classes

Using the metacharacters in our search pattern will allow us to search particular classes of characters.

|Expression|Matches|
|---|---|
|.| Any character except a new line `\n` |
|\d| A digit (0-9) |
|\D| Not a digit |
|\w| Word character (a-z, A-Z, 0-9, \_) |
|\W| Not a word character (this includes spaces and new lines) |
|\s| Whitespace (space, tab, new line) |
|\S| Not a whitespace |

* **Word character `\w`**:
  
The expression `\w` matches any single letter, digit, or underscore. It does not match spaces, punctuation, or other special characters.

![word_character](./data/Regex/word_character.png)

NB: use **`\w+`** to match one or more consecutive word characters, which often corresponds to full words.

* **Except word character `\W`**
The expression `\W` is used to find characters other than letters, numbers, and underscores. This includes spaces, punctuation, and symbols.

![not_word_character](./data/Regex/not_word_character.png)

* **Number Character `\d`** 
The expression `\d` matches any single digit (0–9).

![digit_character](./data/Regex/digit_character.png)

* **Except Number Character `\D`**:
The expression `\D` matches any character that is NOT a digit.

![not_digit_character](./data/Regex/not_digit_character.png)

* **Space Character `\s`**:
The expression `\s` matches any whitespace character, including a space (` `), a tab (`\t`), and a newline (`\n`).

![space_character](./data/Regex/space_character.png)

* **Except Space Character `\S`**:
The expression `\S` matches any character that is NOT whitespace.

![not_space_character](./data/Regex/not_space_character.png)
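
The character classes can be tried out together in Python. The sketch below assumes an invented `sample` string (not lesson data). Note that for ordinary Python 3 strings, `\w` and `\d` also match letters and digits from other alphabets, not only the ASCII ranges listed above:

```python
import re

# Hypothetical sample containing letters, digits, an underscore, punctuation, and a tab
sample = "Room 12, Floor_3!\tOK"

words = re.findall(r"\w+", sample)
print(words)  # ['Room', '12', 'Floor_3', 'OK']

digits = re.findall(r"\d", sample)
print(digits)  # ['1', '2', '3']

non_word = re.findall(r"\W", sample)
print(non_word)  # [' ', ',', ' ', '!', '\t']

whitespace = re.findall(r"\s", sample)
print(whitespace)  # [' ', ' ', '\t']

non_space_runs = re.findall(r"\S+", sample)
print(non_space_runs)  # ['Room', '12,', 'Floor_3!', 'OK']
```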

# Part Two: Using the `re` module

The **re module** is a built-in Python module that allows us to use regular expressions in Python programs.

The re module offers a great deal of flexibility in working with regular expressions. The workflow for using `re` generally follows this format:

1. Import the `re` module and put the text being searched into a string;
2. Create a Regex object with `re.compile()`;
3. Pass the string into the compiled Regex object using a method such as:
    * `.findall()`
    * `.finditer()`
4. Return the matches

Let's examine these steps in a little more detail.


### Import the `re` module and put the search text into a string
Import the `re` module with
```import re```

In [2]:
import re

Create a variable containing the string object to be searched. This could be loaded from a file, such as a text, CSV, or JSON file. (For information on loading data from a file in Python, see [Python Intermediate 2](../Python%20Notebooks/Python_intermediate/python-intermediate-2.ipynb).)

### Create the Regex object with `re.compile()`

In Python, we usually write regular expressions as **raw strings** (e.g., `r"..."`) and use them with functions provided by the `re` module, such as `re.search()`, `re.findall()`, and `re.sub()`, or by compiling them with `re.compile()`. 

Raw strings are commonly used for regular expressions because regex syntax relies heavily on backslashes (`\`) and other special characters. In a raw string, backslashes are treated as literal characters rather than as escape sequences (such as `\n` for newline or `\t` for tab). This prevents Python from interpreting these sequences before the pattern is passed to the regex engine.

In [4]:
# A demonstration of a regular string with an escape character
string = 'Regular string: \n A new line is created. \n'
print(string)

# A demonstration of a raw string where the escape character is ignored
raw_string = r'Raw string: \n The new line escape character is ignored.'
print(raw_string)

Regular string: 
 A new line is created. 

Raw string: \n The new line escape character is ignored.


After defining the pattern as a raw string, we **compile the regular expression** with the `re.compile()` function to create a reusable Regex object that represents the pattern to be matched. It is not strictly necessary to use `re.compile()`: module-level functions such as `re.findall()` accept the pattern string directly. Compiling lets Python prepare the pattern once and reuse it, which can help when the same pattern is applied many times. On small documents the difference is insignificant, but it is a good practice because it gives the pattern a name and keeps larger searches efficient.
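
As a minimal sketch of this step, the example below compiles one pattern and reuses it across several strings. The `lines` list and the `phone_pattern` name are invented for illustration:

```python
import re

# Hypothetical data: a few short strings to search
lines = ["call 555-1234", "no number here", "fax 555-9876"]

# Compile once, then reuse the same Regex object for every line
phone_pattern = re.compile(r"\d{3}-\d{4}")
found = [phone_pattern.findall(line) for line in lines]
print(found)  # [['555-1234'], [], ['555-9876']]

# The module-level function gives the same result without an explicit compile step
same = [re.findall(r"\d{3}-\d{4}", line) for line in lines]
print(same == found)  # True
```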

### Pass the string to be searched into the Regex Object

The Regex Object in the last step established the pattern for the search. In this step, we pass the string to be searched to one of the Regex Object's methods.

A compiled Regex Object provides a variety of methods, including:

* **.findall()**
Return all non-overlapping pattern matches as a list of strings or tuples. Will return match groups if the pattern contains groups.
* **.finditer()**
Return an iterator that yields match objects over all non-overlapping matches.

Additional methods are documented in the official [Python re documentation](https://docs.python.org/3/library/re.html). 


### A basic example with `.findall()`

Let's consider the following data: 
```
Mr. alex arvison
work+arvison0@aol.com
323-423-4353

Mrs Dara Batha
d.batha1@bright.edu
102.343.3784

Ms T Lamcken
tlamcken-2@usda.gov
444|343|4387

Ms. M. Picardo
mpicardo_7@simplemachines.org
439|963|6284
```

In [1]:
# Import the re module
import re

# The text to search
text = '''
Mr. alex arvison
work+arvison0@aol.com
323-423-4353

Mrs Dara Batha
d.batha1@bright.edu
102.343.3784

Ms T Lamcken
tlamcken-2@usda.gov
444|343|4387

Ms. M. Picardo
mpicardo_7@simplemachines.org
439|963|6284
'''

If we want to **extract the phone numbers** of the document, we can study their patterns and write a regular expression. We see that phone numbers come in three different formats:

* 323-423-4353 → separated by dashes

* 102.343.3784 → separated by dots

* 444|343|4387 → separated by pipes (|)

So we need a **regex pattern** that matches: 3 digits, a separator, 3 digits, a separator, and 4 digits, where the separator can be `-`, `.`, or `|`.

* `\d{3}` indicates that the pattern starts with exactly 3 digits;
* we then create a **character set** like this: `[-.|]`, meaning that we want to match any one of the characters inside the `[]` brackets. Inside brackets, `.` and `|` are treated as literal characters, so they need no escaping.

We write the whole pattern by combining these elements together:

* `\d{3}[-.|]\d{3}[-.|]\d{4}`

Lastly, we add **word boundaries** with the `\b` anchor to ensure we don’t match numbers inside longer strings.

Putting it together, we have `\b\d{3}[-.|]\d{3}[-.|]\d{4}\b`

In [2]:
# Compile a Regex Object
# Search for phone numbers such as 323-423-4353
pattern = re.compile(r'\b\d{3}[-.|]\d{3}[-.|]\d{4}\b')

In [3]:
# Use the `.findall()` method to gather all the matches into a list
matches = pattern.findall(text)

In [4]:
# Print the list of matches
print(matches)

['323-423-4353', '102.343.3784', '444|343|4387', '439|963|6284']


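If the expression passed into `re.compile()` contains no groups, then the output will be a list of matching strings. If the expression contains exactly one group, the output will be a list of strings holding only that group's text. If it contains more than one group, the output will be a list of tuples, one tuple per match, containing only the groups.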
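
The effect of groups on the output of `.findall()` can be sketched as follows, using an invented `sample` string with two dates (an assumption for illustration):

```python
import re

# Hypothetical sample text
sample = "2024-01-15 and 2025-07-01"

# No groups: a list of full matches
no_groups = re.findall(r"\d{4}-\d{2}-\d{2}", sample)
print(no_groups)  # ['2024-01-15', '2025-07-01']

# One group: a list of strings containing only that group
one_group = re.findall(r"(\d{4})-\d{2}-\d{2}", sample)
print(one_group)  # ['2024', '2025']

# Several groups: a list of tuples, one tuple per match
many_groups = re.findall(r"(\d{4})-(\d{2})-(\d{2})", sample)
print(many_groups)  # [('2024', '01', '15'), ('2025', '07', '01')]
```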

Another example of analysis could be to group elements by honorific, first and last name:

* honorific: the honorifics appear as "Mr.", "Mrs", "Ms", and "Ms.". The pattern can therefore be described as `(M[rs]+\.?)`, where `M` matches the literal M (start of the title), `[rs]+` matches one or more of the letters r or s, and `\.?` matches an optional period.
* first and last name: the pattern can be described as `\s(\w+\.?)\s(\w+)`, where each `\s` matches a single whitespace character (after the honorific and after the first name), the group `(\w+\.?)` matches one or more word characters (`\w+`) followed by an optional literal period (`\.?`), and the group `(\w+)` matches the last name. The period must be escaped: an unescaped `.` would match any character rather than a literal period.

Putting these elements together, we have `(M[rs]+\.?)\s(\w+\.?)\s(\w+)`.

In [5]:
# Grouping by Honorific, First Name, Last Name
pattern = re.compile(r'(M[rs]+\.?)\s(\w+\.?)\s(\w+)')

matches = pattern.findall(text)
print(matches)

[('Mr.', 'alex', 'arvison'), ('Mrs', 'Dara', 'Batha'), ('Ms', 'T', 'Lamcken'), ('Ms.', 'M.', 'Picardo')]


### A basic example with `.finditer()`


In [10]:
# Compile a Regex Object
# Search for honorific, first name, and last name
pattern = re.compile(r'(M[rs]+\.?)\s(\w+\.?)\s(\w+)')

# Use the `.finditer()` method to gather all the matches
# into an iterable "match object".
matches = pattern.finditer(text)

# Iterate over the matches and print them out
for match in matches:
    print(match)

<re.Match object; span=(1, 17), match='Mr. alex arvison'>
<re.Match object; span=(54, 68), match='Mrs Dara Batha'>
<re.Match object; span=(103, 115), match='Ms T Lamcken'>
<re.Match object; span=(150, 164), match='Ms. M. Picardo'>


When using the `.finditer()` method, each match object contains two important pieces of information:

* **span**: The starting and ending index number for the match within the searched string.
* **match**: The actual characters from the string which fulfilled the Regex Object match. 

In [11]:
# Verifying the index number slice for the last match, span=(150, 164)
print(text[150:164])

Ms. M. Picardo


When using `finditer()`, the groups within a match can be referenced using the `.group()` method.

* `.group(0)` returns the full match
* `.group(1)` returns the first group
* `.group(2)` returns the second group

In [12]:
# Compile a Regex Object
# Search for honorific, first name, and last name
pattern = re.compile(r'(M[rs]+\.?)\s(\w+\.?)\s(\w+)')

# Use the `.finditer()` method to gather all the matches
# into an iterable "match object".
matches = pattern.finditer(text)

# Iterate over the matches and print them out
for match in matches:
    print(match.group(0))

Mr. alex arvison
Mrs Dara Batha
Ms T Lamcken
Ms. M. Picardo
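
The individual groups and the span can be collected in the same loop. The sketch below is self-contained and uses a shortened two-line `sample` rather than the full `text` above; the `records` list and its dictionary keys are illustrative names, not part of the `re` module:

```python
import re

# Shortened sample (two of the names from the lesson data)
sample = "Mr. alex arvison\nMrs Dara Batha"

pattern = re.compile(r"(M[rs]+\.?)\s(\w+\.?)\s(\w+)")

# Build one dictionary per match from its groups and span
records = []
for match in pattern.finditer(sample):
    records.append({
        "honorific": match.group(1),  # first group
        "first": match.group(2),      # second group
        "last": match.group(3),       # third group
        "span": match.span(),         # (start, end) index of the full match
    })

for record in records:
    print(record)

# Slicing the string with the span returns the full match, like group(0)
start, end = records[1]["span"]
print(sample[start:end])  # Mrs Dara Batha
```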
