<div align=right>
Winter 2025<br>
Nardin<br>
Lecture 1
</div>

<h1 align=center>Regular Expressions</h1>

<font color='darkblue'> <h2>Jupyter Notebooks</h2> </font>

For most of our classes, we will be using Jupyter notebooks. Jupyter notebooks allow mixing "markdown blocks," like the text you are reading now, and "code blocks," like the block below in grey. 

Markdown blocks:
* Markdown is a lightweight language for formatting text
* To learn the basics (add headers, bullet points, links, emphasis) see [here](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html)

Code blocks:
* To execute a code block: press Ctrl+Enter or the "Run" button in the command tab at the top of the window
* To comment out multiple lines of code within a block: select them and press Ctrl+/ 
* When you open a new notebook it is good practice to clear any pre-existing output: "Cell" > "All Output" > "Clear"
* When you submit homework for this course the output of your code must be displayed: "Cell" > "Run All" (the code must be free of bugs)

Modify, delete, or add new cells in any Jupyter notebook:
* To modify a cell, click on it and change it
* To insert a new cell, click on the "Insert" tab > "Insert Cell Below" (you can decide whether you want it to be as code or Markdown)
* To delete a cell, click on it, then "Edit" > "Delete Cells"

In [1]:
# to execute it press Ctrl+Enter or "Run" at the top of the screen
print("test message")
"last line"

test message


'last line'

<font color='darkblue'> <h2>Learning Objectives</h2> </font>

* Define what regular expressions are and what they do
* Use the most important functions of the <code>re</code> module: search(), match(), findall(), split(), sub()
* Recognize and use metacharacters and quantifiers 
* Apply regular expression syntax to create simple and medium difficulty level matching patterns 
* Understand grouping and backreferences

<font color='darkblue'> <h2>Definition</h2> </font>
    
**Regular expressions** ("regex" or "regexes") **are strings containing normal characters and/or special metacharacters.** They describe a specific pattern to match in a given text. More formally, regex is a *language in its own right* used for *pattern matching* in many programming languages.

Regular expressions appear in Python and many other contexts (Java, R, etc).

Given our ability to manipulate strings (with `find`, `replace`, `strip`, etc.) and test whether some string contains another, using the `in` operator, you might wonder why we <em>need</em> regular expressions... <b>because they are powerful!</b> Regular expressions allow us to find complex patterns in any task that deals with text. 

For example:
* Extract information from texts (e.g. dates, currency conversion rates, all past tenses in a text)
* Perform textual substitutions (e.g. find-replace)
* Verify text format (e.g. "Is this email/phone-number/address/serial-number valid?")
* Clean textual data

Note that regular expressions only work on strings!

The downside is that regular expressions are a language of their own, and it takes practice to remember and master them.

<font color='darkblue'> <h2>Simple Searches WITHOUT Regular Expressions</h2> </font>

Let's first explore how to perform simple searches using what we know so far (i.e., without regular expressions). We'll define a word and a pattern, then check if the pattern is included in the word (for fun, we can set up the code to accept user input).

In [14]:
# Example: pattern matching without regex using "in"

word = input("Enter a word: ")
pattern = 'as'

if pattern in word:
    print("The pattern matches the input word")
else:
    print("The pattern doesn't match the input word")

The pattern doesn't match the input word


This is fine for a simple substring pattern, but it doesn't work for more complex patterns. For example, it cannot find strings in which an <code>a</code> is followed by an <code>s</code> when the two letters are not adjacent. That's why we want to use Python's regular expression module, <code>re</code>!
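
To see why, here is a minimal sketch of what an "a followed somewhere later by s" check looks like with plain string methods only. The helper name `has_a_then_s` is hypothetical, introduced just for this illustration:

```python
def has_a_then_s(word):
    """Return True if an 'a' is followed, anywhere later in the word, by an 's'."""
    first_a = word.find("a")          # index of the first 'a', or -1 if there is none
    if first_a == -1:
        return False
    return "s" in word[first_a + 1:]  # look for an 's' only after that 'a'

print(has_a_then_s("paoos"))  # the 'a' comes before the 's'
print(has_a_then_s("sofa"))   # the 's' comes before the 'a'
```

Every new variation (exactly two characters in between, several alternative letters, and so on) would need more hand-written code like this, whereas a regular expression states the same condition in one short pattern.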

<font color='darkblue'> <h2>Simple Searches WITH Regular Expressions</h2> </font>

### search()
Here is how we could do the exact same search with Python's pattern matching tools. We use the <code>search()</code> function in the <code>re</code> module (we will illustrate all the main functions in the next section):

In [5]:
# import regular expression library 
import re

In [18]:
# Example: pattern matching with regex

word = input("Enter a word: ")
pattern = 'as'

if re.search(pattern, word): 
    print("The pattern matches the word")
else:
    print("The pattern doesn't match the word")

The pattern doesn't match the word


<code>re.search()</code> takes two arguments: a pattern and a string: 
* if the search is successful (true), the function returns a "match object" 
* if it is not successful (false), the function returns <code>None</code>
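
As a small sketch of the two outcomes described above, the cell below inspects a match object and a failed search. The example words are made up for illustration; `group()`, `start()`, and `end()` are methods of the match object:

```python
import re

m = re.search("as", "pasta")
if m:                          # a match object counts as True in a condition
    print(m.group())           # the text that was matched
    print(m.start(), m.end())  # where the match starts and ends (end is exclusive)

miss = re.search("as", "pizza")
print(miss is None)            # no match, so the function returned None
```

Because `None` counts as False, the pattern `if re.search(...):` used in the cell above works without any explicit comparison.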

So far, this seems like just a complication, but with <code>re.search()</code> we can specify more complex patterns that would not be feasible, or would require a lot of coding, otherwise.

For example, with regular expressions we can easily find two letters that are not adjacent (like an "a" and an "s" separated by other characters, vs "as"):

In [None]:
# create an empty list named 'pats' for patterns
pats = []

# an 'a', followed by exactly two characters of any kind, followed by an 's'
pats.append('a..s')

# an 'a', followed by any number of characters (including none), followed by an 's'
pats.append('a.*s')

print(pats)

['a..s', 'a.*s']


In [17]:
word = input('Enter a word: ')

for pat in pats:
    if re.search(pat, word):
        print("The pattern", pat, "matches")
    else:
        print("The pattern", pat, "doesn't match")
        
# the letter 'a' does not have to be at the beginning: for example, 'paoos' matches both patterns

The pattern a..s matches
The pattern a.*s matches


### findall()

Let's try another example, this time using another function from the `re` module: `findall()`:

In [None]:
# find all occurrences of the word "regex" in this string

a = "Let's find the word 'regex' using regexes!"
re.findall(r'regex', a)
# The r before the string marks a raw string: Python leaves the backslashes in it untouched.
# Backslashes matter because regular expressions use them for special sequences (e.g., \w matches any word character).
# To count how many times the word is matched, use the len() function: len(re.findall(r'regex', a))

['regex', 'regex']

Note we used <b>raw strings</b> in this example. Raw strings in Python are normal strings prefixed with an "r". We could have done without them in the example we just ran, but in most cases they are useful because they make Python treat the backslash \ as a literal character rather than as an escape character.

Compare what happens if we add a tab-space "\t" and a new-line "\n" to a regular string VS. a raw string:

In [19]:
# regular string
print("This\t will do a tab space and \nthis will go on a new line\n")

# raw string
print(r"This\t won't do a tab space and \nthis won't go on a new line\n")

This	 will do a tab space and 
this will go on a new line

This\t won't do a tab space and \nthis won't go on a new line\n


Why does this matter with regular expressions? 

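In many cases it does not matter whether you preface a regular expression with an `r` or not, because you will get the same results. It matters whenever the pattern contains a backslash sequence that Python itself would interpret before the regex engine sees it. Conceptually this happens:

* `re.findall("\n", text)`: Python turns `\n` into a newline character first, so the regex matches a new line in text
* `re.findall(r"\n", text)`: the regex engine receives a backslash followed by 'n' and interprets it as a new line itself, so the result is the same
* `re.findall("\bcat\b", text)`: Python turns `\b` into a backspace character, so the regex looks for "cat" surrounded by backspaces
* `re.findall(r"\bcat\b", text)`: the regex engine receives `\b` and interprets it as a word boundary, so the regex matches "cat" as a whole word
* `re.findall(r"\\n", text)`: matches a literal backslash followed by the character 'n' in text

Take home point: it is good practice to use raw strings when writing regular expressions, even when not strictly necessary.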
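
The cell below is a small sketch of when the `r` prefix changes the result. The strings `text` and `path` are made-up examples:

```python
import re

text = "cat concat\ncatalog"

# Both spellings match the newline: the regex engine understands \n on its own
newline_plain = re.findall("\n", text)
newline_raw = re.findall(r"\n", text)

# Here the raw string matters: r"\b" is a word boundary, "\b" is a backspace character
whole_word_raw = re.findall(r"\bcat\b", text)
whole_word_plain = re.findall("\bcat\b", text)

# To match a literal backslash followed by 'n', escape the backslash inside a raw string
path = r"C:\new"
literal_backslash_n = re.findall(r"\\n", path)

print(newline_plain == newline_raw)
print(whole_word_raw, whole_word_plain)
print(literal_backslash_n)
```

The word-boundary case is the one that most often causes silent bugs, since the non-raw pattern is still valid and simply finds nothing.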

<font color='darkblue'> <h2> Key functions in re module: findall(), search(), match(), split(), sub()</h2> </font>

So far, we have seen findall() and search() from the `re` module:

* <code>re.findall()</code>: scans the ENTIRE string and returns ALL occurrences that match the given pattern 
* <code>re.search()</code>: scans the ENTIRE string but returns only the FIRST occurrence that matches the given pattern.

While `re.findall()` returns a list with all matches, `re.search()` returns a match object, or `None` if no match is found:

In [20]:
a = "It is fun to find the word 'regex' using regexes"

print(re.findall(r"regex", a))
#type(re.findall(r"regex", a))

['regex', 'regex']


In [21]:
print(re.search(r"regex", a))
#type(re.search(r"regex", a))

<re.Match object; span=(28, 33), match='regex'>


Meaning:
* <code>span=(28, 33)</code>: the portion of the string in which the match was found; it starts at character position 28 and goes up to, but does not include, position 33
* <code>match='regex'</code>: the characters from the string that were matched

We can also display the result as a boolean True or False, with <code>bool(re.search("regex", a))</code>:

In [22]:
bool(re.search("regex", a))

True

### match()

A variant of <code>re.search()</code> is <code>re.match()</code>. They both return a match object, but while <code>re.search()</code> scans the ENTIRE string, <code>re.match()</code> looks for a match only at the BEGINNING of the string:

In [23]:
b = "123abc"

if re.match("abc", b):
    print("abc found with re.match")
    
if re.search("abc", b):
    print("abc found with re.search")

abc found with re.search


Note that <code>re.match('pattern', string)</code> is equivalent to <code>re.search('^pattern', string)</code>, as long as the <code>re.MULTILINE</code> flag is not used. See below for the meaning of the ^ sign.

### split() and sub()

Other useful functions in the Python `re` module are:

* <code>re.split()</code>: split text based on patterns 
* <code>re.sub()</code>: replace a given match

In [24]:
c = "This is a sentence! re.split will split it based on the pattern! re.sub will replace the pattern"
print(c)

This is a sentence! re.split will split it based on the pattern! re.sub will replace the pattern


In [25]:
re.split(r"!", c)

['This is a sentence',
 ' re.split will split it based on the pattern',
 ' re.sub will replace the pattern']

In [26]:
re.sub(r"!", ";", c)

'This is a sentence; re.split will split it based on the pattern; re.sub will replace the pattern'
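
Both functions also accept an optional argument that limits how many matches are used: `maxsplit` for `re.split()` and `count` for `re.sub()`. A minimal sketch, using a made-up string `grades`:

```python
import re

grades = "A!B!C!D"

first_split = re.split(r"!", grades, maxsplit=1)  # split only at the first match
first_sub = re.sub(r"!", ";", grades, count=1)    # replace only the first match

print(first_split)
print(first_sub)
```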

**Summary of the most common regex functions:**

<table width=80%>
    <tr>
        <td width=60px><code>match</code></td>
        <td align="left"> determines if the RE matches at the beginning of the string </td>
    </tr>
    <tr>
        <td><code>fullmatch</code></td>
        <td align="left"> determines if the RE matches the whole string </td>
    </tr>
    <tr> 
        <td><code>search</code></td>
        <td align="left"> scans through a string, looking for any location where this RE matches </td>
    </tr>
     <tr>
        <td><code>findall</code></td>
        <td align="left"> finds all substrings where the RE matches, and returns them as a list </td>
    </tr>
    <tr>
        <td><code>split</code></td>
        <td align="left"> splits the string by RE pattern </td>
    </tr>
    <tr>
        <td><code>sub</code></td>
        <td align="left"> replaces a given RE match </td>
    </tr>
    
</table>
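
The table lists `fullmatch`, which has not been demonstrated yet. The sketch below compares it with `match` and `search` on the string "123abc" used earlier; the `\d+` and `[a-z]+` patterns rely on metacharacters and quantifiers that are explained in the following sections:

```python
import re

b = "123abc"

starts_with_digits = bool(re.match(r"\d+", b))           # digits at the beginning
only_digits = bool(re.fullmatch(r"\d+", b))              # the whole string must be digits
digits_then_letters = bool(re.fullmatch(r"\d+[a-z]+", b))
abc_anywhere = bool(re.search(r"abc", b))                # "abc" at any position

print(starts_with_digits, only_digits, digits_then_letters, abc_anywhere)
```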


<b>EXERCISE 1</b>

Run the cell below to define the string:

In [28]:
sentence = "start location: loop, mid location: UChicago, end location: UChicago"

1. Use <code>re.search()</code> to check whether it contains "UChicago" in any location:

In [41]:
re.search(r"UChicago", sentence)

<re.Match object; span=(36, 44), match='UChicago'>

2. Check whether it contains "UChicago" at the beginning of the string, and display a boolean result:

In [42]:
bool(re.match(r"UChicago", sentence))

False

3. Find all occurrences of "UChicago" in the string:

In [44]:
re.findall(r"UChicago", sentence)

['UChicago', 'UChicago']

4. Replace all occurrences of "UChicago" with "uchicago":

In [40]:
re.sub(r"UChicago", "uchicago", sentence)

'start location: loop, mid location: uchicago, end location: uchicago'

<font color='darkblue'> <h2>Special Characters</h2> </font>

Regexes are much more powerful than just replacing manually specified characters or full words: they also recognize <b>metacharacters</b>. These have a unique meaning to the regex matching engine and vastly enhance the capability of the search.

**Summary of the most common metacharacters and expressions:**

<table width=90%>
    <tr>
        <td width=60px><code>[abc]</code></td>
        <td align="left"> list of characters in <b>square brackets</b>, will match any one of them</td>
    </tr>
    <tr>
        <td><code>[^abc]</code></td>
        <td align="left"> with an initial caret, matches a single character <b>not in the brackets</b></td>
    </tr>
    <tr>
        <td><code> ^ </code></td>
        <td align="left"> caret anchors a match at the <b>start</b> of the string</td>
    </tr>
    <tr> 
        <td><code> &dollar; </code></td>
        <td align="left"> dollar sign anchors a match at the <b>end</b> of the string</td>
    </tr>
    <tr>
        <td width=60px><code> . </code></td>
        <td align="left"> period matches <b>any</b> single character</td>
    </tr>
      <tr>
        <td><code> \ </code></td>
        <td align="left"> escape character to <b>escape</b> any special character such as \. for "."</td>
    </tr>
    <tr>
        <td width=60px><code>\d</code> </td>
        <td align="left"> <b>one digit</b> character: matches a decimal digit character; same as [0-9] for ASCII text</td>
    </tr>
    <tr>
        <td> <code>\D</code> </td>
        <td align="left"> <b>one non-digit</b> character: matches any character that is not a decimal digit; same as [^0-9] for ASCII text</td>
    </tr>
    <tr>
        <td> <code>\w</code> </td>
        <td align="left"> <b>one word</b> character: matches any alphanumeric character (letters, digits, underscores, upper and lower case); same as [a-zA-Z0-9_] for ASCII text</td>
    </tr>
    <tr>
        <td> <code>\W</code> </td>
        <td align="left"> <b>one non-word</b> character: matches any non-alphanumeric character (anything that's not a letter, digit, or underscore); same as [^a-zA-Z0-9_] for ASCII text</td>
    </tr>
    <tr>
        <td> <code>\s</code> </td>
        <td align="left"><b>one whitespace</b> character: matches a space, tab, newline, or other whitespace character </td>
    </tr>
    <tr>
        <td> <code>\S</code> </td>
        <td align="left"> <b>one non-whitespace</b> character: matches any character that isn't a whitespace </td>
    </tr>
     <tr>
        <td> <code>\b</code> </td>
        <td align="left"> <b>word boundary</b>: matches the position between a word character and a non-word character, without consuming any character; allows you to perform a "whole words only" search </td>
    </tr>
</table>


A set of characters specified in square brackets <code>[]</code> makes up a <b>character class.</b> For example, if we want to determine whether a string contains any number of consecutive <b>decimal digit characters</b>, we could use <code>[0-9]</code>:

In [55]:
d = 'ciao_Good_123_12hi'

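# find three consecutive decimal digits
re.findall(r"\d{3}", d)    # with \d and the {3} quantifier (both explained below); this result is not displayed
re.findall('[0-9]'*3, d)   # same search: '[0-9]'*3 builds the pattern '[0-9][0-9][0-9]'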

['123']

In [None]:
# find two consecutive decimal digits
re.findall('[0-9][0-9]', d)
# equivalent: re.findall(r"\d{2}", d), since \d matches one digit

['12', '12']

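Notice that our regex finds 12 and 12 but not 23: <code>findall()</code> returns only non-overlapping matches, and the "2" of "23" was already used by the first match. How can we find them all? See the section on groupings and backreferences at the end of this notebook and [here](https://stackoverflow.com/questions/5616822/how-to-use-regex-to-find-all-overlapping-matches)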
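
One possible way to get the overlapping pairs, sketched here ahead of time, is to wrap the pattern in a lookahead that contains a capturing group (both constructs are introduced in the last section of this notebook). The lookahead consumes no characters, so the search can restart at every position:

```python
import re

d = 'ciao_Good_123_12hi'

# (?=...) checks what follows without consuming it; (\d\d) captures the two digits
overlapping = re.findall(r"(?=(\d\d))", d)
print(overlapping)
```

This is a sketch of one common technique, not the only option; the linked discussion covers alternatives.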

We might also want to check whether our string contains four <b>alphabetic characters between 'a' and 'z'</b> inclusive. For this we could use <code>[a-z]</code> or <code>[A-Z]</code>. Before running the code, can you guess what the following regex would match in string "d"?

In [None]:
re.findall('[a-zA-Z][a-z][a-z][a-z]', d)

# [a-zA-Z] finds any letter, [a-z] finds any lowercase letter, [A-Z] finds any uppercase letter

['ciao', 'Good']

To <b>match characters not in the brackets</b>, use the caret <code>^</code>. For example, here we use it to match all characters in string "d" that are not lowercase letters (note that the uppercase "G" is included):

In [59]:
re.findall('[^a-z]', d)

['_', 'G', '_', '1', '2', '3', '_', '1', '2']

If it is outside square brackets <code>[]</code>, the caret <code>^</code> has another meaning: it checks whether one or a set of characters are located at the <b>beginning of a string</b>:

In [60]:
# ^ only looks for matches at the beginning of the string
re.findall('^[a-z][a-z][a-z][a-z]', d)

['ciao']

The <code>$</code> checks whether one or a set of characters is located at the <b>end of a string</b>:

In [61]:
# $ only matches at the end of the string
re.findall('[a-z][a-z]$', d)

['hi']

The dot `.` is a <b>wildcard</b> that matches <b>any character</b> except for newline characters. It is useful when you have a mix of characters (letter, digit, whitespace, etc.) or when you do not have detailed information about the text you are trying to match. For example, we could use 4 dots to match the last four characters of sentence "d":

In [62]:
re.findall(r"....$", d)

['12hi']

In some cases, we need to match the `.` or any other special character <b>literally</b>. To do so, we need to <b>escape</b> the special characters using the backslash `\`. For example, to split sentences on their periods, we use `\` and <code>re.split()</code>:

In [63]:
e = "This is a sentence. This is a second sentence."

re.split(r"\.\s+", e) # requiring whitespace after the period avoids an empty string at the end (the final period is kept)

['This is a sentence', 'This is a second sentence.']
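
The sketch below shows what goes wrong when the dot is not escaped, using a made-up string `prices`. It also uses `re.escape()`, a function of the `re` module that adds the backslashes for you when the text to match literally comes from a variable:

```python
import re

prices = "3.99 3x99"

unescaped = re.findall(r"3.99", prices)   # the dot matches any character, including "x"
escaped = re.findall(r"3\.99", prices)    # the escaped dot matches only a literal "."
auto_escaped = re.escape("3.99")          # builds the escaped pattern automatically

print(unescaped)
print(escaped)
print(auto_escaped)
```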

The <b>"or"</b> operator is represented using the `|` character:

In [64]:
f = "My favorite color is red and Anne's favorite colour is green"
re.findall(r"color|colour", f)

['color', 'colour']

The "or" operator can also be represented by <b>grouping a set of characters</b> in square brackets `[]`, which is shorthand for `|` applied to each of the contents, but it only works for single characters (not for entire words). For example, we might have a set of unwanted characters to clean from a sentence:

In [65]:
# clean this sentence with |
g = "This&is#MACS&30122%at#UChicago." 
re.sub(r"&|#|%", " ", g) 

'This is MACS 30122 at UChicago.'

In [66]:
# clean this sentence with []
re.sub(r"[&#%]", " ", g)

'This is MACS 30122 at UChicago.'

Note that we used <code>re.sub()</code> in both options to replace the stray characters with spaces.

To represent ranges of common values as sets, we could use any combinations of <code>A-Za-z0-9</code>:

In [67]:
# matches any series of two letters and one number
h = "pa1, PA2, Pa3, pA4"
re.findall(r"[a-zA-Z][a-zA-Z][0-9]", h)

['pa1', 'PA2', 'Pa3', 'pA4']

We can obtain the same result with the following regular expression, which uses the special character <code>\d</code> rather than the set <code>[0-9]</code>:

In [68]:
re.findall(r"[a-zA-Z][a-zA-Z]\d", h)

['pa1', 'PA2', 'Pa3', 'pA4']

Let's use sentence "i" below to illustrate the following metacharacters `\d`, `\D`, `\w`, `\W`, `\s`, `\S`. Try to make a guess of the expected results before running the code:

In [69]:
i = "There are four PAs: PA1, PA2, PA3, PA4. They make 50% of the final grade"

In [70]:
# \d matches any decimal digit character, same as [0-9]
re.findall(r"PA\d", i)

['PA1', 'PA2', 'PA3', 'PA4']

In [71]:
# \D matches any character that is NOT a decimal digit, same as [^0-9]
re.findall(r"PA\D", i)

['PAs']

In [72]:
# \w matches any alphanumeric character, same as [a-zA-Z0-9_] 
re.findall(r"PA\w", i)

['PAs', 'PA1', 'PA2', 'PA3', 'PA4']

In [73]:
# \W matches any NON-alphanumeric character, same as [^a-zA-Z0-9_]
re.findall(r"\W", i)

[' ',
 ' ',
 ' ',
 ':',
 ' ',
 ',',
 ' ',
 ',',
 ' ',
 ',',
 ' ',
 '.',
 ' ',
 ' ',
 ' ',
 '%',
 ' ',
 ' ',
 ' ',
 ' ']

In [74]:
re.findall(r"\d\d\W", i) # grab non-word values with \W

['50%']

In [75]:
re.findall(r"\d\d\S", i) # the match above can be done also with \S

['50%']

In [76]:
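# \s matches any whitespace character
re.findall(r"\s", i)             # all whitespace characters (this result is not displayed)
re.findall(r"final\sgrade", i)   # only the last expression in a cell is displayed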

['final grade']

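Note that the shorthand sequences `\w`, `\W`, `\d`, `\D`, `\s`, `\S` can also appear inside a square-bracket character class, alongside ranges like <code>[0-9]</code> or <code>[a-z]</code>. For example, <code>[\d\s]</code> matches one character that is either a digit or a whitespace.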
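
As a brief sketch of a shorthand sequence inside square brackets, the class `[\d%]` below matches one character that is either a digit or a percent sign. The `+` quantifier (one or more repetitions) is explained in the Quantifiers section:

```python
import re

i = "There are four PAs: PA1, PA2, PA3, PA4. They make 50% of the final grade"

# one or more characters, each of which is a digit or a "%"
numbers_and_percent = re.findall(r"[\d%]+", i)
print(numbers_and_percent)
```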

<b>EXERCISE 2</b>

Extract the phone numbers from the string "l" below (use only what we have learned so far):

In [78]:
l = "My number is 444-3340, but Carl number is 555-3755, Richard number is 666-6681"

In [None]:
re.findall(r"\d\d\d-\d\d\d\d", l)
# \d\d\d matches three decimal digits
# - matches the hyphen character
# \d\d\d\d matches four decimal digits

['444-3340', '555-3755', '666-6681']

<font color='darkblue'> <h2> Quantifiers </h2> </font>

This solution works, but we can simplify it by specifying how many times a character repeats:

In [80]:
re.findall(r"[0-9]{3}-[0-9]{4}", l)

# re.findall(r"\d{3}-\d{4}", l)

['444-3340', '555-3755', '666-6681']

We simplified our search for phone numbers using so-called quantifiers. <b>Quantifiers</b> allow us to specify conditions such that a certain character must occur 0 or more times, 1 or more times, and so on: 

<table width=80%>
    <tr>
        <td width=60px><code> * </code></td>
        <td align="left"> star matches <b>0 or more</b> repetitions of the preceding character</td>
    </tr>
    <tr>
        <td><code> + </code></td>
        <td align="left"> plus matches <b>1 or more</b> </td>
    </tr>
    <tr>
        <td><code> ? </code></td>
        <td align="left"> question mark matches <b>0 or 1</b> </td>
    </tr>
    <tr>
        <td><code>{2}</code></td>
        <td align="left"> {} matches a specified number of repetitions, here exactly 2 </td>
    </tr>
    <tr>
        <td><code>{2,5}</code></td>
        <td align="left"> between 2 and 5 </td>
    </tr>
    <tr>
        <td width=60px><code>{2,}</code> </td>
        <td align="left"> 2 or more </td>
    </tr>
    <tr>
        <td> <code>{,5}</code> </td>
        <td align="left"> up to 5 </td>
    </tr>
</table>

Let's see a few more examples of quantifiers:

In [81]:
# another way to solve the previous example of sentence "l" using quantifiers

re.findall(r"\d+-\d+", l)

['444-3340', '555-3755', '666-6681']

In [82]:
# we could build in more flexibility by accommodating phone numbers that start with 2 or 3 digits

re.findall(r"\d{2,3}-\d{4,}", l) # {n,m}: at least n times, at most m times; {n,}: at least n times

['444-3340', '555-3755', '666-6681']

In [85]:
quantifiers = "zooooom in oh oh!"

In [86]:
# "o*" matches zero or more repetitions of the character "o"

print(re.findall(r"o*", quantifiers))

['', 'ooooo', '', '', '', '', '', 'o', '', '', 'o', '', '', '']


In [87]:
# "o+" matches one or more repetitions of the character "o"

print(re.findall(r"o+", quantifiers))

['ooooo', 'o', 'o']


In [88]:
# "o{5}" matches "o" exactly 5 contiguous times

print(re.findall(r"o{5}", quantifiers))

['ooooo']


In [89]:
# "o{1,5}" matches "o" between 1 and 5 contiguous times

print(re.findall(r"o{1,5}", quantifiers))

['ooooo', 'o', 'o']


In [90]:
# match "z" followed by one or more word characters, terminated by "m"

print(re.findall(r"z\w+m", quantifiers))

['zooooom']
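
The `?` and `{2,}` quantifiers from the table are not used in the cells above, so here is a short sketch of both, reusing the strings `f` and `quantifiers` defined earlier in this notebook:

```python
import re

f = "My favorite color is red and Anne's favorite colour is green"
quantifiers = "zooooom in oh oh!"

# "u?" makes the "u" optional, so both spellings match
both_spellings = re.findall(r"colou?r", f)

# "o{2,}" needs at least two "o" in a row, so the single "o" in "oh" is skipped
two_or_more = re.findall(r"o{2,}", quantifiers)

# "o{1,3}" takes at most three "o" at a time, so "ooooo" is split into two matches
up_to_three = re.findall(r"o{1,3}", quantifiers)

print(both_spellings)
print(two_or_more)
print(up_to_three)
```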


We can also represent ranges of common values as sets. The following regex matches any series of two letters and one number:

In [91]:
m = "pa1, PA2, Pa3, pA4"
re.findall(r"[a-zA-Z]{2}\d", m) 

['pa1', 'PA2', 'Pa3', 'pA4']

And we can negate values in the set using the caret <code>^</code>: 

In [92]:
re.findall(r"[^a-z]{2}\d", m)

['PA2']

To match the names in the sentence below we can use  <code>[a-z]</code> and literal matches along with the <code>*</code> quantifier which matches zero or more recurrences of a single character: 

In [93]:
n = "#richard said that #bob loved the dinner cooked by @$usan"
re.findall(r"[#@][a-z]*\$*[a-z]*", n) # match zero or more with asterisk

['#richard', '#bob', '@$usan']

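NB: Keep in mind that quantifiers like `*` or `+` or `?` apply only to the single item immediately to their left (one character, one character class such as `[a-z]`, or one parenthesized group) and not to the whole pattern that precedes them. So in the example above, each `*` applies only to the one item to its left: `[a-z]` or `\$`.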
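
A minimal sketch of this rule, on a made-up string: the same `+` repeats a different item in each pattern. The `(?:...)` syntax is a non-capturing group, which is not covered in this notebook; it is used here so that `findall()` returns the whole match rather than the group contents:

```python
import re

text = "abbb abab"

only_b_repeats = re.findall(r"ab+", text)       # "+" applies to "b" alone
pair_repeats = re.findall(r"(?:ab)+", text)     # "+" applies to the group "ab"
class_repeats = re.findall(r"[ab]+", text)      # "+" applies to the class [ab]

print(only_b_repeats)
print(pair_repeats)
print(class_repeats)
```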

<b>EXERCISE 3</b>

Build a regular expression pattern that matches all three emails in this string:

In [95]:
email = "smith90@uchicago.edu nardin@uchicago.edu happy2@uchicago.edu"

In [96]:
re.findall(r"\w+@\w+\.\w+", email)

['smith90@uchicago.edu', 'nardin@uchicago.edu', 'happy2@uchicago.edu']
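
This pattern works for the three addresses above, but it is worth seeing where it stops working. The sketch below uses a hypothetical address with dots in the user name and in the domain; the wider pattern is one possible adjustment, not a complete email validator:

```python
import re

dotted = "jane.doe@cs.uchicago.edu"   # made-up address for illustration

# \w does not match ".", so part of the address is lost
simple = re.findall(r"\w+@\w+\.\w+", dotted)

# allowing "." inside the character classes captures the full address
wider = re.findall(r"[\w.]+@[\w.]+\.\w+", dotted)

print(simple)
print(wider)
```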

<font color='darkblue'> <h2> Greedy & Non-Greedy matching </h2> </font>

So far, we've been doing a form of matching called greedy matching: this is the default matching for the regex quantifiers.

Examples of <b>Greedy</b> quantifiers are ?, *, +, and {,}: 
* They match <b>as many characters as possible</b> and return the longest string that matches the pattern from the first position where a match is possible (i.e., the regex consumes as much as it can, then backtracks until the rest of the pattern matches)
* Example: the regex "a+" will match as many "a" as possible in the string "aaaa", even though the substrings "a", "aa", "aaa" all match the regex "a+"

Examples of <b>Lazy</b> or <b>Non-Greedy</b> quantifiers are ??, *?, +?, and {,}?:
* They match <b>as few characters as possible</b> and stop at the first point where the rest of the pattern matches (i.e., the regex moves forward through the string one character at a time, and stops at the first match)
* Example: the regex 'a+?' will match as few "a" as possible in the string "aaaa". Thus, it matches the first character "a" and is done with it

In [97]:
n = "abc abbbc abbcabbbc"

In [98]:
# greedy matching: match 'a', followed by one-or-more word characters, terminated by 'c'

re.findall(r"a\w+c", n)  

['abc', 'abbbc', 'abbcabbbc']

In [99]:
# lazy matching: match 'a', followed by one-or-more word characters, terminated by 'c'

re.findall(r"a\w+?c", n) 

['abc', 'abbbc', 'abbc', 'abbbc']

Note that "abbcabbbc" is never matched as a whole, because the lazy quantifier stops at the first 'c'. Instead, there are two matches: "abbc" and "abbbc". 

Another example:

In [100]:
o = "AAAGCGCCCGGGA" 

In [101]:
# greedy matching

re.findall(r"G.*G", o) 

# GCG matches G.*G, but GCGCCCGGG is longer and starts at the same character, 
# so it is matched instead

['GCGCCCGGG']

In [102]:
# lazy matching

re.findall(r"G.*?G", o)

['GCG', 'GG']
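
A typical situation where the difference matters is text with repeated delimiters. The sketch below uses a made-up HTML snippet to show that the greedy pattern swallows everything between the first `<` and the last `>`, while the lazy one stops at each closing `>`:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy_tags = re.findall(r"<.*>", html)   # one long match from the first "<" to the last ">"
lazy_tags = re.findall(r"<.*?>", html)    # one match per tag

print(greedy_tags)
print(lazy_tags)
```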

<font color='darkblue'> <h2> Groupings and Backreferences </h2> </font>

Matches can be grouped according to patterns, using parentheses. The `group()` method of a match object gives access to the separate groups of the whole match, corresponding to the different parenthesized parts of the pattern.

Grouping syntax:

<pre><code>match = re.search("(pattern1)(pattern2)", string)

if match:
    print(match.group(1)) # prints the string that matches pattern 1
    print(match.group(2)) # prints the string that matches pattern 2
</code></pre>

Grouping example:

In [103]:
match = re.search(r'(\w+)(\s)(\d.+)', "Apple 3.99")

print(match.group(1))  # group 1 is (\w+)
print(match.group(3))  # group 3 is (\d.+)

Apple
3.99
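
Beyond numbered groups, a match object also offers `group(0)` for the whole match and `groups()` for all groups at once. The sketch below uses a slightly stricter pattern than the one above (an escaped dot between digits), chosen only for this illustration:

```python
import re

m = re.search(r"(\w+)\s(\d+\.\d+)", "Apple 3.99")

whole = m.group(0)            # the entire matched text
parts = m.groups()            # a tuple with the contents of every group
price = float(m.group(2))     # groups are strings, so convert before doing arithmetic

print(whole)
print(parts)
print(price)
```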


Use groups to match a sentence:

In [104]:
s = "Hello world!"
s1 = "This works only without commas or apostrophes."
s2 = "Test if s1 is true, with this sentence that contains a comma."

group = r'\w+(\s\w+)*[.?!]'
       
print(re.fullmatch(group, s)) 
print(re.search(group, s1))
print(re.search(group, s2))

<re.Match object; span=(0, 12), match='Hello world!'>
<re.Match object; span=(0, 46), match='This works only without commas or apostrophes.'>
<re.Match object; span=(20, 61), match='with this sentence that contains a comma.'>


Generally speaking, the above pattern <code>\w+(\s\w+)*[.?!]</code> matches a sentence. Specifically, it matches one or more word characters, followed by zero or more repetitions of the group (a single whitespace character followed by one or more word characters), all terminated by a period, question mark, or exclamation point (phew!).

With grouping we can also implement <b>look ahead</b> and <b>look behind</b> syntax:

<table width=100%>
    <tr>
        <td width=80px><code>e1(?=e2)</code></td>
        <td align="left"> <b>Positive Lookahead</b>: match example1 (e1) if text following it IS matched by example2 (e2); does not include e2 in the match</td>
    </tr>
    <tr>
        <td><code>e1(?!e2)</code></td>
        <td align="left"> <b>Negative Lookahead</b>: match example1 if text following it IS NOT matched by example2; does not include e2 in the match</td>
    </tr>
    <tr>
        <td><code>(?&lt;=e1)e2</code></td>
        <td align="left"> <b>Positive Lookbehind</b>: match example2 (e2) if the text preceding it IS matched by example1 (e1); does not include e1 in the match</td>
    </tr>
    <tr>
        <td><code>(?&lt;!e1)e2</code></td>
        <td align="left"> <b>Negative Lookbehind</b>: match example2 if the text preceding it IS NOT matched by example1; does not include e1 in the match</td>
    </tr>
</table>

In [105]:
# positive lookahead
# matches "foot" if followed by "print" 

r = 'footprint footstool'
re.findall(r'foot(?=print)', r)

['foot']

In [106]:
# negative lookahead
# matches "foot" if NOT followed by "print"

s = 'footprint footstool'
re.findall(r'foot(?!print)', s)

['foot']
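
The two lookbehind forms from the table work the same way in the other direction. A minimal sketch on a made-up string `prices`:

```python
import re

prices = "cost $45 or 30 euros"

# positive lookbehind: digits that come right after a "$" (the "$" is not part of the match)
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# negative lookbehind: whole numbers that do NOT come right after a "$"
# \b keeps the regex from starting in the middle of "45"
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)

print(after_dollar)
print(not_after_dollar)
```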

<b>Backreferences</b> allow us to reuse parts of regular expressions, <b>by referring back</b> (hence "backreference") to part of the previously captured match. 

For example, the regex <code>\b(\w+)\b\s+\1\b</code> matches repeated words, such as "ciao ciao", because the parentheses in (\w+) capture a word to Group 1 then the back-reference <code>\1</code> matches the characters that were captured by Group 1.

In [107]:
q = 'ciao ciao hope you are doing well'
re.findall(r'\b(\w+)\b\s+\1\b', q)

['ciao']

In [108]:
u = '1234 2323 84208420 9339 11 602601'
re.findall(r'\b(\d+)\1+\s+\b', u)

['23', '8420', '1']

In [109]:
pat = re.compile(r'\b(\d+)\1+\s+\b')
[m[0] for m in pat.finditer(u)]

['2323 ', '84208420 ', '11 ']

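Notice that in both cases, sentence "q" and sentence "u", findall() only reports the captured sequence in the output: when a pattern contains a capturing group, findall() returns the contents of the group rather than the whole match. If you want the whole repeated sequence (e.g. "ciao ciao", or "2323", "84208420", "11") you need to use <code>finditer()</code>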
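
The sketch below applies this to sentence "q": `findall()` returns the group, `finditer()` with `group(0)` returns the whole match, and `re.sub()` can reuse the captured word in the replacement string through the same `\1` backreference:

```python
import re

q = 'ciao ciao hope you are doing well'
repeated = r'\b(\w+)\b\s+\1\b'

captured_only = re.findall(repeated, q)                          # contents of group 1
whole_matches = [m.group(0) for m in re.finditer(repeated, q)]   # the full repeated sequence
deduplicated = re.sub(repeated, r'\1', q)                        # keep a single copy of the word

print(captured_only)
print(whole_matches)
print(deduplicated)
```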

<font color='darkblue'> <h2> Resources </h2> </font>

For further practice with regular expressions, I'd recommend the following resources:
* python RE library: https://docs.python.org/3/howto/regex.html
* in-depth explanation of python RE library: https://www.w3schools.com/python/python_regex.asp
* popular regexes page: https://www.regular-expressions.info/tutorial.html
* test your regex before applying it to your text: https://regex101.com/ and https://pythex.org/
* great book (for free at UChicago): Friedl, Jeffrey E. "Mastering Regular Expressions"
* more examples: https://learnbyexample.github.io/py_regular_expressions/cover.html