Open in colab: https://colab.research.google.com/github/brochhagen/nlpupf/blob/main/material/2026q2/session02/notebook0201-regex.ipynb

# Regular expressions

Regular expressions (REs or regex) are character sequences that specify a search pattern. They are commonly used to retrieve the strings that match a pattern, or to substitute the strings that match a pattern with another string.

Regular expressions are used in most operating systems and programming languages. They are fast and convenient.

In [None]:
import re #Python's package for REs
source = 'I went up to my bedroom. Absurd though the gesture was, I closed and locked the door.' #sentence from Borges's The Garden of Forking Paths

In [None]:
#findall returns a list of all occurrences of the pattern (first argument) in the string (second arg.)
target = re.findall('I', source)
print(target)

In [None]:
#sub returns the string, substituting pattern (1) by pattern (2) in source (3)
target = re.sub('I', 'We', source)
print(target)

In [None]:
#two substitutions in a row: first substitute every 'I' by 'we', then capitalize the 'we' at the start of the string (^)
target = re.sub('I', 'we', source)
target = re.sub('^we', 'We', target)
print(target)
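
The second substitution above relies on `^`, which anchors the pattern to the start of the string. As a minimal sketch of the difference the anchor makes, assuming a made-up sentence (the names `toy`, `everywhere` and `start_only` are illustrative and not part of the notebook):

```python
import re

toy = 'we think we know'  # hypothetical example string

# Without the anchor, every 'we' in the string is replaced
everywhere = re.sub('we', 'We', toy)

# With '^', only a 'we' at the very start of the string is replaced
start_only = re.sub('^we', 'We', toy)

print(everywhere)
print(start_only)
```

Comparing the two printed lines shows why the anchored version is the one to use when only the first word of a sentence should be capitalized.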

<div class="alert alert-block alert-success"> <b>Activity.</b> <br>
    <ol>1. Count how many spaces there are in "source"</ol>
    <ol>2. Replace all the spaces in "source" with "SPACE" </ol>
</div>

***

![](regular_expressions.png)


## A couple of notes on regular expressions

### Greedy vs. lazy

In [None]:
# Greedy matching
source = 'This string is composed of four sentences. One. Two. Three.'

#the r prefix marks a raw string, explained in the next section
re.sub(r'\..* ', '', source) #greedy substitution: substitute the longest possible match

In [None]:
re.sub(r'\..*? ', '', source) #lazy substitution: substitute the shortest possible match
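
To see what each pattern actually matches, it can help to inspect the matches with `findall` instead of removing them. The sketch below reuses the sentence from the cells above; the names `text`, `greedy` and `lazy` are illustrative assumptions.

```python
import re

text = 'This string is composed of four sentences. One. Two. Three.'

# Greedy: '.*' grabs as much as it can, so the match runs from the
# first period up to the last space in the string
greedy = re.findall(r'\..* ', text)

# Lazy: '.*?' grabs as little as it can, so each match stops at the
# first space that follows a period
lazy = re.findall(r'\..*? ', text)

print(greedy)  # ['. One. Two. ']
print(lazy)    # ['. ', '. ', '. ']
```

The greedy pattern yields one long match, while the lazy pattern yields three short ones. The final period is never matched because no space follows it.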

### String formatting in Python

In Python, the backslash `\` has a special meaning inside strings. For instance, you can use it to format your strings:

In [None]:
print('This string \nhas a line break')

This is important to know because patterns in regex use `\`. For instance, `\b` is a word boundary. If you want `python` not to interpret `\` as a special symbol, so that it is passed to `re` unchanged, you can either (i) escape it with a second backslash (`\\b`) or (ii) make `python` interpret the string as a so-called raw string (e.g., `r'this is a \b raw string'`).

In [None]:
#Option 1:
print('This string \\nhas no line break')

#Option 2:
print(r'This string \nhas no line break')
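
The same two options matter when a pattern is passed to `re`. In an ordinary string, `'\b'` is the backspace character, so a pattern written that way never reaches `re` as a word boundary. A minimal sketch, assuming a made-up sentence (the names `sentence`, `no_raw`, `escaped` and `raw` are illustrative):

```python
import re

sentence = 'I think India is Interesting, and so do I.'  # hypothetical example

# Ordinary string: '\b' is a backspace character, so nothing matches
no_raw = re.findall('\bI\b', sentence)

# Option 1: escape each backslash so that re receives \b
escaped = re.findall('\\bI\\b', sentence)

# Option 2: raw string, the backslashes are passed to re as they are
raw = re.findall(r'\bI\b', sentence)

print(no_raw)   # []
print(escaped)  # ['I', 'I']
print(raw)      # ['I', 'I']
```

Both options find the two standalone occurrences of "I" and skip the capital letters in "India" and "Interesting". Raw strings are usually easier to read once a pattern contains several backslashes.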

## A speech

In what follows, we will practice regex on an inaugural presidential speech from the USA: Joe Biden's 2021 address. We begin by scraping the speech from an archived copy of the https://www.whitehouse.gov webpage.

You do not need to understand every step of the scraping. At this early stage, the important thing is to take note of what can be done, not how it is done. If you're curious to learn more, the package `BeautifulSoup` is a popular web scraping library for `python`.


In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
#Biden speech
URL = "https://web.archive.org/web/20220328085206/https://www.whitehouse.gov/briefing-room/speeches-remarks/2021/01/20/inaugural-address-by-president-joseph-r-biden-jr/"
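page = requests.get(URL) #download the page

soup = BeautifulSoup(page.content, "html.parser") #parse the HTML
results = soup.find(id="content") #first element whose id is "content"
biden_speech = list(results.find_all("p")) #all paragraph (<p>) elements inside it
biden_speech = [str(x) for x in biden_speech] #convert each paragraph to a string, HTML tags included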

In [None]:
biden_speech[:5]

<div class="alert alert-block alert-success"> <b>Activity.</b> <br>
    <ul>
    <ol>1. Format "biden_speech" so that you can conveniently apply regex on it</ol>
    <ol>2. Clean the speech using regex. Minimally, remove HTML tags</ol>
    <ol>3. Count how many times the following strings appear in the speech:  
      <ul>
          <li>The word "America"</li>
          <li>The stem "America-" (e.g., "America" but also "American")</li>
          <li>The word "we", case-insensitive</li>
          <li>The word "I", case-insensitive</li>
          <li>The letter "i", case-insensitive</li>
      </ul>
    </ol>
    </ul>
</div>

***