# **SICSS-Paris 2025**
Tutorial for the course on regular expressions

Regexes, or regular expressions, are a powerful language that helps you spot and extract text according to a pattern. More plainly, they are the ultimate search & replace tool.

Don't forget you can always use regex checkers like [this one](https://regex101.com/).


**Searching for patterns**



In [1]:
import re

#Create text
text = """
User: alice
Email: alice@example.com
Date: 2025-06-21

User: bob
Email: bob@domain.org
Date: 2025-06-20

User: peter
Email: peter@domain.org
Date: 2025-06-21
"""


In [2]:
re.findall?

In [3]:
# Regex to return all lines that contain "alice"
matches = re.findall(r"^.*alice.*$", text, flags=re.MULTILINE)

for line in matches:
    print(line)

# re.MULTILINE makes ^ and $ match at the start and end of every line, not only at the start and end of the whole text

User: alice
Email: alice@example.com
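
To see what `re.MULTILINE` changes, here is a minimal sketch that runs the same pattern with and without the flag. The two-line `sample` string is made up for illustration and is not part of the tutorial data.

```python
import re

# Hypothetical two-line sample (assumption, not the tutorial's `text`)
sample = "User: alice\nEmail: alice@example.com"

# Without the flag, ^ and $ only anchor at the start and end of the whole string,
# and "." never crosses a newline, so the pattern cannot match here.
without_flag = re.findall(r"^.*alice.*$", sample)

# With re.MULTILINE, ^ and $ also anchor at the start and end of every line.
with_flag = re.findall(r"^.*alice.*$", sample, flags=re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['User: alice', 'Email: alice@example.com']
```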


In [None]:
# Regex to return all lines that contain a date (more precisely: any line with at least one digit)
matches = re.findall(r"^.*\d+.*$", text, flags=re.MULTILINE)

for line in matches:
    print(line)

Date: 2025-06-21
Date: 2025-06-20
Date: 2025-06-21
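
Note that `\d+` matches any run of digits, so it would also catch lines that have a number but no date. A stricter pattern can describe the `YYYY-MM-DD` shape explicitly. The sketch below uses a made-up `sample` (the `Room: 12` line is hypothetical) to show the difference. It only checks the shape of the date, not whether the date exists.

```python
import re

# Hypothetical sample with one line that has digits but no date
sample = "Date: 2025-06-21\nRoom: 12\nDate: 2025-06-20"

# Loose: any line with at least one digit
loose = re.findall(r"^.*\d+.*$", sample, flags=re.MULTILINE)

# Strict: four digits, dash, two digits, dash, two digits
strict = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", sample)

print(loose)   # ['Date: 2025-06-21', 'Room: 12', 'Date: 2025-06-20']
print(strict)  # ['2025-06-21', '2025-06-20']
```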


In [None]:
# Return the index of every line that contains a digit (here, the date lines)

lines = text.split("\n")
list_idx = []
for index, line in enumerate(lines):
    if re.search(r"\d+", line):
        list_idx.append(index)

print(list_idx)

In [None]:
# Return the (start, end) character offsets of each match in the text

positions = [(m.start(), m.end()) for m in re.finditer(r"^.*\d+.*$", text, flags=re.MULTILINE)]
positions

[(38, 54), (88, 104), (142, 158)]
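
These positions are character offsets into the string, with `end` exclusive, so slicing the text with them gives back the matched line. A small sketch on a shortened, made-up `sample`:

```python
import re

# Hypothetical shortened sample (assumption, not the tutorial's `text`)
sample = "\nUser: alice\nDate: 2025-06-21\n"

spans = []
for m in re.finditer(r"^.*\d+.*$", sample, flags=re.MULTILINE):
    start, end = m.span()  # same as (m.start(), m.end())
    spans.append((start, end))
    # Slicing with the offsets returns exactly the matched text
    print((start, end), repr(sample[start:end]))

# prints: (13, 29) 'Date: 2025-06-21'
```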

In [None]:
# To get the line index of each match in a multiline text, split it into lines and keep the matching ones

lines = text.split("\n")
[(i, re.findall(r"^.*\d+.*$", line)) for i, line in enumerate(lines) if re.search(r"\d+", line)]

[(3, ['Date: 2025-06-21']),
 (7, ['Date: 2025-06-20']),
 (11, ['Date: 2025-06-21'])]
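
An alternative that avoids splitting the text: count the newlines that come before the start of each match. This is only a sketch on a made-up `sample`. The line index it gives follows the same convention as `split("\n")` above, where the empty first line has index 0.

```python
import re

# Hypothetical sample with the same layout as one user block of `text`
sample = "\nUser: alice\nEmail: alice@example.com\nDate: 2025-06-21\n"

hits = []
for m in re.finditer(r"\d{4}-\d{2}-\d{2}", sample):
    # Number of newlines before the match = index of the line it sits on
    line_index = sample.count("\n", 0, m.start())
    hits.append((line_index, m.group()))

print(hits)  # [(3, '2025-06-21')]
```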

**Your turn**


In [None]:
# Write a regex to return all email addresses in 'text'


# Write a regex to return all indexes of the email addresses in 'text'



# **Replacing**
You can also use regexes to replace (or extract) content.


In [None]:
text = "Hello 123, this is a test 456"
cleaned = re.sub(r"\d+", "[number]", text)
print(cleaned)  # Output: Hello [number], this is a test [number]

Hello [number], this is a test [number]


In [None]:
# Replace URLs with [LINK]
text = "Visit https://example.com or http://another.net for more info."
replaced = re.sub(r"https?://\S+", "[LINK]", text)
print(replaced)

Visit [LINK] or [LINK] for more info.
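
The replacement does not have to be a fixed string. As a sketch: `re.sub` can reuse captured groups with `\1`, `\2`, ... in the replacement, or call a function on each match. The sample strings and the helper `mask_user` below are made up for illustration.

```python
import re

# 1) Reuse captured groups: turn YYYY-MM-DD into DD/MM/YYYY
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "Date: 2025-06-21")
print(reordered)  # Date: 21/06/2025

# 2) Use a function: keep the first letter of the user part, hide the rest
def mask_user(m):
    # m.group(1) is the part before @, m.group(2) the domain
    return m.group(1)[0] + "***@" + m.group(2)

masked = re.sub(r"(\w+)@([\w.]+)", mask_user, "Email: alice@example.com")
print(masked)  # Email: a***@example.com
```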


In [None]:
# Extract the name (first name and last name) and the age
text = "Name: John Doe, Age: 30"
match = re.search(r"Name:\s*(\w+\s\w+),\s*Age:\s*(\d+)", text)
if match:
    name = match.group(1)
    age = match.group(2)
    print(name, age)

John Doe 30
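
When a pattern has several groups, numbering them gets hard to read. One option is to name the groups with `(?P<name>...)` and read them back as a dictionary. A minimal sketch on the same kind of string (the names `name` and `age` are our own choice):

```python
import re

sample = "Name: John Doe, Age: 30"

# Same pattern as above, but each group has a name
pattern = r"Name:\s*(?P<name>\w+\s\w+),\s*Age:\s*(?P<age>\d+)"

m = re.search(pattern, sample)
record = m.groupdict() if m else {}

print(record)              # {'name': 'John Doe', 'age': '30'}
print(int(record["age"]))  # groups are always strings, convert if needed
```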


**Your turn**

In [None]:
#Create text
text = """
User: alice
Email: alice@example.com
Date: 2025-06-21

User: bob
Email: bob@domain.org
Date: 2025-06-20

User: peter
Email: peter@domain.org
Date: 2025-06-21
"""
# Write the user name, the email and the date on one line, separated by commas
# Start a new line after every user
# Bonus: export this as a CSV file

#### Search the Bible for sentences containing a specific word

In [None]:
import requests
bible = requests.get("https://raw.githubusercontent.com/mxw/grmr/master/src/finaltests/bible.txt").text

In [None]:
bible[0:100]

'1:1 In the beginning God created the heaven and the earth.\r\n\r\n1:2 And the earth was without form, an'

Size of the Bible in characters

In [None]:
len(bible)

4451368

Extract sentences containing a word

In [None]:
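# A "sentence" here is a run of characters without . ? or ! that contains the whole word "earth",
# followed by its closing punctuation. No flag is needed: the pattern uses neither ^ nor $.
r = re.findall(r"([^.?!]*\bearth\b[^.?!]*[.?!])", bible)
r[0:10]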

['1:1 In the beginning God created the heaven and the earth.',
 '\r\n\r\n1:2 And the earth was without form, and void; and darkness was upon\r\nthe face of the deep.',
 '\r\n\r\n1:11 And God said, Let the earth bring forth grass, the herb yielding\r\nseed, and the fruit tree yielding fruit after his kind, whose seed is\r\nin itself, upon the earth: and it was so.',
 '\r\n\r\n1:12 And the earth brought forth grass, and herb yielding seed after\r\nhis kind, and the tree yielding fruit, whose seed was in itself, after\r\nhis kind: and God saw that it was good.',
 '\r\n\r\n1:14 And God said, Let there be lights in the firmament of the heaven\r\nto divide the day from the night; and let them be for signs, and for\r\nseasons, and for days, and years: 1:15 And let them be for lights in\r\nthe firmament of the heaven to give light upon the earth: and it was\r\nso.',
 '\r\n\r\n1:17 And God set them in the firmament of the heaven to give light\r\nupon the earth, 1:18 And to rule over the day and
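
The matches above keep the line breaks (`\r\n`) of the original file. Here is a sketch of the same sentence pattern on a short made-up `sample`, followed by a whitespace clean-up step. The clean-up is our own addition, not part of the original cell.

```python
import re

# Hypothetical mini-text with the same layout as the file (verse numbers, \r\n breaks)
sample = (
    "1:1 God created the heaven and the earth.\r\n\r\n"
    "1:2 And the earth was\r\nwithout form. 1:3 Let there be light!"
)

# Same pattern as above: a sentence that contains the whole word "earth"
sentences = re.findall(r"[^.?!]*\bearth\b[^.?!]*[.?!]", sample)

# Collapse every run of whitespace (spaces, \r, \n) into a single space
cleaned = [" ".join(s.split()) for s in sentences]

for s in cleaned:
    print(s)
# 1:1 God created the heaven and the earth.
# 1:2 And the earth was without form.
```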

# **Your turn**
1. Find all instances of earth, regardless of capitalization.
2. Find the sentences that contain both earth and god.
3. Count the number of sentences with god, adam, snake.

In [None]:
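# Split into sentences first, then keep those with both whole words, in any order and any case
sentences = re.findall(r"[^.?!]+[.?!]", bible)
r = [s for s in sentences if re.search(r"\bearth\b", s, re.I) and re.search(r"\bgod\b", s, re.I)]
r[0:10]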