# Lab Week 1: Regular Expressions

The aim of this lab is to experiment with regular expressions and learn to use them in Python. To this end, you will use the `re` module: https://docs.python.org/3/library/re.html. See https://en.wikipedia.org/wiki/Regular_expression for basic background on regular expressions.

Useful tutorials for playing with regular expressions include:
* https://regexr.com/
* https://regex101.com/
* https://www.w3schools.com/python/python_regex.asp

## Simple Example

The following code uses the `re` module to perform some basic pattern matching.

In [4]:
import re

a_string = "Hello world!"
m = re.search("world!", a_string)
if m:
    print("world! found in", a_string)
else:
    print("world! not found in", a_string)

world! found in Hello world!


In [None]:
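# What you can't do using regular expressions:
lots_of_parentheses = "(((abc))"
# Can we check whether the parentheses are balanced?
# We could write some Python code to do this ...
# Can we use regular expressions? No! A pattern cannot count how many "("
# it has matched, so it cannot demand the same number of ")".
# The second group below is only a placeholder for the part that cannot be written.
re.search(r"(\(\()abc(<trying to match the same number of closing parentheses...>)", lots_of_parentheses)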
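
The comment above notes that plain Python code can do what the pattern cannot. A minimal sketch of such a check follows; the function name `is_balanced` is an illustrative choice and not part of the lab material.

```python
def is_balanced(text):
    """Return True if every "(" in text is closed by a later ")"."""
    depth = 0  # number of currently open parentheses
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # a ")" appeared before its matching "("
                return False
    return depth == 0


print(is_balanced("(((abc))"))  # False: one "(" is never closed
print(is_balanced("((abc))"))   # True
```

The counter `depth` is exactly the piece of state that a regular expression in the `re` module has no way to keep.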

In [5]:
# use anchoring at the beginning or end of the string
m = re.search("^world!", a_string)
if m:
    print("world! found at beginning of", a_string)
else:
    print("world! not found at beginning of", a_string)
    
m = re.search("world!$", a_string)
if m:
    print("world! found at end of", a_string)
else:
    print("world! not found at end of", a_string)

world! not found at beginning of Hello world!
world! found at end of Hello world!


In [6]:
# use single-character wildcard "."
m = re.search("world.", a_string)
if m:
    print("world. found in", a_string)
else:
    print("world. not found in", a_string)

world. found in Hello world!


In [29]:
# use zero or more repetitions of wildcard "."
m = re.search("^(.)*world!", a_string)
if m:
    print(".*world! found at beginning of", a_string)
else:
    print(".*world! not found at beginning of", a_string)

.*world! found at beginning of Hello world!


## Tasks
The following code snippets require you to fill in details as specified in the comments.

In [19]:
# "re" offers both .search and .match functions - what is the difference?
search_result = re.search("world", a_string)
match_result = re.match("world", a_string)
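
One way to explore this question is to print both results side by side. The sketch below repeats the definition of `a_string` so that it runs on its own; the variable `match_hello` is added purely for comparison.

```python
import re

a_string = "Hello world!"

# search scans the whole string for the first position where the pattern fits
search_result = re.search("world", a_string)
# match only tries the pattern at the very beginning of the string
match_result = re.match("world", a_string)
# a pattern that does fit at the beginning, for comparison
match_hello = re.match("Hello", a_string)

print(search_result)  # a match object covering positions 6 to 11
print(match_result)   # None
print(match_hello)    # a match object covering positions 0 to 5
```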

In [20]:
# "*" matches zero or more repetitions. How is "+" different?
m1 = re.search("z*", a_string)
m2 = re.search("z+", a_string)

# One can also match a particular number of repetitions - what is the result?
one_or_two_l = re.search("l{1,2}", a_string)
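
To check your answers, it can help to look at what each match object actually contains. The following is a small sketch that reuses the names from the cell above and redefines `a_string` so it is self-contained.

```python
import re

a_string = "Hello world!"

m1 = re.search("z*", a_string)  # zero repetitions are allowed
m2 = re.search("z+", a_string)  # at least one "z" is required
one_or_two_l = re.search("l{1,2}", a_string)

print(repr(m1.group(0)), m1.span())                # '' (0, 0): an empty match at the start
print(m2)                                          # None: there is no "z" in the string
print(one_or_two_l.group(0), one_or_two_l.span())  # ll (2, 4)
```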

In [21]:
# "." matches any character,
# but often more specific sets or classes of characters should be matched
five_characters = re.search("[a-z]{5}!", a_string)

In [22]:
# When trying to match a character that is also a special character in
# regular expressions, escaping is required. A raw string (r"...") keeps
# Python from interpreting the backslash itself:
contains_dot = re.search(r"\.", a_string)

In [30]:
# Refer to matching groups
a_group = re.search("(.*) world!", a_string)
print("Matched group:", a_group.group(1))
# What would a_group.group(0) refer to?
# What happens if you try to access a_group.group(2)?

# Use groups to extract all the domain names from a list of email addresses:
addresses = ["foo.bar@qmul.ac.uk", "alpha.bravo@se21.qmul.ac.uk",
             "first.middle.last@se18.qmul.ac.uk"]
# Iterate over the list and use a regular expression to extract each domain name,
# i.e., the part after the "@" character.

# As a further exercise, try to extract the first and last name from each of the above addresses.

Matched group: Hello


In [24]:
# Limiting matching of repetition
comma_separated_values = "foo,bar,baz"
# What do groups 1 and 2 refer to in the following?
more_groups = re.search("(.*),(.*)", comma_separated_values)
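
The cell title refers to limiting how much a repetition matches. One common way to do this, shown here as a sketch, is the non-greedy quantifier `*?`; the names `greedy` and `lazy` are illustrative and not part of the lab material.

```python
import re

comma_separated_values = "foo,bar,baz"

# Greedy: the first ".*" takes as much as it can while still leaving a comma
greedy = re.search("(.*),(.*)", comma_separated_values)
print(greedy.groups())  # ('foo,bar', 'baz')

# Non-greedy: ".*?" takes as little as it can, so it stops at the first comma
lazy = re.search("(.*?),(.*)", comma_separated_values)
print(lazy.groups())    # ('foo', 'bar,baz')
```

An alternative with a similar effect is a character class that excludes the separator, such as `[^,]*`.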

In [28]:
# .sub enables replacing text - what is the effect of the following?
replaced = re.sub("[aeiou]", "V", comma_separated_values)

The following is a task that requires using regular expressions as part of text analysis:

1) Find or write a piece of text (a paragraph will do).
2) In this text, find all words that start with a vowel.
3) In this text, find all words that have at most 5 characters.
4) Build a bar chart that depicts the distribution of words with fewer than three, three, four, five, and more than five characters.
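
As a possible starting point, the steps above can be sketched as follows. The sample sentence, the helper `bucket`, and the simplification that a word is a run of ASCII letters are all assumptions made for illustration. The chart is printed as text so the sketch has no dependencies beyond the standard library; a plotting library such as matplotlib could draw the same counts.

```python
import re
from collections import Counter

# 1) A short sample text (made up for this sketch)
text = "An owl and an eagle inspected every unusual tree in the old orchard."

# Assumption: a "word" is a run of ASCII letters
words = re.findall(r"[A-Za-z]+", text)

# 2) Words that start with a vowel
vowel_words = re.findall(r"\b[aeiouAEIOU][A-Za-z]*", text)

# 3) Words with at most 5 characters
short_words = re.findall(r"\b[A-Za-z]{1,5}\b", text)


def bucket(word):
    """Map a word to its length category."""
    n = len(word)
    if n < 3:
        return "<3"
    if n > 5:
        return ">5"
    return str(n)


# 4) Count the words per category and print a simple text bar chart
counts = Counter(bucket(w) for w in words)
labels = ["<3", "3", "4", "5", ">5"]
for label in labels:
    print(f"{label:>2} | {'#' * counts[label]}")
```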