# Exercises for Regular Expressions

### Load Libraries

In [None]:
import pandas as pd
import re

Here's the match function, which uses `re.search()` to let you know if a regular expression matches some string.

In [None]:
def match(regex, string):
    print(string + ": " + str(re.search(regex, string) is not None))
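
As a quick illustration of what `match` relies on: `re.search()` returns a match object when the pattern occurs anywhere in the string, and `None` otherwise. The sketch below uses a made-up helper name, `contains`, which returns the boolean instead of printing it; it is not part of the exercises.

```python
import re

def contains(regex, string):
    # Hypothetical helper: True if the pattern occurs anywhere in the string.
    return re.search(regex, string) is not None

found = re.search(r"b.d", "abcde")    # the pattern occurs in the middle of the string
missing = re.search(r"xyz", "abcde")  # no occurrence, so re.search returns None

print(found.group())  # the substring that matched
print(missing)
```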

___
## 6.1: Intro to Regular Expressions

### Basics
All of these questions can be answered using only:
* Standard alphanumeric characters,
* The `.` wildcard,
* The `|` pipe, and
* Bracket `[]` matching.

For each of the segments below, write a regular expression that will match each of the strings *except* the ones indicated. See if you can verbally describe what needs to be matched.
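
For reference, here is a minimal sketch of the three special constructs listed above. The sample strings and variable names are made up for illustration and do not come from the exercises.

```python
import re

# "." matches any single character (except a newline).
dot = re.search(r"h.t", "hot") is not None
# "|" matches either the pattern on its left or the one on its right.
pipe = re.search(r"cat|dog", "hotdog") is not None
# "[...]" matches exactly one character from the listed set.
bracket = re.search(r"gr[ae]y", "grey") is not None
# A set stands in for a single position, so "graey" contains neither "gray" nor "grey".
no_hit = re.search(r"gr[ae]y", "graey") is not None

print(dot, pipe, bracket, no_hit)
```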

##### Part A.

In [None]:
# Change this code to include the correct regex.
regex = r"abc"

In [None]:
# Don't change the code in these cells!
print('regex = r"' + regex + '"')
match(regex, "abc")
match(regex, "abcdef")
match(regex, "abcdefghi")
match(regex, "   defghi") # Don't match this one!

regex = r"abc"
abc: True
abcdef: True
abcdefghi: True
   defghi: False


Predict: will your regex match the text below?

In [None]:
match(regex, "abracadabra") # It won't

abracadabra: False


##### Part B.

In [None]:
regex = r"apple"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "applesauce")
match(regex, "apple juice")
match(regex, "pineapple")
match(regex, "apfelstrudel") # Don't match this one!

regex = r"apple"
applesauce: True
apple juice: True
pineapple: True
apfelstrudel: False


Predict: will your regex match the text below?

In [None]:
match(regex, "snapple") # Yes

##### Part C.

Here, I want to match the letters "nt" with at least one letter before them.

In [None]:
regex = r".nt"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "ant")
match(regex, "ent")
match(regex, "runt")
match(regex, "nt") # Don't match this one!
match(regex, "t") # Don't match this one!

regex = r".nt"
ant: True
ent: True
runt: True
nt: False
t: False


Predict: will your regex match the text below?

In [None]:
match(regex, "I can't even.") # No

I can't even.: False


##### Part D.

In [None]:
regex = r"f..r"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "four")
match(regex, "fair")
match(regex, "fairy")
match(regex, "flower") # Don't match this one!
match(regex, "flour") # Don't match this one!

regex = r"f..r"
four: True
fair: True
fairy: True
flower: False
flour: False


Predict: will your regex match the text below?

In [None]:
match(regex, "affair")

affair: True


##### Part E.

Here, we're filtering out words with fewer than 3 letters.

In [None]:
regex = r"..."

In [None]:
print('regex = r"' + regex + '"')
match(regex, "This")
match(regex, "is") # Don't match this one!
match(regex, "an") # Don't match this one!
match(regex, "adorable")
match(regex, "cat")

regex = r"..."
This: True
is: False
an: False
adorable: True
cat: True


Predict: will your regex match the text below?

In [None]:
match(regex, "8 7") # Yes

8 7: True


##### Part F.

In [None]:
regex = r"abc|xyz"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "abc")
match(regex, "abcdef")
match(regex, "xyz")
match(regex, "uvwxyz")
match(regex, "defuvw") # Don't match this one!

regex = r"abc|xyz"
abc: True
abcdef: True
xyz: True
uvwxyz: True
defuvw: False


##### Part G.

In [None]:
regex = r"c.t|b.g"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "cat")
match(regex, "cot")
match(regex, "cut")
match(regex, "big")
match(regex, "bug")
match(regex, "bags")
match(regex, "court") # Don't match this one!
match(regex, "bungie") # Don't match this one!

regex = r"c.t|b.g"
cat: True
cot: True
cut: True
big: True
bug: True
bags: True
court: False
bungie: False


Predict: will your regex match the text below?

In [None]:
match(regex, "cbtg")

cbtg: True


##### Part H.

In [None]:
regex = r"[aou]"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "cat")
match(regex, "cet") # Don't match this one!
match(regex, "cit") # Don't match this one!
match(regex, "cot")
match(regex, "cut")

regex = r"[aou]"
cat: True
cet: False
cit: False
cot: True
cut: True


##### Part I.

This regex should match any sentence that contains at least one punctuation mark.

In [None]:
regex = r"[\.\?\[\]\(\)]"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "xyz.")
match(regex, "what?")
match(regex, "[this]")
match(regex, "(that)")
match(regex, "not this") # Don't match this one!
match(regex, "or that") # Don't match this one!

regex = r"[\.\?\[\]\(\)]"
xyz.: True
what?: True
[this]: True
(that): True
not this: False
or that: False


### Repetition and Shorthand

These questions add:
* The `*`, `+`, and `?` characters for indicating repetition, and
* Shorthand `\s`, `\w`, `\d` (and uppercase variants).
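
A short sketch of how these pieces behave, using `re.findall()` to show exactly what each pattern picks up. The sample strings and variable names are invented for illustration.

```python
import re

# "*" means zero or more, "+" means one or more, "?" means zero or one.
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")
optional = re.findall(r"colou?r", "color colour")

# \d is a digit, \w a word character, \s whitespace; the uppercase form negates each one.
digits = re.findall(r"\d+", "room 12, floor 3")
non_space = re.findall(r"\S+", "two  words")

print(star, plus, optional, digits, non_space)
```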

##### Part J.

In [None]:
regex = r"a+b+"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "aaaaaaaab")
match(regex, "aaaab")
match(regex, "aab")
match(regex, "ab")
match(regex, "bbbbb") # Don't match this one!
match(regex, "aa") # Don't match this one!
match(regex, "bbaa") # Don't match this one!

regex = r"a+b+"
aaaaaaaab: True
aaaab: True
aab: True
ab: True
bbbbb: False
aa: False
bbaa: False


Predict: will your regex match the text below?

In [None]:
match(regex, "aabbaa") # Yes

aabbaa: True


##### Part J, continued.

In [None]:
regex = r"a"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "aaaaaaaab")
match(regex, "aabbbb")
match(regex, "aaaaaaaaaa")
match(regex, "abbbbb")
match(regex, "bbbbb") # Don't match this one!

regex = r"a"
aaaaaaaab: True
aabbbb: True
aaaaaaaaaa: True
abbbbb: True
bbbbb: False


##### Part K.

In [None]:
regex = r"\w\d"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "A2")
match(regex, "C7")
match(regex, "d9")
match(regex, "q27")
match(regex, "7a") # Don't match this one!
match(regex, "a 3") # Don't match this one!

regex = r"\w\d"
A2: True
C7: True
d9: True
q27: True
7a: False
a 3: False


##### Part L. Course Rubrics.

Make a regular expression that matches any legal Binghamton course number. Each one consists of 2 to 4 letters, a space, and a 3-digit number.

In [None]:
regex = r"[A-Z][A-Z][A-Z]?[A-Z]? \d\d\d"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "HARP 210")
match(regex, "MATH 147")
match(regex, "HARP 325")
match(regex, "ANTH 200")
match(regex, "CS 140")
match(regex, "ABCDE 10") # Don't match this one!

regex = r"[A-Z][A-Z][A-Z]?[A-Z]? \d\d\d"
HARP 210: True
MATH 147: True
HARP 325: True
ANTH 200: True
CS 140: True
ABCDE 10: False


Predict: will your regex match the text below?

In [None]:
match(regex, "M4TH 147") # True

M4TH 147: True


##### Part M.

I want to match any string that contains the letters A, B, and C, in that order, ignoring case and letters in between. See the examples:

In [None]:
regex = r"abc|\s"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "abc")
match(regex, "AlaBaster Castle")
match(regex, "A b c D e f G")
match(regex, "CAndleaBra") # Don't match this one!

regex = r"abc|\s"
abc: True
AlaBaster Castle: True
A b c D e f G: True
CAndleaBra: False
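
Note that `abc|\s` passes these tests largely because two of the strings happen to contain whitespace. A sketch of a pattern closer to the stated goal, assuming bracket matching and `.*` are all that is needed (the name `ordered_abc` and the extra test string `"x y z"` are made up here):

```python
import re

# Either case of A, then anything, then either case of B, then anything, then either case of C.
ordered_abc = r"[Aa].*[Bb].*[Cc]"

samples = ["abc", "AlaBaster Castle", "A b c D e f G", "CAndleaBra", "x y z"]
results = {s: re.search(ordered_abc, s) is not None for s in samples}

for s, hit in results.items():
    print(s + ": " + str(hit))
```

The string `"x y z"` contains spaces but no A, B, or C, so it separates the two patterns: `abc|\s` matches it, while the ordered pattern does not.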


##### Part N.

In [None]:
regex = r"S.+S"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "SOS")
match(regex, "S fhsdfhsmnvs S")
match(regex, "SfdfS")
match(regex, "SS") # Don't match this one!

regex = r"S.+S"
SOS: True
S fhsdfhsmnvs S: True
SfdfS: True
SS: False


##### Part O. Email Addresses

The goal of this exercise is to match any string that contains a *valid email address*. Try and match the examples below, but not the user handle or the URL, both of which contain similar patterns.

In [None]:
regex = r"\w+@\w+\.\w+"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "frances3@binghamton.edu")
match(regex, "mike.chang@gmail.com")
match(regex, "george_sanders@cambridge.co.uk")
match(regex, "Please forward email to empire1@hotmail.com")
match(regex, "Follow me on Instagram @kitchen.xyz") # Don't match this one!
match(regex, "http://www.binghamton.edu/") # Don't match this one!

regex = r"\w+@\w+\.\w+"
frances3@binghamton.edu: True
mike.chang@gmail.com: True
george_sanders@cambridge.co.uk: True
Please forward email to empire1@hotmail.com: True
Follow me on Instagram @kitchen.xyz: False
http://www.binghamton.edu/: False


##### Part P. Phone Numbers

Another useful application: does this string contain a phone number? For simplicity, assume all phone numbers contain 10 digits, grouped into sets of 3, 3, and 4.

In [None]:
regex = r"\d\d\d\D?\d\d\d\D?\d\d\d\d"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "555-217-3242")
match(regex, "5557621430")
match(regex, "555 213 4217")
match(regex, "(555)213-4217")
match(regex, "555-21-3333") # Don't match this one!
match(regex, "55-321-333") # Don't match this one!

regex = r"\d\d\d\D?\d\d\d\D?\d\d\d\d"
555-217-3242: True
5557621430: True
555 213 4217: True
(555)213-4217: True
555-21-3333: False
55-321-333: False


### Anchors

This final set of questions includes the anchor characters `^`, `$`, and `\b`.
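
A brief sketch of what each anchor does, using invented sample strings and variable names:

```python
import re

# "^" pins the match to the start of the string.
starts = re.search(r"^cat", "catalog") is not None
starts_mid = re.search(r"^cat", "bobcat") is not None
# "$" pins the match to the end of the string.
ends = re.search(r"cat$", "bobcat") is not None
# "\b" marks a word boundary, so "cat" must stand alone as a word.
whole_word = re.search(r"\bcat\b", "the cat sat") is not None
inside_word = re.search(r"\bcat\b", "concatenate") is not None

print(starts, starts_mid, ends, whole_word, inside_word)
```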

##### Part Q.
Write a regular expression that matches any string that starts with an uppercase letter.

In [None]:
regex = r"^[A-Z]"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "Green Eggs and Ham")
match(regex, "AWESOME")
match(regex, "555") # Don't match this one!
match(regex, "green eggs") # Don't match this one!

regex = r"^[A-Z]"
Green Eggs and Ham: True
AWESOME: True
555: False
green eggs: False


##### Part R. Email Addresses

Modify your email address regex so that the *entire string* must be a single, valid email address.

In [None]:
regex = r"^\S+@\S+\.\S+$"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "frances3@binghamton.edu")
match(regex, "mike.chang@gmail.com is my address") # Don't match this one!
match(regex, "george_sanders@cambridge.co.uk")
match(regex, "Please forward email to empire1@hotmail.com") # Don't match this one!
match(regex, "Follow me on Instagram @kitchen.xyz") # Don't match this one!
match(regex, "http://www.binghamton.edu/") # Don't match this one!

regex = r"^\S+@\S+\.\S+$"
frances3@binghamton.edu: True
mike.chang@gmail.com is my address: False
george_sanders@cambridge.co.uk: True
Please forward email to empire1@hotmail.com: False
Follow me on Instagram @kitchen.xyz: False
http://www.binghamton.edu/: False


##### Part S.

Match any string which starts with an open parenthesis, *(*, and ends with a closing parenthesis, *)*.

In [None]:
regex = r"^\(.*\)$"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "(yes)")
match(regex, "()")
match(regex, "(((()")
match(regex, "Yes (that is correct)") # Don't match this one!
match(regex, ")(") # Don't match this one!

regex = r"^\(.*\)$"
(yes): True
(): True
((((): True
Yes (that is correct): False
)(: False


##### Part T.

Match any string which contains a word beginning with the prefix anti-.

In [None]:
regex = r"\b[Aa][Nn][Tt][Ii]"

In [None]:
print('regex = r"' + regex + '"')
match(regex, "ANTIproton")
match(regex, "Antinomian Heresiarch")
match(regex, "The element antimony is atomic number 51.")
match(regex, "Yes I am grANTIng you permission") # Don't match this one!
match(regex, "Enchanting mercantile quarantine chianti") # Don't match this one!

regex = r"\b[Aa][Nn][Tt][Ii]"
ANTIproton: True
Antinomian Heresiarch: True
The element antimony is atomic number 51.: True
Yes I am grANTIng you permission: False
Enchanting mercantile quarantine chianti: False


___

## 6.2: Extracting from Strings

##### Part A. Extracting Email Addresses

Modify your email address regex, then use `re.findall()` to get a list of all the email addresses in the string below.

In [None]:
regex = r"[\w.]+@\w+\.[\w.]*\w"

In [None]:
re.findall(regex, """
You can contact me at dfaisel@binghamton.edu, or at my personal email account
daltonfaisel@hotmail.co.uk. The other faculty in my department are:
- Jason Sandoval (jsandoval@binghamton.edu)
- Elyse Spiegel (elyse@math.binghamton.edu)
- Gong Yoo (gyoo@mail.naver.com)
""")

['dfaisel@binghamton.edu',
 'daltonfaisel@hotmail.co.uk',
 'jsandoval@binghamton.edu',
 'elyse@math.binghamton.edu',
 'gyoo@mail.naver.com']

##### Part B. Capitalized Text

Write a regular expression to extract all words that are written IN ALL CAPITAL LETTERS.
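
One possible approach, offered as a sketch rather than the intended answer: require a word boundary on both sides of a run of uppercase letters. The names `caps_regex`, `letter`, and `shouting` are made up for this example. Note that this pattern would also pick up one-letter words such as "I" or "A".

```python
import re

# A run of uppercase letters with a word boundary on each side.
caps_regex = r"\b[A-Z]+\b"

letter = """
TO WHOM IT MAY CONCERN,

My friend and myself recently purchased sandwiches from your shop and we were
DISGUSTED to find that it was COVERED IN MOLD. We DEMAND that you return
our money immediately.

Sincerely,
Two very ANGRY customers.
"""

shouting = re.findall(caps_regex, letter)
print(shouting)
```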

In [None]:
regex = r""

In [None]:
re.findall(regex, """
TO WHOM IT MAY CONCERN,

My friend and myself recently purchased sandwiches from your shop and we were
DISGUSTED to find that it was COVERED IN MOLD. We DEMAND that you return
our money immediately.

Sincerely,
Two very ANGRY customers.
""")

##### Part C. Sentence Tokenizer
Write a regular expression to tokenize a string, splitting it into its individual words. The first few sentences shouldn't be too bad, but the *last one* is quite a challenge to tokenize properly! As far as I can tell, doing it perfectly requires a [non-capturing group](https://www.pythontutorial.net/python-regex/python-regex-non-capturing-group/), but you can do okay without one.

In [None]:
regex = r""

In [None]:
re.findall(regex,"The rain in Spain stays mainly in the plains.")

In [None]:
re.findall(regex,"What? Haven't you been to New York?")

In [None]:
re.findall(regex,"No, I'dn't've thought of that!")

In [None]:
re.findall(regex,"Say 'I'd love to have dinner' if she asks.")
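
A sketch of one way to approach this, assuming a "word" is a run of word characters that may contain apostrophes in the middle but not at either end. The non-capturing group `(?:...)` keeps `re.findall()` returning whole tokens instead of group contents. The name `token_regex` is made up for this example.

```python
import re

# A word, followed by any number of apostrophe-plus-word pieces (e.g. I + 'dn + 't + 've).
token_regex = r"\w+(?:'\w+)*"

plain = re.findall(token_regex, "The rain in Spain stays mainly in the plains.")
contraction = re.findall(token_regex, "What? Haven't you been to New York?")
stacked = re.findall(token_regex, "No, I'dn't've thought of that!")
quoted = re.findall(token_regex, "Say 'I'd love to have dinner' if she asks.")

for tokens in (plain, contraction, stacked, quoted):
    print(tokens)
```

Because an apostrophe only counts when it sits between word characters, the quotation marks around `'I'd love to have dinner'` are dropped while the apostrophe inside `I'd` is kept.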

##### Part D. Footnotes

You're working as part of a team making a program which converts plaintext research papers into PDF using a scripting language, like $\LaTeX$. The way the software works, footnotes are written {enclosed in curly braces} right where the footnote mark$^1$ should be. Write a regex to extract all of the footnotes, but don't include the curly braces themselves.

$^1$ Like this. Here is the source of [this text](https://arxiv.org/abs/2211.00651).
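
A possible sketch, assuming footnotes never contain a nested closing brace: when the pattern has exactly one capturing group, `re.findall()` returns only the text captured by that group, which leaves the braces out. The names `footnote_regex` and `sample` are made up, and the sample is a shortened version of the abstract.

```python
import re

# Escaped braces match literally; the group captures everything up to the next "}".
footnote_regex = r"\{([^}]*)\}"

sample = (
    "We present Hubble Space Telescope{Yes, Hubble's still up there} imaging of "
    "galaxies in the field {not in clusters} at distances of 25-36 Mpc.{surprisingly close}"
)

footnotes = re.findall(footnote_regex, sample)
print(footnotes)
```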

In [None]:
regex = r""

In [None]:
re.findall(regex, """
We present Hubble Space Telescope{Yes, Hubble's still up there} imaging of
14 gas-rich, low surface brightness and ultra-diffuse galaxies (UDGs)
{galaxies contain billions of stars and lots of dark matter} in the field
{not in clusters} at distances of 25-36 Mpc.{surprisingly close} An inspection
of point-like sources {we don't know what they are} brighter than the turnover
magnitude of the globular cluster luminosity function {the what?} and within
twice the half-light radii of each galaxy reveals that...
""")