# Regular Expressions Lab

Regular expressions (aka regexes or regexps) are text-matching patterns described with a formal syntax; they were covered in Lecture 3. This lab covers two main topics:
- Basic Python Capabilities for Using Regular Expressions
- How to use regular expressions for a variety of tasks

## Basic Python Capabilities for Regular Expressions

Regular expressions are implemented in the `re` package, which provides a number of functions to match regexps.


In [29]:
import re

# Text to parse
text = 'This is a string with term1.'

# re.search() returns a Match object (truthy) or None (falsy), so it can be used directly in an if
if re.search("term[0-4]", text):
    print('Found first term.')
if re.search("term[5-9]", text):
    print('Found second term.')

Found first term.


Now we've seen that `re.search()` takes the pattern, scans the text, and returns a **Match** object corresponding to the first match. If no match is found, **None** is returned. A Match object is always truthy and None is falsy, so any successful match (even a match on the empty string from an ill-advised Kleene \*) evaluates to `True` in a Boolean context, while a failed search evaluates to `False`. To give a clearer picture of this match object, check out the cell below:

In [30]:
# returns the first matching sequence it finds; here the first digit sequence is 3225, so that is what is returned
match = re.search(r"\d+", "This is the COMP3225 module's 1st lab")

# the % operator substitutes the values in the tuple into the %s placeholders of the format string
print("The regexp matched '%s' between positions %s and %s" % (match.group(0), match.start(), match.end()))

The regexp matched '3225' between positions 16 and 20
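
The truthiness rule described above can be checked directly. The snippet below is a small sketch: the pattern `a*` and the variable names are just illustrations, chosen because `a*` happily matches the empty string.

```python
import re

# A successful match is truthy even when the matched text is empty.
empty_match = re.search(r"a*", "bbb")
print(repr(empty_match.group(0)))   # the matched text is the empty string
print(bool(empty_match))            # a Match object is always truthy

# A failed search returns None, which is falsy.
no_match = re.search(r"term[5-9]", "This is a string with term1.")
print(no_match is None)
print(bool(no_match))
```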


If you used capture groups in the regular expression, they are available as arguments `1` up to `99` of the `match.group()` method; `match.groups()` returns all of them together as a tuple.

In [31]:
# returns the first match: a capture group of one or more digits, followed by any number of any characters, up to a second capture group of one or more digits
match = re.search(r"(\d+).*(\d+)", "This is the COMP3225 module's 1st lab")

# group(0) would return the entire matched text; group(1) and group(2) return the two capture groups
print("The regexp matched '%s' and '%s'" % (match.group(1), match.group(2)))

The regexp matched '3225' and '1'
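
To make the difference between `group(0)`, `group(n)` and `groups()` concrete, here is a short sketch using the same pattern and text as the cell above (the variable names `whole`, `first` and `both` are mine):

```python
import re

match = re.search(r"(\d+).*(\d+)", "This is the COMP3225 module's 1st lab")

whole = match.group(0)   # the entire text matched by the pattern
first = match.group(1)   # the first capture group only
both = match.groups()    # a tuple holding every capture group

print(whole)
print(first)
print(both)
```

Note that the greedy `.*` swallows as much as it can, which is why the second group ends up holding only the last digit in the string.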


There are three variants of the search function:
- `re.match()` is anchored at the beginning of the search string
- `re.fullmatch()` is anchored at the beginning and the end of the search string
- `re.findall()` returns all matches
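
As a quick illustration of how the anchoring differs, the sketch below runs `re.search()`, `re.match()` and `re.fullmatch()` on one made-up string:

```python
import re

text = "COMP3225 lab"

found_anywhere = re.search(r"\d+", text)             # digits may appear anywhere
not_at_start = re.match(r"\d+", text)                # None: the string does not start with a digit
at_start = re.match(r"[A-Z]+", text)                 # matches 'COMP' at the beginning
not_whole = re.fullmatch(r"[A-Z]+", text)            # None: the pattern must cover the whole string
whole = re.fullmatch(r"[A-Z]+\d+ lab", text)         # covers the string from start to end

print(found_anywhere, not_at_start, at_start, not_whole, whole, sep="\n")
```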

You can also look for all the matches in a string with `re.findall()`, but it returns a list of the actual strings matched rather than a `Match` object.

In [32]:
match = re.findall(r"\d+", "This is the COMP3225 module's 1st lab")

print("The regexp matched '%s'" % (match))

The regexp matched '['3225', '1']'


To embed this in a _file-read-and-match-print-results_ code fragment that works line by line, we can do the following:

In [33]:
fname="mytest.txt"

# read the file one line at a time; each line is bound to the loop variable in turn
# (the with-block closes the file when the loop finishes)
with open(fname, "r") as f:
    for line in f:
        match = re.findall(r"\d+", line)
        if match:
            print(match)

FileNotFoundError: [Errno 2] No such file or directory: 'mytest.txt'
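
The `FileNotFoundError` above simply means there is no `mytest.txt` in the working directory. If you want to try the loop without creating a test file by hand, one option is to write a throwaway file first. This is only a sketch: the file contents are made up, and `tempfile` is used purely so the example cleans up after itself.

```python
import os
import re
import tempfile

# Create a small throwaway file so the loop has something to read (contents are made up).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("COMP3225 lab 1\nno digits here\nroom 32, level 4\n")
    fname = tmp.name

results = []
with open(fname, "r", encoding="utf-8") as handle:
    for line in handle:                      # one line per iteration
        numbers = re.findall(r"\d+", line)   # all digit sequences on this line
        if numbers:
            results.append(numbers)

os.remove(fname)                             # tidy up the throwaway file
print(results)
```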

Finally, there are four useful flags you can use to turn on different features in Python's implementation of regular expressions.
- VERBOSE. Allow inline comments and extra whitespace.
- IGNORECASE. Do case-insensitive matches.
- DOTALL. Allow dot (.) to match any character, including a newline. (The default behavior of dot is to match anything, except for a newline.)
- MULTILINE. Allow anchors (^ and $) to match the beginnings and ends of lines instead of matching the beginning and end of the whole text.

They can be provided as flags to the re methods
- re.match("this", "This", flags=re.IGNORECASE)

or as abbreviated flags
- re.match("that", "That", flags=re.I)

or as inline flags in the regular expression itself
- re.match("(?i)those", "Those")
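
To see what each of the four flags changes, here is a small sketch on a made-up two-line string. Flags can be combined with `|`, as the last example shows.

```python
import re

text = "first line\nSecond LINE"

# IGNORECASE: 'line' also matches 'LINE'
ignorecase = re.findall(r"line", text, flags=re.IGNORECASE)

# DOTALL: '.' also matches the newline, so the match can span both lines
without_dotall = re.search(r"first.*LINE", text)
with_dotall = re.search(r"first.*LINE", text, flags=re.DOTALL)

# MULTILINE: '^' matches at the start of every line, not just the start of the text
multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# VERBOSE: whitespace and # comments inside the pattern are ignored
verbose = re.findall(r"""
    ^ (\w+)     # first word of each line
    \s+
    (\w+) $     # last word of each line
    """, text, flags=re.VERBOSE | re.MULTILINE)

print(ignorecase, without_dotall, with_dotall, multiline, verbose, sep="\n")
```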


## Practicing Regular Expressions

You should now have a good enough understanding of how to use the regular expressions module in Python to undertake the rest of the tasks in the lab. You should refer to the full [Python regular expression documentation](https://docs.python.org/3/library/re.html)  for more detail.

Note that so far we haven't needed to use multiline or raw Python strings, or verbose regular expressions. As things get more complex, you might need to.

The following six tasks consist of three skills tasks and three application tasks. You should attempt all the skills tasks (1-3) to your satisfaction, and choose any of the application tasks (4-6) for general practice in using regular expressions to solve problems that require complex lexical analysis.

Please note: there are no right answers set for these tasks - no specific precision or recall targets that you have to meet. The intention is that you focus on practicing the use of regular expressions, and gain experience with applying different kinds of "patterns" to a range of situations.

## Task 1 - recognising specific lexical tokens
Recognise the following particular kinds of non-word lexical tokens, using your experience to define the allowable format of each token type. Create your own test data files.

1. Hashtags (which can contain alphanumeric characters, underscores and hyphens). Test on the text of some tweets that you copy and paste from twitter.com.
1. UK Postcodes like SO17 1BJ or E1 8BN.
1. UK style phone numbers like 023 80591234 or 02380 590000 or +44 (0) 23 80594479 or mobile numbers like 07722 175921.
1. URLs http://google.com/ or https://a.b.net/c/d/e.php#fragment.
1. Email addresses like lac@ecs.soton.ac.uk, and decompose into username and internet domain.

In [34]:
#1: Hashtags

# a hashtag is a hash character followed by one or more non-space characters
hashtagMatch = re.search(r"#\S+","Having a lovely evening with my friends in #wetherspoons xxx")
#hashtagMatch = re.match(r"^#\S+$", "Having a lovely evening with my friends in #wetherspoons xxx")
if hashtagMatch:
    # group 0 is the entirety of the text that was matched
    print("match successful: %s" % hashtagMatch.group(0))

match successful: #wetherspoons


In [35]:
#2: UK Postcodes like SO17 1BJ or E1 8BN

# also tested using findall
postcodes = ["GU21 7RN", "GU22 7UR", "SO17 2HR", "W13 0EB"]
findAllStr = "GU21 7RN  GU22 7UR  SO17 2HR   W13 0EB"


# one or two letters, one or two digits (so that E1 8BN is accepted too), whitespace, then a digit and two letters
regex = r"[A-Z]{1,2}\d{1,2}\s+\d[A-Z]{2}"

for postcode in postcodes:
    print("Postcode matched: %s" % re.search(regex, postcode).group(0))

print("now going to test findall")
print(re.findall(regex, findAllStr))

Postcode matched: GU21 7RN
Postcode matched: GU22 7UR
Postcode matched: SO17 2HR
Postcode matched: W13 0EB
now going to test findall
['GU21 7RN', 'GU22 7UR', 'SO17 2HR', 'W13 0EB']


In [36]:
#3 UK style phone numbers like 023 80591234 02380590000 +44 (0) 23 80594479 or mobile numbers like 07722175921

#Not clear what the hard and fast rule for the format of UK phone numbers is?
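
One way to get started on the phone numbers, without trying to encode the full UK numbering plan, is to cover just the example formats listed in the task. The sketch below rests on an assumption of mine, not an official rule: a number is either `+44` with an optional `(0)`, or a leading `0`, followed by 9 or 10 more digits with optional single spaces between them.

```python
import re

# Assumed simplification (not the official numbering plan):
# "+44" with an optional "(0)", or a leading "0", then 9 or 10 digits with optional single spaces.
PHONE = re.compile(r"(?:\+44\s*(?:\(0\)\s*)?|0)\d(?:\s?\d){8,9}")

samples = [
    "023 80591234",
    "02380 590000",
    "+44 (0) 23 80594479",
    "07722 175921",
    "+44(0)2380594479",
    "12345",        # too short and no leading 0 or +44
    "023 8059",     # too few digits
]

matched = [s for s in samples if PHONE.fullmatch(s)]
print(matched)
```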

In [37]:
#4 URLS

# needs at least one name "dot" something, e.g. google.com, followed by any number of alphanumeric strings
# separated by /, . or #; a trailing slash or space is not part of the match
regex = r"https?://[a-z0-9]+\.[a-z]+(?:[#/\.][a-z0-9]+)*"
urls = ["http://google.com/ ", "https://a.b.net/c/d/e.php#fragment "]
findAllStr = "http://google.com/ https://a.b.net/c/d/e.php#fragment"
print("going to test findall: ", re.findall(regex, findAllStr))

going to test findall:  ['http://google.com', 'https://a.b.net/c/d/e.php#fragment']


In [38]:
# Email addresses

# put the username and domain in capture groups so that we can access them
# escape the dot character, as we want to match a literal dot
emailMatch = re.search(r"([a-z0-9]+)@([a-z]+(\.[a-z]+)+)", "gg2g17@soton.ac.uk")
if emailMatch:
    print("Email match successful, username: %s, domain: %s" % (emailMatch.group(1), emailMatch.group(2)))

Email match successful, username: gg2g17, domain: soton.ac.uk


If that was too easy for you, look up the rules for the officially allowable formats of each token type in Wikipedia. Here for example, are the official UK postcode formats:

| Format | Coverage | Example |
| -------: | :--------: | :------- |
| AA9A 9AA | WC postcode area; EC1–EC4, NW1W, SE1P, SW1 | EC1A 1BB |
| A9A 9AA | E1, N1, W1 | W1A 0AX |
| A9 9AA | B, E, G, L, M, N, S, W | M1 1AE |
| A99 9AA | " | B33 8TH |
| AA9 9AA | All other postcodes | CR2 6XH |
| AA99 9AA | " | DN55 1PT |
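
As a rough sketch, the six formats in the table can be folded into one pattern: one or two letters, a digit, an optional further digit or letter, whitespace, then a digit and two letters. Be aware this is looser than the table, since it checks only the shape and ignores the Coverage column (it would accept an `A9A` outward code for any area, for instance).

```python
import re

# Shape-only check: AA9A, A9A, A9, A99, AA9 or AA99, then whitespace, then 9AA.
POSTCODE = re.compile(r"[A-Z]{1,2}\d(?:\d|[A-Z])?\s+\d[A-Z]{2}")

examples = ["EC1A 1BB", "W1A 0AX", "M1 1AE", "B33 8TH", "CR2 6XH", "DN55 1PT"]
not_postcodes = ["1AB 2CD", "ABC1 2DE", "SO17"]

valid = [p for p in examples if POSTCODE.fullmatch(p)]
rejected = [p for p in not_postcodes if not POSTCODE.fullmatch(p)]
print(valid)
print(rejected)
```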

The rules for email addresses are eye-watering!

In [1]:
#examples of useful os commands
import os

In [2]:
os.getcwd()

'/home/george/Documents/GitHub/COMP3225-NLP/src/lab1regex'

In [60]:
os.chdir("Colab")

In [5]:
os.listdir(rootDir)

['uk-weather-met-office-warns-of-storms-and-exceptional-rainfall.txt',
 'richie-mawson-was-a-beloved-dad-and-liverpool-fan-did-a-late-lockdown-cost-him-his-life.txt',
 'malta-may-demand-return-of-fossil-given-to-prince-george-by-david-attenborough.txt',
 'face-mask-rules-more-political-than-scientific-says-expert.txt',
 'stories-of-jobseekers-show-true-impact-of-covid-19-on-employment.txt',
 'winchester-school-bus-crash-children-seriously-injured.txt',
 'former-immigration-minister-criticises-use-of-barracks-to-house-asylum-seekers.txt',
 'mixed-reactions-to-crowds-at-bournemouth.txt',
 'ex-paratrooper-jumps-40m-from-helicopter-for-charity.txt',
 'end-the-prejudice-against-travellers-police-chief.txt',
 'oil-tanker-stormed-by-sbs-was-denied-port-access-by-france-and-spain.txt',
 'weatherwatch-march-blizzard-england-surprise-1970.txt',
 'authorities-failed-to-protect-hampshire-girl-lucy-mchugh-stabbed-to-death-by-abuser-review-finds.txt',
 'woman-held-after-body-of-newborn-baby-found-ne

In [4]:
rootDir = "guardian/ARTICLES.d/"

## Task 2 - numeric tokens

The zip file [guardian.zip](https://secure.ecs.soton.ac.uk/notes/comp3225/labs/guardian.zip) contains the text extracted from 118 Guardian news stories from the last year about Southampton, Portsmouth and Winchester. (One story per file, but you might like to combine them into a single file for ease of processing.) Starting from the regexp tokenizer example in Fig 2.12 (reproduced below), extend the set of tokens recognised to capture the following types of numeric data.
- Numbers: -12  47.2  74,832,101
- Time: 09:17pm
- Money: £27.8m £8bn
- Length: 6ft 48cm
- Phone: +44(0)2380594479
- Age specification: 13-year-old
- Percentage: 14.4%
- Temperature: 28C
- Ordinals: 48th 22nd 1st

You can compare your results with the [list of numeric tokens that I managed to extract](https://secure.ecs.soton.ac.uk/notes/comp3225/labs/guardian-numerics.txt). What’s the biggest financial quantity that appears in these stories? What did it relate to? What are the most common numeric tokens, and why do they appear?

In [62]:
# the money pattern works on its own, but not yet as part of the larger regex

res = re.search(r"£\d+(\.\d{1,3})?(?:m|k|bn)", "fasdfasfas£27.8bndfghdhgd")
print(res)

<re.Match object; span=(10, 17), match='£27.8bn'>


In [65]:
import nltk

text='That U.S.A. poster-print costs $12.40... -1,247.2 -1,111.233 09:17am £27.8m'

#verbose regular expressions means whitespace is ignored unless escaped or in a [] group as shown below
#and also allows you to add hash comments as shown below

# The following pattern is reproduced from the textbook figure 2.12.
# UNFORTUNATELY, the behaviour of NLTK has changed since version 3.1,
# so that capture groups don't work any more and every set
# of grouping parentheses () has now to explicitly declare non-capturing semantics with ?:
pattern = r'''(?x)			# set flag to allow verbose regexps 
	 
     # money pattern, left commented out: its capturing group (\.\d{1,3})? would need to become (?:\.\d{1,3})?
     #£\d+(\.\d{1,3})?(?:m|k|bn)
     
     #time is easy!
     \d{2}:\d{2}[ap]m 
     
     | (?:[A-Z]\.)+			# abbreviations, e.g. U.S.A. 
	 | \w+(?:-\w+)*			# words with optional internal hyphens 
	 | \$?\d+(?:\.\d+)?%?	# currency and percentages, e.g. $12.40, 82% 
	 | \.\.\.				# ellipsis 
	 | [][.,;"'?():-_`]		# these are separate punctuation tokens; includes ], [ 
     
     #numbers that comma-separate the thousands and also allow for optional decimals
     | -?\d{1,3}(?=\D)(?:,\d{3})*(?:\.\d+)? 
     
     #numbers that omit commas and allow for optional decimals
     | -?\d+(?:\.\d+)?
     
     
 
     
	 '''

# regexp_tokenize returns the successive non-overlapping matches of the pattern as the list of tokens
tokens=nltk.regexp_tokenize(text, pattern)
print(tokens)

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...', '-1,247.2', '-1,111.233', '09:17am', '27', '.', '8m']
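
Notice that `£27.8m` came out as three tokens (`'27'`, `'.'`, `'8m'`) because the money alternative is commented out. The likely reason it misbehaves inside the larger pattern is the capturing group around the decimal part, which is exactly the problem the comment in the cell warns about. The sketch below shows the effect with plain `re.findall()` on a made-up sentence, so it runs without NLTK:

```python
import re

text = "a £27.8m deal and an £8bn budget"

# With a capturing group, findall returns only what the group captured.
capturing = re.findall(r"£\d+(\.\d{1,3})?(?:m|k|bn)", text)

# With every group non-capturing, findall returns the whole token.
non_capturing = re.findall(r"£\d+(?:\.\d{1,3})?(?:m|k|bn)", text)

print(capturing)
print(non_capturing)
```

When you move the non-capturing version into the verbose pattern, remember that it also needs a `|` between it and the neighbouring alternative.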


## Task 3 - lookahead and lookbehind assertions
Use the lookaround capabilities to match the following (see the lookaround slide in lecture 3 as a start)
- a password with at least 6 characters, containing two upper case letters, two digits and two punctuation marks
- the above password, but it can't start with AB, ab, 01 or 12 to trap the really obvious abcd... and 1234... passwords
- separate out the different components of a camelCase word (e.g. getElementById -> [get, Element, By, Id])
  - Hint: identify each location where the previous character is lowercase and the next character is uppercase

In [82]:
#password task

regex = r"""(?x)
    (?!AB|ab|01|12)
    (?=\S{6})
    (?=(.*[A-Z]){2})
    (?=(.*[0-9]){2})
    (?=(.*[!@#$%&]){2}) #let's suppose this is the set of punctuation marks for now
    .*
    """
string = "PA$$word13"

result = re.match(regex, string)
print(result)

<re.Match object; span=(0, 10), match='PA$$word13'>


In [105]:
# if we look behind and see a lowercase letter or digit, and the next character is uppercase, then this is a word
# "inside" the camelCase; otherwise we are at the start of the camelCase word, so consume a lowercase
# alphanumeric string starting with an a-z letter
regex = r"""(?x)
    (?<=[a-z0-9])[A-Z][a-z0-9]+
    | [a-z][a-z0-9]*
    """
string = " myCamelCase"
match = re.findall(regex, string)
print(match)

['my', 'Camel', 'Case']
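
An alternative that follows the hint more literally is to split at every position that has a lowercase character behind it and an uppercase character ahead of it. This sketch uses `re.split()` on a zero-width pattern, which needs Python 3.7 or later; the helper name `split_camel_case` is my own.

```python
import re

def split_camel_case(word):
    # Split at each empty position preceded by a lowercase letter or digit
    # and followed by an uppercase letter; no characters are consumed.
    return re.split(r"(?<=[a-z0-9])(?=[A-Z])", word)

print(split_camel_case("getElementById"))
print(split_camel_case("myCamelCase"))
```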


## Optional Task 4 - Analysing Lightbulb Jokes
There were more than 10k lightbulb jokes told on Twitter in 2015. Here are a couple of examples of the genre.
- Q: How many thriller writers does it take to change a lightbulb? A: Two. One to screw it almost all the way in and another to give a surprising twist at the end.
- Do you know how many folk musicians it takes to change a lightbulb? Five. One to change the lightbulb, and four to write songs about how much better the old bulb was.

The standard pattern for the opening of a lightbulb joke is _How many X does it take to change a lightbulb?_

The file [lightbulbs-2015.txt](https://secure.ecs.soton.ac.uk/notes/comp3225/labs/lightbulbs-2015.txt) contains one lightbulb joke tweet per line.
Write Python code that uses regular expressions to isolate the topic X of each joke, and use it to produce a summary of the top 100 topics of lightbulb humour.
- You should throw away the answers (punchlines) to the joke, just look for the topic
- Hint: you will need to use trial and error to deal with variations in language, case and punctuation. You may want to test sets of regular expressions in an editor such as VI to allow you to see the coverage and special cases before you commit them to python code.
- This is unmoderated Internet gathered data. Apologies for any inappropriate language that it might contain, or any examples of offensive humour. If you find anything that should be removed, please let me know.

The file [lightbulbs-2020.txt](https://secure.ecs.soton.ac.uk/notes/comp3225/labs/lightbulbs-2020.txt) contains the same data for last year. Use the regular expressions you developed to generate a top 100 summary for 2020 and compare the two years. What significant changes in topics have there been between 2015 and 2020?

In [136]:
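#we are ONLY interested in the part that starts with "how many", so throw away any parts before
regex = r".*[Hh]ow\s+many\s+(((?!does)[a-zA-Z]+\s+)*)"
# inp is not defined in this cell, so use the first example joke above as test input
inp = "Q: How many thriller writers does it take to change a lightbulb? A: Two."
matched = re.match(regex, inp)
# re.match returns None when nothing matches, so check before calling group()
if matched:
    print(matched.group(1).strip())

thriller writers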

In [None]:
#download the file beforehand
fname="lightbulbs.txt"

#regular expression to isolate just the topic: the words after "How many", up to (but not including) "does"
pattern=r"How\s+many\s+(((?!does)[a-zA-Z]+\s+)*)"



lineN=0; matchN=0
for line in open(fname,"r"):
    lineN+=1
    match=re.search(pattern, line)
    if match:
        matchN+=1
        print(match.group(1))
print("The regexp matched %s jokes out of %s" % (matchN, lineN)) 

#now go on and do the analysis

The regexp matched 0 jokes out of 10042
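
For the analysis step, `collections.Counter` does most of the work. The sketch below is only a starting point under some assumptions: the list `lines` stands in for the file (the first two entries are the example jokes from this task with the punchlines cut, the third is made up), and the pattern assumes the topic is followed by "does it take" or "it takes".

```python
import re
from collections import Counter

# Stand-in for the lines of the jokes file.
lines = [
    "Q: How many thriller writers does it take to change a lightbulb? A: Two.",
    "Do you know how many folk musicians it takes to change a lightbulb? Five.",
    "how many Folk Musicians does it take to change a light bulb?",
]

# Topic = the words between "how many" and "does it take" / "it takes".
pattern = re.compile(r"how\s+many\s+(.+?)\s+(?:does\s+it\s+take|it\s+takes)", re.IGNORECASE)

topics = Counter()
for line in lines:
    match = pattern.search(line)
    if match:
        topics[match.group(1).lower()] += 1   # lowercase so case variants are counted together

for topic, count in topics.most_common(100):
    print(count, topic)
```

On the real data you would replace `lines` with the open file and expect to add more variants to the pattern as you find them.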


## Optional Task 5 - citations

The file [thesis.txt](https://secure.ecs.soton.ac.uk/notes/comp3225/labs/citations-thesis.txt) contains the text of a PhD thesis. The examiner wants to know how many papers the student has cited, and of those, how many are “recent” (i.e. from the last decade).
The bibliographic style used means that citations appear in the text with the basic form (surnames year), and you can see examples of this as they appear in the PDF <img src="https://secure.ecs.soton.ac.uk/notes/comp3225/labs/citations-sample.png" width="400">

Write regular expressions to extract the list of citations from the thesis, answering the examiner's questions above.

In [None]:
#code
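
One possible way in, sketched below on a made-up sentence rather than the thesis itself: first pull out every parenthesised span that contains a four-digit number, then pick the individual (surnames, year) pairs out of each span. The citation shapes handled here ("Smith 1998", "Jones and Patel 2016", "Lee et al. 2019", several separated by semicolons) and the cutoff year are assumptions; check them against the sample image and the real text.

```python
import re

# Made-up sample text; the citation shapes are assumptions about the bibliographic style.
text = ("Early work (Smith 1998) was extended (Jones and Patel 2016; Lee et al. 2019) "
        "and later revisited (Smith 1998).")

# One citation: a capitalised surname, optionally "and Surname" or "et al.", then a year.
CITATION = re.compile(
    r"([A-Z][A-Za-z'-]+(?:\s+and\s+[A-Z][A-Za-z'-]+|\s+et\s+al\.)?)\s+((?:19|20)\d{2})"
)

citations = set()   # a set, so a paper cited twice is counted once
for span in re.findall(r"\(([^()]*\d{4}[^()]*)\)", text):
    citations.update(CITATION.findall(span))

recent_cutoff = 2015   # hypothetical threshold for "recent"
recent = {c for c in citations if int(c[1]) >= recent_cutoff}

print(len(citations), "distinct citations,", len(recent), "recent")
```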

## Optional Task 6 - ELIZA

ELIZA was an early natural language processing computer program, created in 1964 (the year I was born) at the MIT Artificial Intelligence Laboratory to demonstrate superficial communication techniques (OMG - it's like we're twins).

ELIZA simulated conversation by using a "pattern matching" and substitution methodology that gave users an illusion of understanding on the part of the program. Famously, ELIZA simulated a style of psychotherapist well known for simply parroting back at patients what they had just said, and used rules to respond to user inputs with non-directional questions. As such, ELIZA was one of the first chatterbots and one of the first programs capable of attempting the Turing test.

You should implement an ELIZA-like conversational program, using substitutions such as those described on page 11 in the textbook.

In [None]:
#code
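
If you are unsure how to begin, the sketch below shows the basic shape: an ordered list of (pattern, reply) rules, where the reply can reuse captured text via `\1`. The rules and the fallback reply are illustrative inventions in the spirit of the textbook's examples, not ELIZA's original script.

```python
import re

# Illustrative rules only; they are tried in order and the first match wins.
RULES = [
    (r".*\bI(?:'m| am) (depressed|sad)\b.*", r"I am sorry to hear you are \1."),
    (r".*\bI need (.+?)[.!]?$", r"Why do you need \1?"),
    (r".*\ball\b.*", "In what way?"),
    (r".*\balways\b.*", "Can you think of a specific example?"),
]

def respond(utterance):
    for pattern, reply in RULES:
        match = re.match(pattern, utterance, flags=re.IGNORECASE)
        if match:
            # expand() fills \1 in the reply with the captured text
            return match.expand(reply)
    return "Please go on."   # fallback when no rule applies

for line in ["I'm sad today", "I need a holiday.", "They are all alike", "Hello"]:
    print(">", line)
    print(respond(line))
```

A fuller version would wrap `respond()` in an `input()` loop and swap pronouns (my/your, I/you) in the captured text before echoing it back.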