# Python for Everybody
## Chapter 11: Regular Expressions

### Exploring Regular Expressions

Importing the $\texttt{re}$ library

In [1]:
import re

Searching for lines that *contain* $\texttt{'From: '}$:

In [4]:
import re
hand = open("../texts/mbox-short.txt")
for line in hand:
    line = line.rstrip()        # Removing the trailing whitespace
    if re.search('From: ', line):
        print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


Searching for lines that *start with* $\texttt{'From: '}$:

In [11]:
import re
hand = open("../texts/mbox.txt")
for line in hand:
    line = line.rstrip()
    if re.search('^From: ', line):
        print(line)


From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: mmmay@indiana.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: cwen@iupui.ed
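
The two searches differ only on lines where the text appears somewhere other than at the start. A minimal sketch of that difference, using a few made-up sample lines (an assumption for illustration, not data from the mbox file):

```python
import re

# Hypothetical sample lines, chosen only to illustrate the caret anchor
sample_lines = [
    "From: alice@example.com",
    "    From: bob@example.com",
    "Reply-From: carol@example.com",
]

# Unanchored: matches 'From: ' anywhere in the line
contains = [line for line in sample_lines if re.search('From: ', line)]

# Anchored with ^: matches only when the line begins with 'From: '
starts_with = [line for line in sample_lines if re.search('^From: ', line)]

print(len(contains))     # 3
print(starts_with)       # ['From: alice@example.com']
```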

### 11.1 Character Matching in Regular Expressions

Search for lines that start with $\texttt{'F'}$, followed by 2 characters, followed by $\texttt{'m:'}$:

In [9]:
# Search for lines that start with 'F', followed by
# 2 characters, followed by 'm:'

import re
hand = open('../texts/mbox.txt')
criteria = '^F..m:'                 # 'F' at the line start, any 2 characters, then 'm:'
for line in hand:
    line = line.rstrip()
    if re.search(criteria, line):
        print(line)
        break


From: stephen.marquard@uct.ac.za


Search for lines that start with $\texttt{From}$ and have an $\texttt{@}$ sign:

In [8]:
# Search for lines that start with From and have an at sign

import re
hand = open('../texts/mbox.txt')
criteria = "^From:.+@"              # 'From:' at the line start, one or more characters, then '@'
for line in hand:
    line = line.rstrip()
    if re.search(criteria, line):
        print(line)
        break

From: stephen.marquard@uct.ac.za
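
The wildcard behaviour used in the two cells above can be checked on short strings. This is only a sketch with made-up test strings (not taken from the file); it shows that a dot matches exactly one character of any kind and that a dot followed by a plus sign needs at least one character:

```python
import re

# Made-up test strings (not from the mbox file)
# '.' matches any single character, so exactly two characters
# must sit between 'F' and 'm:'
dot_results = [bool(re.search('^F..m:', text))
               for text in ['From: a@b.c', 'F12m: x', 'Form', 'Fm: x']]
print(dot_results)    # [True, True, False, False]

# '.+' needs at least one character between 'From:' and '@'
plus_results = [bool(re.search('^From:.+@', text))
                for text in ['From: a@b.c', 'From:@b.c']]
print(plus_results)   # [True, False]
```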


### 11.2 Extracting Data using RE

Extracting all email addresses from a line:

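$\texttt{\\S+}$ matches as many non-whitespace characters as possible. The pattern below therefore matches a run of non-whitespace characters, followed by $@$, followed by another run of non-whitespace characters. The trailing $\texttt{@2PM}$ in the example is not returned because at least one non-whitespace character must come directly before the $@$.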

In [17]:
import re
s = "A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM"
lst = re.findall(r'\S+@\S+', s)     # Raw string, so \S reaches the regex engine unchanged
print(lst)

['csev@umich.edu', 'cwen@iupui.edu']
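
For comparison, $\texttt{re.search()}$ reports only the first match (or $\texttt{None}$), while $\texttt{re.findall()}$ returns every non-overlapping match as a list of strings. A small sketch on the same sentence:

```python
import re

s = "A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM"

first = re.search(r'\S+@\S+', s)         # Match object for the first hit, or None
print(first.group())                     # csev@umich.edu

everything = re.findall(r'\S+@\S+', s)   # list with all matches
print(everything)                        # ['csev@umich.edu', 'cwen@iupui.edu']

nothing = re.findall(r'\S+@\S+', "no address here")
print(nothing)                           # []
```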


Search for lines that have an @ sign between characters:

In [39]:
import re
hand = open("../texts/mbox.txt")
for line in hand:
    line = line.rstrip()
    x = re.findall(r"\S+@\S+", line)
    if len(x)>0:
        print(x)


['stephen.marquard@uct.ac.za']
['<postmaster@collab.sakaiproject.org>']
['<200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['stephen.marquard@uct.ac.za']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['louis@media.berkeley.edu']
['<postmaster@collab.sakaiproject.org>']
['<200801042308.m04N8v6O008125@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['louis@media.berkeley.edu']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['louis@media.berkeley.edu']
['zqian@umich.edu']
['<postmaster@collab.sakaiproject.org>']
['<200801042109.m04L92hb007923@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject

Undesired characters such as $\texttt{;}$, $\texttt{<}$ and $\texttt{>}$ can be removed by requiring the match to start with a letter or digit and to end with a letter:

In [49]:
# Search for lines that have an at sign between characters
# The match must start with a letter or digit and end with a letter
import re
hand = open('../texts/mbox.txt')
criteria = r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]'
for line in hand:
    line = line.rstrip()
    x = re.findall(criteria, line)
    if len(x) > 0:
        print(x)

['stephen.marquard@uct.ac.za']
['postmaster@collab.sakaiproject.org']
['200801051412.m05ECIaH010327@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['louis@media.berkeley.edu']
['postmaster@collab.sakaiproject.org']
['200801042308.m04N8v6O008125@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['louis@media.berkeley.edu']
['zqian@umich.edu']
['postmaster@collab.sakaiproject.org']
['200801042109.m04L92hb007923@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject
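
To see what the bracketed character sets do, the pattern can be tried directly on some of the raw strings printed by the earlier cell. The sketch below takes three of those strings as literal inputs instead of reading the file:

```python
import re

criteria = r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]'

# Three raw matches copied from the earlier output
raw = [
    '<postmaster@collab.sakaiproject.org>',
    '<source@collab.sakaiproject.org>;',
    'apache@localhost)',
]

# The first set skips the leading '<'; the last set stops at the final letter
cleaned = [re.findall(criteria, text)[0] for text in raw]
print(cleaned)
# ['postmaster@collab.sakaiproject.org', 'source@collab.sakaiproject.org', 'apache@localhost']
```

Because the final set is $\texttt{[a-zA-Z]}$, a string whose last character is a digit would be cut back to its last letter.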

### 11.3 Combining, Searching & Extracting

Lines that start with the string $\texttt{'X-'}$ and contain a number after the colon can be found in the following way; the cell after it extracts just the number:

In [34]:
# Search for lines that start with 'X-' followed by any non-
# whitespace characters and ':'
# followed by a space and a number.
# The number can include a decimal point.

import re

hand = open("../texts/mbox.txt")
criteria = r"^X-\S*: [0-9.]+"
for line in hand:
    line = line.rstrip()
    if re.search(criteria, line):
        print(line)

X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7565
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7626
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7556
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7002
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7615
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7601
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7605
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6959
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7606
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7559
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7605
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6932
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7558
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6526
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6948
X-DSPAM-Probability: 0.0000
X-DSPAM-Co

In [32]:
# Search for lines that start with 'X' followed by any
# non-whitespace characters and ':' followed by a space
# and a number. The number can include a decimal point.
# Then print the captured number if one is found.

import re

handle = open("../texts/mbox.txt")
for line in handle:
    line = line.rstrip()
    x = re.findall(r"^X\S*: ([0-9.]+)", line)
    if len(x)>0:
        print(x)

['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
['0.0000']
['0.7565']
['0.0000']
['0.7626']
['0.0000']
['0.7556']
['0.0000']
['0.7002']
['0.0000']
['0.7615']
['0.0000']
['0.7601']
['0.0000']
['0.7605']
['0.0000']
['0.6959']
['0.0000']
['0.7606']
['0.0000']
['0.7559']
['0.0000']
['0.7605']
['0.0000']
['0.6932']
['0.0000']
['0.7558']
['0.0000']
['0.6526']
['0.0000']
['0.6948']
['0.0000']
['0.6528']
['0.0000']
['0.7002']
['0.0000']
['0.7554']
['0.0000']
['0.6956']
['0.0000']
['0.6959']
['0.0000']
['0.7556']
['0.0000']
['0.9846']
['0.0000']
['0.8509']
['0.0000']
['0.9907']
['0.0000']
['0.7003']
['0.0000']
['0.8507']
['0.0000']
['0.9895']
['0.0000']
['0.9965']
['0.0000']
['0.9875']
['0.0000']
['0.9867']
['0.0000']
['0.9903']
['0.0000']
['0.7006']
['0.0000']
['0.9907']
['0.0000']
['0.9886']
['0.0000']
['0.8495']
['0.0000']
['0.7606']
['0.0000']
['0.9875']
['0.0000']
['0.8489']
['0.0000']
['0.9854']
['0.0000']
['0.7549']
['0.0000']
['0.9877']
['0.0000']
['0.9881']
['0.0000']
['0.9864']
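
Since $\texttt{findall()}$ returns the captured text as strings, the values need converting before any arithmetic. A minimal sketch, using four of the lines printed earlier as in-memory input rather than reading the file:

```python
import re

lines = [
    "X-DSPAM-Confidence: 0.8475",
    "X-DSPAM-Probability: 0.0000",
    "X-DSPAM-Confidence: 0.6178",
    "X-DSPAM-Probability: 0.0000",
]

values = []
for line in lines:
    # Only the part inside the parentheses is returned
    x = re.findall(r"^X\S*: ([0-9.]+)", line)
    if len(x) > 0:
        values.append(float(x[0]))

print(values)        # [0.8475, 0.0, 0.6178, 0.0]
print(max(values))   # 0.8475
```

Note that $\texttt{[0-9.]+}$ would also accept text such as $\texttt{1.2.3}$, which $\texttt{float()}$ rejects with a $\texttt{ValueError}$.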

If we wanted to extract all of the revision numbers (the integer that follows
$\texttt{rev=}$ at the end of the lines beginning with $\texttt{Details:}$) using the
same technique as above, we could write the following program:

In [60]:
# Search for lines that start with 'Details:' followed by
# any characters, then 'rev=' and one or more digits
# Then print the number if one is found

import re

hand = open('../texts/mbox.txt')

for line in hand:
    line = line.rstrip()
    x = re.findall('^Details:.*rev=([0-9]+)', line)
    if len(x) > 0:
        print(x)

['39772']
['39771']
['39770']
['39769']
['39766']
['39765']
['39764']
['39763']
['39762']
['39761']
['39760']
['39759']
['39758']
['39757']
['39756']
['39755']
['39754']
['39753']
['39752']
['39751']
['39750']
['39749']
['39746']
['39745']
['39744']
['39743']
['39742']
['39741']
['39740']
['39739']
['39738']
['39737']
['39736']
['39735']
['39734']
['39733']
['39732']
['39731']
['39730']
['39728']
['39729']
['39727']
['39726']
['39725']
['39724']
['39723']
['39722']
['39721']
['39720']
['39719']
['39718']
['39717']
['39716']
['39715']
['39714']
['39713']
['39712']
['39711']
['39710']
['39709']
['39708']
['39707']
['39706']
['39697']
['39696']
['39695']
['39694']
['39692']
['39691']
['39690']
['39689']
['39688']
['39687']
['39686']
['39685']
['39684']
['39683']
['39682']
['39681']
['39680']
['39679']
['39678']
['39677']
['39676']
['39675']
['39674']
['39673']
['39672']
['39671']
['39670']
['39669']
['39668']
['39667']
['39666']
['44484']
['39665']
['39664']
['39663']
['39662']
['39660']


Consider the following entry in a text file:

$\texttt{From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008}$

If we want to extract $\texttt{hour}$, $\texttt{minutes}$ or $\texttt{seconds}$, it can be achieved in the following manner:

In [88]:
# Match a line that starts with 'From ', then capture a
# two-digit number (00 to 99) next to a ':' in the timestamp
# Then print the number if one is found

import re

s = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008"
hour = re.findall("^From .* ([0-9][0-9]):", s)
minutes = re.findall("^From .*:([0-9][0-9]):", s)
seconds = re.findall("^From .*:([0-9][0-9])", s)
print("Hour:", hour[0])
print("Minutes:", minutes[0])
print("Seconds:", seconds[0])

# Similarly, the weekday can be extracted from the 'From ' lines of a file (first 8 matches only)

hand = open('../texts/mbox.txt')
count = 0
for line in hand:
    line = line.rstrip()
    x = re.findall(r"^From \S* ([A-Z]\S*)", line)
    if len(x)>0:
        if count > 7:
            break
        print(x[0])
        count += 1


Hour: 09
Minutes: 14
Seconds: 16
Sat
Fri
Fri
Fri
Fri
Fri
Fri
Fri
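
When a pattern contains several pairs of parentheses, $\texttt{findall()}$ returns a list of tuples with one element per group. As an alternative sketch (not the approach used in the cell above), the three time fields can be captured in a single pass:

```python
import re

s = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008"

# Three capture groups: hour, minutes, seconds
parts = re.findall("^From .* ([0-9][0-9]):([0-9][0-9]):([0-9][0-9])", s)
print(parts)                     # [('09', '14', '16')]

hour, minutes, seconds = parts[0]
print(hour, minutes, seconds)    # 09 14 16
```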


### 11.4 Escape Character
Since special characters in regular expressions match the beginning or end of a
line or act as wildcards, we need a way to indicate that such a character is
“normal”, that is, that we want to match the actual character, such as a dollar
sign or caret.

We can indicate that we want to match a character literally by prefixing it with a backslash. For example, we can find money amounts with the following
regular expression:


In [110]:
import re
x = 'We just received $10.00 for cookies.'
y = re.findall(r'\$[0-9.]+', x)
print(y)

['$10.00']
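
The effect of the backslash is easiest to see by removing it. In the sketch below, the unescaped dollar sign keeps its special meaning (end of the string), so nothing matches; $\texttt{re.escape()}$ from the standard library is shown as one way to add the backslashes to a literal string automatically:

```python
import re

x = 'We just received $10.00 for cookies.'

# Unescaped: '$' anchors at the end of the string, so no digits can follow it
unescaped = re.findall('$[0-9.]+', x)
print(unescaped)     # []

# Escaped: '\$' matches a literal dollar sign
escaped = re.findall(r'\$[0-9.]+', x)
print(escaped)       # ['$10.00']

# re.escape() backslash-escapes the special characters in a literal string
literal = re.escape('$10.00')
print(re.findall(literal, x))   # ['$10.00']
```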


### 11.6 Exercise 2
Write a program to look for lines of the form:

$\texttt{New Revision: 39772}$

Extract the number from each of the lines using a regular expression
and the $\texttt{findall()}$ method. Compute the average of the numbers and
print out the average as an integer.

In [125]:
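import re

def avg_revision(filename):
    """Return the integer average of the 'New Revision' numbers in the file."""
    criteria = "New Revision: ([0-9]+)"
    lst = []
    # The with statement closes the file when the loop is done
    with open("../texts/" + filename) as handle:
        for line in handle:
            line = line.rstrip()
            x = re.findall(criteria, line)
            if len(x) > 0:
                lst.append(int(x[0]))
    average = sum(lst) / len(lst)       # ZeroDivisionError if no line matched
    return int(average)

filename = input("Enter file name: ")

try:
    average = avg_revision(filename)
    print("The average is", average)
except FileNotFoundError:
    print("Please enter valid file name from the directory")
except ZeroDivisionError:
    print("No 'New Revision' lines found in", filename)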

The average is 38549
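
The averaging step can also be checked without a file by passing the lines in directly. This is only a sketch: $\texttt{avg\_from\_lines}$ is a hypothetical helper, and the sample lines are made up for illustration.

```python
import re

def avg_from_lines(lines):
    """Hypothetical helper: integer average of the 'New Revision' numbers in lines."""
    numbers = []
    for line in lines:
        x = re.findall("New Revision: ([0-9]+)", line)
        if len(x) > 0:
            numbers.append(int(x[0]))
    if len(numbers) == 0:
        return None          # nothing matched, so there is no average
    return int(sum(numbers) / len(numbers))

# Made-up sample input
sample = ["New Revision: 39772", "Author: someone",
          "New Revision: 39771", "New Revision: 39770"]
print(avg_from_lines(sample))   # 39771
print(avg_from_lines([]))       # None
```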
