In [2]:
import re

# REGEX

In [2]:
text = "It's a good day today"

if re.search("good", text):
    print("Wonderful!")
else:
    print("Alas")

Wonderful!


## Tokenizing

In [3]:
# split() returns the chunks between matches, findall() returns the matches themselves
amy = "Amy goes to the library. Amy does this. Our student Amy always gets good grades"
re.split("Amy", amy)

['',
 ' goes to the library. ',
 ' does this. Our student ',
 ' always gets good grades']

In [4]:
re.findall("Amy", amy)

['Amy', 'Amy', 'Amy']

## Anchors

Specify the start and/or end of the strings I'm trying to match

- ^ => start
- $ => end
- . => not an anchor, but often used with them: matches any character except newline (with the re.DOTALL flag, it matches newline as well)

In [10]:
re.search("^Amy", amy) # matches only if the string starts with Amy

<re.Match object; span=(0, 3), match='Amy'>
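
A minimal sketch of how the anchors and `.` behave, assuming a made-up two-line string (`sample` and the variable names are illustrative, not part of the notes above). It also shows `re.MULTILINE`, which is the flag that lets `^` match at the start of every line rather than only at the start of the string.

```python
import re

sample = "Amy reads.\nAmy writes."  # made-up two-line string

# ^ matches only at the very start of the string by default
start_only = re.findall("^Amy", sample)
# With re.MULTILINE, ^ also matches right after each newline
each_line = re.findall("^Amy", sample, re.MULTILINE)
# $ matches at the end of the string
ends_ok = re.search(r"writes\.$", sample) is not None
# . stops at a newline unless re.DOTALL is passed
dot_default = re.findall("reads.+", sample)
dot_all = re.findall("reads.+", sample, re.DOTALL)

print(start_only)   # ['Amy']
print(each_line)    # ['Amy', 'Amy']
print(ends_ok)      # True
print(dot_default)  # ['reads.']
print(dot_all)      # ['reads.\nAmy writes.']
```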

### Patterns and character classes

In [21]:
grades = "ABABBBAACABB"

# How many Bs?
len(re.findall("B", grades))

6

In [20]:
# How many As or Bs? 
# Set operator - using it means character-based matching
re.findall("[AB]", grades)

['A', 'B', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B']

In [22]:
# How many As followed by Bs or Cs?
re.findall("[A][B-C]", grades)
# A set can also contain a range of characters, ordered by character code
# To refer to all uppercase letters, I would use [A-Z]



['AB', 'AB', 'AC', 'AB']

In [24]:
# Notice how [AB] denotes a single set that matches A or B, while [A][B-C] denotes two sets that must match back to back. The same can be written with alternation:
re.findall("AB|AC", grades)

['AB', 'AB', 'AC', 'AB']

In [25]:
# Inside a set, a leading ^ negates the set
re.findall("[^A]", grades) # find the grades that are not As

['B', 'B', 'B', 'B', 'C', 'B', 'B']

### Quantifiers

A quantifier sets how many times a pattern must repeat for the match to succeed.

```
e{m} => exactly m copies
e{m,n} => from m to n repetitions
e{m,n}? => non-greedy, from m to n reps but as few as possible
e{m,n}+ => possessive, as many as possible without backtracking (Python 3.11+)
```

e is the expression, m is the minimum number to match and n the maximum

In [28]:
re.findall("A{2}", grades)

['AA']
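
A short sketch of the counted forms on the same `grades` string, comparing an exact count, a greedy range and a non-greedy range. The variable names are illustrative only.

```python
import re

grades = "ABABBBAACABB"

# Exactly two Bs in a row
exactly_two = re.findall("B{2}", grades)
# One to three Bs, greedy: takes the longest run available
greedy_range = re.findall("B{1,3}", grades)
# One to three Bs, non-greedy: stops as soon as the minimum is met
lazy_range = re.findall("B{1,3}?", grades)
# One or two As, greedy
one_or_two_a = re.findall("A{1,2}", grades)

print(exactly_two)   # ['BB', 'BB']
print(greedy_range)  # ['B', 'BBB', 'BB']
print(lazy_range)    # ['B', 'B', 'B', 'B', 'B', 'B']
print(one_or_two_a)  # ['A', 'A', 'AA', 'A']
```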

#### Other quantifiers

- "*" - match zero or more times
- "?" - match zero or one times
- "+" - match one or more times

**\***, **?** and **+** are all greedy: they match as much text as possible.
To make them non-greedy, add **?** after them (*?, ??, +?)
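
A small sketch of the difference, assuming a made-up HTML-like string (`html` is an illustrative name). The greedy form swallows everything up to the last `>`, while the non-greedy form stops at the first one.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # made-up sample

# Greedy: .+ runs to the last > in the string
greedy = re.findall("<.+>", html)
# Non-greedy: .+? stops at the first > it can
lazy = re.findall("<.+?>", html)
# ? on its own makes the preceding character optional
optional_u = re.findall("colou?r", "color colour")

print(greedy)      # ['<b>bold</b> and <i>italic</i>']
print(lazy)        # ['<b>', '</b>', '<i>', '</i>']
print(optional_u)  # ['color', 'colour']
```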

In [21]:
with open("ferpa.txt") as ferpatxt:
    wiki = ferpatxt.read()

print(f"{wiki[:500]}...")

The Family Educational Rights and Privacy Act of 1974 (FERPA or the Buckley Amendment) is a United States federal law that governs the access to educational information and records by public entities such as potential employers, publicly funded educational institutions, and foreign governments.[1] The act is also referred to as the Buckley Amendment, for one of its proponents, Senator James L. Buckley of New York.[2]

FERPA is a U.S. federal law that regulates access and disclosure of student ed...


In [11]:
# something easier maybe? Try to match section titles (Overview, etc.)
# They are preceded by a new line, followed by a new line
re.findall(r"\n[\w ]+\n", wiki)

['\nOverview\n', '\nAccess to public records\n', '\nStudent medical records\n']

In [24]:
# Since I don't have this guy's ferpa.txt, I'm gonna do something different
# Try to match the sentences preceding citations marked by numbers in square brackets
re.findall(r"[\w .,\"()']+\[[0-9]+\]", wiki)

['The Family Educational Rights and Privacy Act of 1974 (FERPA or the Buckley Amendment) is a United States federal law that governs the access to educational information and records by public entities such as potential employers, publicly funded educational institutions, and foreign governments.[1]',
 ' The act is also referred to as the Buckley Amendment, for one of its proponents, Senator James L. Buckley of New York.[2]',
 "FERPA gives parents access to their child's education records, an opportunity to seek to have the records amended, and some control over the disclosure of information from the records. With several exceptions, schools must have a student's consent prior to the disclosure of education records after that student is 18 years old. The law applies only to educational agencies and institutions that receive funds under a program administered by the U.S. Department of Education.[3]",
 'mail addresses.[4]',
 " For example, schools may provide external companies with a st

### Metacharacters

- \w - matches any word character (letter, digit or underscore)
- \W - any non-word character
- \d - any single digit 0-9
- \D - any non-digit
- \s - any whitespace character
- \S - any non-whitespace character
- \b - word boundary (the position where a word char is followed by a non-word char or vice versa)
- \B - non-word boundary (any position that is not a word boundary)
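
A quick sketch of these metacharacters on made-up strings (the sample text and variable names are illustrative). Raw strings (`r"..."`) are used so that Python passes the backslashes through to the regex engine unchanged.

```python
import re

line = "Room 42 opens at 9am"  # made-up sample

digits = re.findall(r"\d+", line)          # runs of digits
words = re.findall(r"\w+", line)           # runs of word characters
pieces = re.split(r"\s+", "a  b\tc\nd")    # split on any whitespace run

# \b keeps "cat" from matching inside "concat"
any_cat = re.findall(r"cat", "cat concat")
whole_cat = re.findall(r"\bcat\b", "cat concat")

print(digits)     # ['42', '9']
print(words)      # ['Room', '42', 'opens', 'at', '9am']
print(pieces)     # ['a', 'b', 'c', 'd']
print(any_cat)    # ['cat', 'cat']
print(whole_cat)  # ['cat']
```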

### Groups

In [30]:
# What if I wanted to get the section titles, but without the surrounding \n?
# I use groups, which are enclosed in (...)

titles = re.findall(r"(\n)([\w ]+)(\n)", wiki)
for title in titles:
    print(title[1])

Overview
Access to public records
Student medical records


In [39]:
# Alternatively
for item in re.finditer(r"(\n)([\w ]+)(\n)", wiki):
    print(item.group(2)) # group(0) is the whole match, group(1) the first \n, group(2) the title

Overview
Access to public records
Student medical records
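
A minimal sketch of how groups are numbered, assuming a made-up string in place of the FERPA text (`text` and the variable names are illustrative). Groups are counted by their opening parenthesis from left to right, and `group(0)` is always the whole match.

```python
import re

text = "intro\nOverview\nbody text here.\n"  # made-up sample

m = re.search(r"(\n)([\w ]+)(\n)", text)
whole = m.group(0)        # entire match, newlines included
first = m.group(1)        # first (...) group: the leading newline
title = m.group(2)        # second (...) group: the title itself
all_groups = m.groups()   # every numbered group as a tuple

# Naming the group avoids counting parentheses
named = re.search(r"\n(?P<title>[\w ]+)\n", text).group("title")

print(repr(whole))   # '\nOverview\n'
print(title)         # Overview
print(all_groups)    # ('\n', 'Overview', '\n')
print(named)         # Overview
```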


### Look-ahead and look-behind

In [53]:
# Exactly what I was searching for: use a pattern to match but not include it in the results
# (?=...) is a look-ahead, so it only applies to what follows the title (the last \n);
# the preceding \n would need a look-behind, (?<=...), instead

re.findall(r"(\n)(?P<title>[\w ]+)(?=\n)", wiki)

[('\n', 'Overview'),
 ('\n', 'Access to public records'),
 ('\n', 'Student medical records')]
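
A sketch of the same idea with a look-behind for the leading newline as well, so that neither newline ends up in the results. The sample string is made up and stands in for the FERPA text. Because the pattern has no capturing groups, `findall` returns plain strings rather than tuples.

```python
import re

# made-up sample: two title lines between lines of body text
text = "intro.\nOverview\nbody text here.\nAccess to public records\nmore text."

# (?<=\n) requires a newline just before, (?=\n) requires one just after;
# neither newline is consumed or returned
titles = re.findall(r"(?<=\n)[\w ]+(?=\n)", text)

print(titles)  # ['Overview', 'Access to public records']
```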

### Verbose mode (multiline)

In [54]:
with open("buddhist.txt") as newfile:
    buddhist = newfile.read()

buddhist

'There are several Buddhist universities in the United States. Some of these have existed for decades and are accredited. Others are relatively new and are either in the process of being accredited or else have no formal accreditation. The list includes:\n\nDhammakaya Open University – located in Azusa, California, part of the Thai Wat Phra Dhammakaya[1]\nDharmakirti College – located in Tucson, Arizona Now called Awam Tibetan Buddhist Institute (http://awaminstitute.org/)\nDharma Realm Buddhist University – located in Ukiah, California (Accredited by the WASC Senior College and University Commission)\nEwam Buddhist Institute – located in Arlee, Montana\nNaropa University is located in Boulder, Colorado (Accredited by the Higher Learning Commission)\nInstitute of Buddhist Studies – located in Berkeley, California\nMaitripa College – located in Portland, Oregon\nSoka University of America – located in Aliso Viejo, California\nUniversity of the West – located in Rosemead, California (Acc

In [58]:
# With verbose mode we can write multiline regexes, which improves readability
# Caveat: literal whitespace must be escaped with a \ or written as \s
# The pattern is a raw string so that \n reaches the regex engine instead of being dropped as whitespace
# \nDharmakirti College – located in Tucson, Arizona Now called Awam Tibetan Buddhist Institute
pattern = r"""
(\n)
(?P<title>.*)  # the university
(\ –\ located\ in\ ) # indicator of location
(?P<city>\w*)      # the city
(,\ )              # comma separator
(?P<state>\w*)     # the state
"""

for item in re.finditer(pattern, buddhist, re.VERBOSE):
    print(item.groupdict())

{'title': 'Dhammakaya Open University', 'city': 'Azusa', 'state': 'California'}
{'title': 'Dharmakirti College', 'city': 'Tucson', 'state': 'Arizona'}
{'title': 'Dharma Realm Buddhist University', 'city': 'Ukiah', 'state': 'California'}
{'title': 'Ewam Buddhist Institute', 'city': 'Arlee', 'state': 'Montana'}
{'title': 'Institute of Buddhist Studies', 'city': 'Berkeley', 'state': 'California'}
{'title': 'Maitripa College', 'city': 'Portland', 'state': 'Oregon'}
{'title': 'University of the West', 'city': 'Rosemead', 'state': 'California'}
{'title': 'Won Institute of Graduate Studies', 'city': 'Glenside', 'state': 'Pennsylvania'}
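
A variation on the pattern above, as a sketch only: two lines from the list are copied into a made-up `sample` string, and the city group is changed to `[\w ]*` so that multi-word cities such as Aliso Viejo are captured too (with `\w*` that line does not match at all). Inside a character class, whitespace is kept even in verbose mode, so the space in `[\w ]` needs no escaping.

```python
import re

# made-up sample built from two lines of the list above
sample = (
    "List:\n"
    "Maitripa College – located in Portland, Oregon\n"
    "Soka University of America – located in Aliso Viejo, California\n"
)

pattern = r"""
(?P<title>.*)            # the university name
\ –\ located\ in\        # literal separator; spaces must be escaped
(?P<city>[\w ]*)         # the city, spaces allowed (assumption for this sketch)
,\                       # comma followed by an escaped space
(?P<state>\w*)           # the state (first word only)
"""

rows = [m.groupdict() for m in re.finditer(pattern, sample, re.VERBOSE)]
for row in rows:
    print(row)
```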


In [59]:
# Another example on the NYTimes health tweet data
with open("nytimeshealth.txt") as file:
    tweets = file.read()
tweets[:1000]

"548662191340421120|Sat Dec 27 02:10:34 +0000 2014|Risks in Using Social Media to Spot Signs of Mental Distress http://nyti.ms/1rqi9I1\n548579831169163265|Fri Dec 26 20:43:18 +0000 2014|RT @paula_span: The most effective nationwide diabetes prevention program you've probably never heard of:  http://newoldage.blogs.nytimes.com/2014/12/26/diabetes-prevention-that-works/\n548579045269852161|Fri Dec 26 20:40:11 +0000 2014|The New Old Age Blog: Diabetes Prevention That Works http://nyti.ms/1xm7fTi\n548444679529041920|Fri Dec 26 11:46:15 +0000 2014|Well: Comfort Casseroles for Winter Dinners http://nyti.ms/1xTNoO0\n548311901227474944|Fri Dec 26 02:58:39 +0000 2014|High-Level Knowledge Before Veterans Affairs Scandal http://nyti.ms/13yCpvS\n548305625449787392|Fri Dec 26 02:33:42 +0000 2014|Your Money: Affordable Care Act’s Tax Effects Now Loom for Filers http://nyti.ms/13yAtUf\n548283182853160960|Fri Dec 26 01:04:32 +0000 2014|Well: Christmas in the Hospital http://nyti.ms/1vtPNcm\n5482784145

In [72]:
pattern = r"""
(?P<id>[0-9]+)    # tweet ID
(\|)
(?P<timestamp>.*)        # date and time
(\|)
(?P<tweet>.*)            # tweet text
(?=\n)                   # look-ahead for the newline that ends each record
"""

for item in re.finditer(pattern, tweets, re.VERBOSE):
    print(item.groupdict())

{'id': '548662191340421120', 'timestamp': 'Sat Dec 27 02:10:34 +0000 2014', 'tweet': 'Risks in Using Social Media to Spot Signs of Mental Distress http://nyti.ms/1rqi9I1'}
{'id': '548579831169163265', 'timestamp': 'Fri Dec 26 20:43:18 +0000 2014', 'tweet': "RT @paula_span: The most effective nationwide diabetes prevention program you've probably never heard of:  http://newoldage.blogs.nytimes.com/2014/12/26/diabetes-prevention-that-works/"}
{'id': '548579045269852161', 'timestamp': 'Fri Dec 26 20:40:11 +0000 2014', 'tweet': 'The New Old Age Blog: Diabetes Prevention That Works http://nyti.ms/1xm7fTi'}
{'id': '548444679529041920', 'timestamp': 'Fri Dec 26 11:46:15 +0000 2014', 'tweet': 'Well: Comfort Casseroles for Winter Dinners http://nyti.ms/1xTNoO0'}
{'id': '548311901227474944', 'timestamp': 'Fri Dec 26 02:58:39 +0000 2014', 'tweet': 'High-Level Knowledge Before Veterans Affairs Scandal http://nyti.ms/13yCpvS'}
{'id': '548305625449787392', 'timestamp': 'Fri Dec 26 02:33:42 +0000 201