This notebook will document my initial learning of Regular Expressions (Regex), which should enable me to webscrape the information I need from the different conference talk webpages by setting patterns to look for. 

In [1]:
# imports necessary packages
import re
import os

In [2]:
# set working directory
os.chdir('D:\\Faith and Religion Stuff\\Come, Follow Me Breakdowns\\come-follow-me-breakdown-builder')

The link below is to a fairly comprehensive cheat sheet of Regex patterns.

https://www.rexegg.com/regex-quickstart.php

**re.search(pattern, string)** searches the string looking for the pattern, and returns a match object if that pattern is found. Otherwise it returns **None**. 

In [3]:
# saves as object 'text' the text to be searched through
text = 'I have 10 apples and 2 bananas.'
# saves as object 'pattern' the Regex pattern I'm looking for. 
# In this case the pattern is looking for any one or more digits
pattern = r'\d+'

# saves as object 'result' the result of the search using re.search()
result = re.search(pattern, text)
# if there is any result of the search, print it
if result:
    print(f'Match found: {result.group()}')
# otherwise print 'No match found.'
else:
    print('No match found.')

Match found: 10


Running this, only our first result was returned, since we used the **re.search()** function. If I had used **re.findall()**, it would have returned both 10 and 2. See below. 

In [4]:
# relies on same pattern and text as defined above, saves as a different object. 
result_findall = re.findall(pattern, text)
if result_findall:
    print(f'Match found: {result_findall}')
else:
    print('No match found.')

Match found: ['10', '2']


Using **re.findall()** returned a list object containing both 10 and 2, which are the two items in the text that match the pattern I set. 

In [5]:
# this pattern looks for any word character - ie, alphanumeric characters and underscores. 
# basically, it looks for a-z, A-Z, 0-9, and _. 
pattern = r'\w'

result = re.search(pattern, text)
if result:
    print(f"Match found: {result.group()}")
else:
    print("No match found.")

Match found: I


The above pattern and code returns "I", the first alphanumeric character in the string. 

In [6]:
# relies on same pattern and text as defined above, saves as a different object. 
result_findall = re.findall(pattern, text)
if result_findall:
    print(f'Match found: {result_findall}')
else:
    print('No match found.')

Match found: ['I', 'h', 'a', 'v', 'e', '1', '0', 'a', 'p', 'p', 'l', 'e', 's', 'a', 'n', 'd', '2', 'b', 'a', 'n', 'a', 'n', 'a', 's']


The code above, using **re.findall()**, saves as a list ALL word characters found. The spaces and the period are left out because they are not word characters. 

This **|** symbol is the OR operator, specifying that one or another pattern is to be searched for. 

In [7]:
pattern = r'apple|banana'

result = re.search(pattern, text)
if result:
    print("Fruit found:", result.group())
else:
    print("No fruit found.")

Fruit found: apple


Again, because I used **re.search()**, I got the first pattern found. Had I used **re.findall()**, I would have gotten both. 
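
As a quick check of that claim, here is a minimal sketch that reuses the same sentence and pattern with **re.findall()**. The names `fruit_text`, `fruit_pattern`, and `fruits` are my own, chosen so they don't overwrite anything defined above.

```python
import re

# same sentence and pattern as the cell above, stored under new names
fruit_text = 'I have 10 apples and 2 bananas.'
fruit_pattern = r'apple|banana'

# findall collects every non-overlapping match, scanning left to right
fruits = re.findall(fruit_pattern, fruit_text)
print(fruits)  # ['apple', 'banana']
```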

This operator will be useful for me, as one pattern I will be searching for is **Elder|Sister** in order to establish the gender of the speaker in my primary key making process. 

I could probably also look for **First Presidency|Apostles|Seventy|Relief|Primary|Sunday** in order to establish the organization the speaker belongs to. I'll test that theory below by using multiple OR operator symbols in my pattern. In my webscraping, I'll most likely just use **re.search()** since I'll be looking for the very first mention of any of these offices. 
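
Here is a rough sketch of how those two patterns might behave. The bylines are made up for illustration (I haven't checked how the real pages are worded), and the names `bylines`, `title_pattern`, `org_pattern`, and `found` are hypothetical.

```python
import re

# hypothetical bylines, invented for this test; real pages may be worded differently
bylines = [
    'By Elder John Doe, Of the Quorum of the Twelve Apostles',
    'By Sister Jane Doe, Primary General President',
]

title_pattern = r'Elder|Sister'
org_pattern = r'First Presidency|Apostles|Seventy|Relief|Primary|Sunday'

found = []
for byline in bylines:
    title = re.search(title_pattern, byline)
    org = re.search(org_pattern, byline)
    # re.search returns None when nothing matches, so guard before calling .group()
    found.append((title.group() if title else None,
                  org.group() if org else None))

print(found)  # [('Elder', 'Apostles'), ('Sister', 'Primary')]
```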

I've also changed the order to see if that has any impact. I'm **expecting** the result to be 'have', but if 'banana' is the result, it will tell me that the Regex pattern looks through the whole string for the first option or pattern, then looks through the whole string for the second, third, etc. If 'have' returns first, it tells me that Regex looks at the first word, checks if it matches any of the specified patterns, and then moves onto the second. An important distinction. 

In [8]:

pattern = r'banana|apple|have'

result = re.search(pattern, text)
if result:
    print("Match found:", result.group())
else:
    print("No match found.")

Match found: have


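Because the function returned 'have', it means the search moves through the string from left to right and, at each position, tries every option in the pattern. The earliest position where any option matches wins, even though 'have' is listed last in the pattern. Good to know.  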
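
One way to confirm this is to ask the match object where the match began. This is a small sketch using the original sentence; `order_text` and `order_match` are names I made up for it.

```python
import re

order_text = 'I have 10 apples and 2 bananas.'
order_match = re.search(r'banana|apple|have', order_text)

# .start() is the index where the match begins; 'have' begins at index 2,
# well before 'apple' or 'banana', so it is the first match found
print(order_match.group(), order_match.start())  # have 2
```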

In [9]:
# the dot in this regular expression pattern represents any character EXCEPT new line characters. 
pattern = r'ap.le' 

result = re.search(pattern, text)
if result:
    print("Match found:", result.group())
else:
    print("No fruit found.")

Match found: apple


The code above returned 'apple' because it found a set of characters that started with "ap", had any single character following it, and then had "le" following that. 

In [10]:
text = 'I love aptles and bananas.'

pattern = r'ap.le' 

result = re.search(pattern, text)
if result:
    print("Match found:", result.group())
else:
    print("No fruit found.")

Match found: aptle


When I changed the text to say 'aptles' instead of 'apples', the pattern still matched, this time returning 'aptle', because it found, again, a set of characters that started with "ap", had any single character following it, and then had "le" following that. 

I feel like this Regex principle will also be handy. We'll see. 
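
The one exception noted in the comment above, newline characters, can be checked with a short sketch. The `re.DOTALL` flag is part of Python's `re` module; the name `dotall_match` is just for this example.

```python
import re

# by default the dot does not match a newline, so this search finds nothing
no_match = re.search(r'ap.le', 'ap\nle')
print(no_match)  # None

# with the re.DOTALL flag, the dot matches newlines as well
dotall_match = re.search(r'ap.le', 'ap\nle', flags=re.DOTALL)
print(repr(dotall_match.group()))  # 'ap\nle'
```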

Putting characters inside brackets **[]** establishes a class, or group, of characters for the regex finder to look for. For example **[aeiou]** looks for a OR e OR i OR o OR u, so it would be the same as **r'a|e|i|o|u'**. I'll demonstrate below. 

In [11]:
# fix the text so it doesn't bother me. 
text = 'I love apples and bananas.'

class_pattern = r'[aeiou]'
or_pattern = r'a|e|i|o|u' 

class_result = re.search(class_pattern, text)
if class_result:
    print("Match found:", class_result.group())
else:
    print("No match found.")

or_result = re.search(or_pattern, text)
if or_result:
    print("Match found:", or_result.group())
else:
    print("No match found.")

Match found: o
Match found: o


The first matching character found in both cases was "o", working exactly as I predicted. Wonderful. 

I can also specify how many times a character or class should repeat using curly braces **{}**. 

In [12]:
# this pattern looks for any 3 consecutive digits, followed by a dash, followed by any 2 consecutive digits, followed by a dash, followed by any 4 consecutive digits. 
# it's looking for social security numbers
ssn_pattern = r'\d{3}-\d{2}-\d{4}'

ssn_text = "My social security number is 123-45-6789."

result = re.search(ssn_pattern, ssn_text)
if result:
    print("SSN found:", result.group())
else:
    print("No SSN found.")

SSN found: 123-45-6789
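
Curly braces also accept a range. The sketch below tries `{2}`, `{1,3}`, and `{3,}` on a made-up sentence. It leans on `\b` (a word boundary), which I haven't covered above, so that each pattern only matches whole numbers. All of the variable names are my own.

```python
import re

counts_text = 'Room 7, floor 12, building 345, zip 84602.'

# \b marks a word boundary, so a pattern can't match just part of a longer number
exactly_two = re.findall(r'\b\d{2}\b', counts_text)     # exactly 2 digits
one_to_three = re.findall(r'\b\d{1,3}\b', counts_text)  # between 1 and 3 digits
three_plus = re.findall(r'\b\d{3,}\b', counts_text)     # 3 or more digits

print(exactly_two)   # ['12']
print(one_to_three)  # ['7', '12', '345']
print(three_plus)    # ['345', '84602']
```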


**re.match()** looks only at the beginning of the string. If the text at the very start of the string does not match the pattern, it returns **None**. This makes the function useful ***IF*** the string has been stripped (that is, had any extra whitespace at the beginning or end removed). Even then, it's only useful when you're looking for something that must come first, such as phone numbers starting with "867-" or book titles that start with "Harry Potter and". I don't think it will be particularly useful for my purposes. 

In [13]:
# searches for any word character (a-z, A-Z, 0-9, or _)
pattern = r'\w'

# changes text to have a space at the start for demonstration
space_text = ' I have an apple and a banana.'

result = re.match(pattern, space_text)
if result:
    print(f"Match found: {result.group()}")
else:
    print("No match found.")
# No match is found because there is a space at the beginning of the string, where it's looking for a-z, A-Z, 0-9, or _. 

No match found.


In [14]:
# does the search, but strips the space_text of extra spaces first. 
result = re.match(pattern, space_text.strip())
if result:
    print(f"Match found: {result.group()}")
else:
    print("No match found.")

# finds a match because the space_text was stripped of extra spaces. 
# it's good to know that I can call string methods like .strip() on the text as I pass it to a Regex function

Match found: I
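
To make the difference between the functions concrete, here is a small sketch comparing **re.match()**, **re.search()**, and **re.fullmatch()** (which I haven't used above; it only succeeds when the whole string matches). The book title is made up for the example.

```python
import re

title = 'Harry Potter and the Goblet of Fire'

# re.match only tries the pattern at the very start of the string
match_start = bool(re.match(r'Harry Potter and', title))
match_middle = bool(re.match(r'Goblet', title))
# re.search scans the whole string for the first match
search_middle = bool(re.search(r'Goblet', title))
# re.fullmatch requires the pattern to cover the entire string
full = bool(re.fullmatch(r'Harry Potter and', title))

print(match_start, match_middle, search_middle, full)  # True False True False
```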


Now I can experiment a little bit, having laid the groundwork. Here's a fairly intense pattern that is meant to find email addresses. 

`[\w\.]+@\w+\.\w+`
* `[\w\.]+` matches one or more characters from the **class** of word characters and literal dots or periods. (Outside a class, `.` stands for any character, while `\.` stands for an actual dot or period.)
* `@` matches the **@** symbol following the match above
* `\w+` matches one or more word characters (letters, numbers, or underscores) following the **@**
* `\.` again matches the literal dot or period after the match above
* `\w+` matches one or more word characters after the dot (`\.`) above

Pretty intense. 
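
The same breakdown can be written directly into the pattern with the `re.VERBOSE` flag, which ignores whitespace and allows `#` comments inside the pattern. This is only a sketch of that idea; the name `email_verbose` and the test sentence are my own.

```python
import re

# the same email pattern, spread over several lines with a comment per piece
email_verbose = re.compile(r"""
    [\w\.]+   # one or more word characters or literal dots
    @         # the @ symbol
    \w+       # one or more word characters
    \.        # a literal dot
    \w+       # one or more word characters
""", re.VERBOSE)

verbose_emails = email_verbose.findall(
    'Write to john.doe@example.com or info123@yahoo.com.')
print(verbose_emails)  # ['john.doe@example.com', 'info123@yahoo.com']
```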


In [15]:
emails_txt = """
Here are some made-up email addresses:
john.doe@example.com
mary_smith123@gmail.com
theodore@example.co.uk
contact_us@company.net
info123@yahoo.com
alice.bob@example.org
support@website.io
sales.department@example.com
test.email@domain.com
random.email@subdomain.co
"""

email_pattern = r'[\w\.]+@\w+\.\w+'

emails = re.findall(email_pattern, emails_txt)

emails

['john.doe@example.com',
 'mary_smith123@gmail.com',
 'theodore@example.co',
 'contact_us@company.net',
 'info123@yahoo.com',
 'alice.bob@example.org',
 'support@website.io',
 'sales.department@example.com',
 'test.email@domain.com',
 'random.email@subdomain.co']

The problem with this, though, is that when you look at the email beginning with 'theodore', this regular expression pattern cuts off the ".uk" at the end, because it doesn't account for the fact that the last set of word characters could be followed by ".something". So I need to adjust the search pattern to fix this. 

In [16]:
fixed_pattern = r'[\w\.]+@\w+\.\w+\.\w+'
# all I did here was add an extra dot at the end and an extra set of word characters. 

emails = re.findall(fixed_pattern, emails_txt)

emails
# The problem with this is that it ONLY returned the theodore email because it's the only one with the extra .something at the end. 
# I suspected that would happen. I need to make that last bit optional. 

['theodore@example.co.uk']

One attempt I can make would be to take the original expression, and add my fixed one as an alternative option using the OR operator `|`. 

In [17]:
fixed_pattern = r'[\w\.]+@\w+\.\w+|[\w\.]+@\w+\.\w+\.\w+'

emails = re.findall(fixed_pattern, emails_txt)

emails

['john.doe@example.com',
 'mary_smith123@gmail.com',
 'theodore@example.co',
 'contact_us@company.net',
 'info123@yahoo.com',
 'alice.bob@example.org',
 'support@website.io',
 'sales.department@example.com',
 'test.email@domain.com',
 'random.email@subdomain.co']

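However, that didn't work. The options are tried from left to right, and the original, shorter pattern already succeeds on 'theodore@example.co', so the longer option never gets a turn. What if I put the alternative option first?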

In [18]:
fixed_pattern = r'[\w\.]+@\w+\.\w+\.\w+|[\w\.]+@\w+\.\w+'

emails = re.findall(fixed_pattern, emails_txt)

emails

['john.doe@example.com',
 'mary_smith123@gmail.com',
 'theodore@example.co.uk',
 'contact_us@company.net',
 'info123@yahoo.com',
 'alice.bob@example.org',
 'support@website.io',
 'sales.department@example.com',
 'test.email@domain.com',
 'random.email@subdomain.co']

HAHA! That did it. That's probably not what anyone was expecting, but it totally works. There's probably a more efficient way to do it, though.
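
A smaller sketch of the same ordering effect, using only the domain part of the address. The names `ending_text`, `short_first`, and `long_first` are hypothetical.

```python
import re

ending_text = 'example.co.uk'

# the shorter option is listed first and succeeds, so the longer one is never tried
short_first = re.search(r'\w+\.\w+|\w+\.\w+\.\w+', ending_text).group()
# listing the longer option first lets it win whenever it can match
long_first = re.search(r'\w+\.\w+\.\w+|\w+\.\w+', ending_text).group()

print(short_first)  # example.co
print(long_first)   # example.co.uk
```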

Looking at the linked cheat sheet, it seems like I can use the `*` zero or more operator. Let's try that. 

In [22]:
fixed_pattern = r'[\w\.]+@\w+\.\w+\.*\w*'
# this takes everything that was already there and at the end looks for zero or more dots followed by zero or more word characters

emails = re.findall(fixed_pattern, emails_txt)

emails

['john.doe@example.com',
 'mary_smith123@gmail.com',
 'theodore@example.co.uk',
 'contact_us@company.net',
 'info123@yahoo.com',
 'alice.bob@example.org',
 'support@website.io',
 'sales.department@example.com',
 'test.email@domain.com',
 'random.email@subdomain.co']

That worked. Awesome. I found two possible solutions, one less elegant, one more so. 
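
One caveat with the `\.*\w*` ending is that the dot and the word characters are optional independently, so an address at the end of a sentence can pick up the sentence's period. A possible alternative, sketched below, is to make ".something" optional as a unit with a non-capturing group, `(?:\.\w+)*`. The test sentence and variable names are made up, and this is just one option rather than the definitive fix.

```python
import re

sentence = 'Write to theodore@example.co.uk or john@example.com.'

star_pattern = r'[\w\.]+@\w+\.\w+\.*\w*'
# (?:...) groups without capturing, so findall still returns the whole match
group_pattern = r'[\w\.]+@\w+\.\w+(?:\.\w+)*'

star_emails = re.findall(star_pattern, sentence)
group_emails = re.findall(group_pattern, sentence)

print(star_emails)   # ['theodore@example.co.uk', 'john@example.com.']
print(group_emails)  # ['theodore@example.co.uk', 'john@example.com']
```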

`re.finditer()` works basically the same way as `re.findall()`, but instead of a list of strings it returns an iterator that yields match objects one at a time. This will be useful later when I'm looking for start dates and end dates, which I will want stored as separate objects as soon as they are found so I don't need to separate them myself later.

In [24]:
# pattern to search for is any number of digits
pattern = r'\d+'
text = 'I have 3 apples and 5 bananas and 10 oranges and 3 strawberries and 9 melons and only 1 dragonfruit.'

matches = re.finditer(pattern, text)
for match in matches:
    print(f"Match found: {match.group()}")
# returns six individual matches, one per number in the text

Match found: 3
Match found: 5
Match found: 10
Match found: 3
Match found: 9
Match found: 1
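
Because each item from **re.finditer()** is a match object rather than a plain string, it also carries the position of the match. A minimal sketch with a shorter sentence; `span_text` and `spans` are names I made up.

```python
import re

span_text = 'I have 3 apples and 5 bananas.'

spans = []
for m in re.finditer(r'\d+', span_text):
    # .group() is the matched text, .span() is its (start, end) position
    spans.append((m.group(), m.span()))

print(spans)  # [('3', (7, 8)), ('5', (20, 21))]
```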


Since I only have a few minutes left, I want to play around and try something that may be useful when I'm getting those dates. 

In [26]:
# pattern to search for is any number of digits
pattern = r'\d+'
text = 'I have 3 apples and 5 bananas and 10 oranges and 3 strawberries and 9 melons and only 1 dragonfruit.'

matches = re.finditer(pattern, text)
# sets an initial count at zero
count = 0
for match in matches:
    # for every match found, add 1 to the counter
    count += 1
    # then, if the count divided by 2 has a remainder of 1, print start date and match found
    if count %2 == 1:
        print(f'start date: {match.group()}')
    # otherwise, or if the count divided by 2 has a remainder of 0, print end date and match found
    else:
        print(f'end date: {match.group()}')

start date: 3
end date: 5
start date: 10
end date: 3
start date: 9
end date: 1


Ah! It doesn't make sense here, but dang that's amazing! It totally worked exactly the way I wanted it to. 

When I use this kind of functionality in the future, I will be able to categorize the first date appearance as start date and the second date appearance as end date when I'm doing my Come, Follow Me Breakdown. Awesome. 
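
Another way to pair things up, sketched here as an option rather than a final design, is to grab all the matches with **re.findall()** and zip every other item together. The names `pair_text`, `numbers`, and `pairs` are hypothetical. Note that `zip` silently drops a leftover item if the number of matches is odd.

```python
import re

pair_text = 'I have 3 apples and 5 bananas and 10 oranges and 3 strawberries.'
numbers = re.findall(r'\d+', pair_text)

# numbers[0::2] takes items 0, 2, ... and numbers[1::2] takes items 1, 3, ...
# zip then pairs them up as (start, end)
pairs = list(zip(numbers[0::2], numbers[1::2]))
print(pairs)  # [('3', '5'), ('10', '3')]
```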

Tomorrow, I'll wrap up learning the basics of Regex and continue on getting my General Conference Breakdown tool built. 