## Lesson 29:

### RegEx Example Program: A Phone and Email Scaraper


This program will do the following:

* Create RegEx for phone numbers
* Create a RegEx for Email Address
* Get the Text (Pyperclip or Requests)
* Extract the email/phone from the text
* Copy the extracted email/phone to the clipboard

#### Create RegEx Objects

In [53]:
# Import Regex Module
import re

## Create Phone Regex
# Basic Phone Number: 415-555-0000
# Phone Numeber without Area Code: 555-0000
# Phone Number with Parenthesis: (415)-555-5000
# Phone Numbers with Extensions: ext /ext. /x12345

phoneRegex = re.compile(
r'''
(                       # entire number (stored as its own group for .findall())
((\d\d\d)|(\(\d\d\d\))) # area code (optional, with and without parenthesis)
(\s|-)                  # first seperator (space or -)
\d\d\d                  # first 3 digits
-                       # second seperator
\d\d\d\d                # last 4 digits
(((ext(\.)?\s)|x))?     # extension word part (optional, as 'ext ', optional 'ext. ' or 'x')
((\d){2-5})?            # extension digit part (optional, with at least 2 or at most 5 digits)
)
''', re.VERBOSE)
# Much more readible in verbose mode than single mode

## Create Email Regex
# Basic Email: something@something.com
# Fancy Email: some.+_thing@something.com
# Alternate domain: something@some.+_thing.anywhere

emailRegex = re.compile(
r'''
[a-zA-Z0-9_.+]+ # name (Custom character class, one or more lowercase, uppercase, numbers and symbols)
@               # @ symbol
[a-zA-Z0-9_.+]+ # domain name (Custom character class, one or more lowercase, uppercase, numbers and symbols)
''', re.VERBOSE)
# Custom character classes do NOT require escaping characters
# Much more readible in verbose mode than single mode


#### Text Importer

You can use `pyperclip` to copy and paste text to and from the clipboard, but this implementation will use `PyPDF2` to [deal with files in a local folder](https://automatetheboringstuff.com/chapter13/). 

All the files used in these exercises are available [here](http://www.nostarch.com/download/Automate_the_Boring_Stuff_onlinematerials.zip), and are stored in this repository in the local `files` folder.

In [63]:
import PyPDF2

## Example PyPDF2 Application
# An illustration of PyPDF2 functions
pdfFile = open('files/meetingminutes.pdf', 'rb') # Open the PDF file from the local folder in pdfFile
pdfReader = PyPDF2.PdfFileReader(pdfFile) # Store the PDF reader object in pdfReader
pageObject = pdfReader.getPage(0) # Get the zero page from the pdfReader object and store in the pageObject; PyPDF2 starts at 0 index
print(pageObject.extractText()) # Print extracted text from the zero page from the pageObject
pdfFile.close() # Close PDF file

## PDFText 1: Phone Directory 

pdfFile1 = open('files/112065.pdf', 'rb') # Open the PDF file from the local folder in pdfFile
pdfReader1 = PyPDF2.PdfFileReader(pdfFile1) # Store the PDF reader object in pdfReader
print(pdfReader1.getNumPages()) # Print number of pages
pdf1PageText = [] # Create empty list object to store text

for page in range(0, pdfReader1.getNumPages()): # Loop through all PDF Pages
    newpdf1PageText = pdfReader1.getPage(page).extractText() # Extract text into a new storage object
    #print(newpdfText)
    pdf1PageText.append(newpdf1PageText) # Append the list of page texts with the text on each page

pdf1FullText = ' \n '.join(pdf1PageText) # join the texts into one long string
print(pdf1FullText) # Print text

## PDFText 2: Email and Phone Directory 

pdfFile2 = open('files/26645.pdf', 'rb') # Open the PDF file from the local folder in pdfFile
pdfReader2 = PyPDF2.PdfFileReader(pdfFile2) # Store the PDF reader object in pdfReader
print(pdfReader2.getNumPages()) # Print number of pages
pdf2PageText = [] # Create empty list object to store text

for page in range(0, pdfReader2.getNumPages()): # Loop through all PDF Pages
    newpdf2PageText = pdfReader2.getPage(page).extractText() # Extract text into a new storage object
    #print(newpdfText)
    pdf2PageText.append(newpdf2PageText) # Append the list of page texts with the text on each page

pdf2FullText = ' \n '.join(pdf2PageText) # join the texts into one long string
print(pdf2FullText) # Print text


OOFFFFIICCIIAALL  BBOOAARRDD  MMIINNUUTTEESS   Meeting of 
March 7
, 2014
        
     The Board of Elementary and Secondary Education shall provide leadership and 
create policies for education that expand opportunities for children, empower 
families and communities, and advance Louisiana in an increasingly 
competitive glob
al market.
 BOARD 
 of ELEMENTARY
 and 
 SECONDARY
 EDUCATION
  
57
Organizational DirectoryThis customized report includes the following section(s):United States Department of StateTelephone DirectoryUNCLASSIFIEDProvided by Global Information Services, A/GIS1/20/2016Cover 
 Organizational DirectoryUnited States Department of State2201 C Street NW, Washington, DC 20520Office of the Secretary (S)SecretarySecretary John  Kerry 7th Floor202-647-9572Chief of Staff Jonathan J. Finer 7234202-647-8633Deputy Chief of Staff Jennifer  Stout 7226202-647-5697Deputy Chief of Staff Thomas  Sullivan 7226202-647-9071Executive Assistant Lisa  Kenna 7226202-647-9572Office Manager

#### Regex Extractions

In [58]:
## Extract Match Object Lists
extractedPhone = phoneRegex.findall(pdfFullText)
extractedEmail = emailRegex.findall(pdfFullText)

## Loop through Match Objects and pull out the first gorups
allPhoneNumbers = []
for phoneNumber in extractedPhone:
        allPhoneNumbers.append(phoneNumber[0]) # Store first string from the nth tuples (whole number) in list

#allEmails = []
#for Email in extractedEmail:
#        allEmails.append(extractedEmail[Email]) # Store first string at the nth entry (whole email) in list; don't need index because no tuples (no subgroups)

## Print Matching Strings
#print(allPhoneNumbers)
#print(allEmails)

## Format the results
results = '\n'.join(allPhoneNumbers) + '\n' + '\n'.join(allEmails) 
# Join every phone number into a string, seperated by a new line, create a new line, then join every email into a string, seperated by a new line, and print out the whole thing

print(phoneRegex)
print(emailRegex)
print(extractedEmail)
#print(results)

re.compile("\n(                       # entire number (stored as its own group for .findall())\n((\\d\\d\\d)|(\\(\\d\\d\\d\\))) # area code (optional, with and without parenthesis)\n(\\s|-)                  # fi, re.VERBOSE)
re.compile('\n[a-zA-Z0-9_.+]+ # name (Custom character class, one or more lowercase, uppercase, numbers and symbols)\n@               # @ symbol\n[a-zA-Z0-9_.+]+ # domain name (Custom character class, one or mor, re.VERBOSE)
[]


This is basically a find and replace feature with regex.

Regex objects also have a `re.verbose` argument, to allow multline line comments for complicated regex patterns, helping readabilitiy. 


In [10]:
phoneRegex = re.compile(r'''
(\d\d\d-)|(\(d\d\d\) )   # area code (without parenthesis with dash, with parenthesis without dash )
-                        # first dash
\d\d\d                   # first 3 digits
-                        # second dash
\d\d\d\d                 # last 4 digits
\sx\d{2,4}               # Extension, like x1234, with at least 2 and at most 4 digits
'''
, re.VERBOSE) # Allows multiline regex strings that ignore newlines, allowing for new comments/documentation on every line. 


The `re.compile()` function can only take one additional parameter, so if you wanted to use `re.I` to ignore cases, `re.DOTALL` to allow `.*` to see newlines, and `re.VERBOSE` to use multiline regex, you have to apply them with bitwise `OR`;`|`. 

In [None]:
phoneRegex = re.compile(r'''
(\d\d\d-)|(\(d\d\d\) )   # area code (without parenthesis with dash, with parenthesis without dash )
-                        # first dash
\d\d\d                   # first 3 digits
-                        # second dash
\d\d\d\d                 # last 4 digits
\sx\d{2,4}               # Extension, like x1234, with at least 2 and at most 4 digits
'''
, re.I | re.DOTALL | re.VERBOSE) # Activites ignorecase, dotall, and verbose arguments simultaneously. 


This syntax is from old code, and does not typically apply for other functions, just `re.compile()`.

## Recap
* The `.sub` regex method will substitute matches with some other text.
* Using `\1`, `\2`, and so on will substitute group 1, 2, etc into the regex pattern.
* Passing `re.VERBOSE` lets you add whitespace and comments to the regex string passed to `re.compile()` (even in raw strings.)
* If you want to pass multiple arguments to `re.compile()`, like `re.DOTALL`, `re.IGNORECASE`, and `re.VERBOSE`) combine them with the `|` bitwise operator.