# Regular Expressions

# Tasks today:
1) <b>Importing</b> <br>
2) <b>Using Regular Expressions</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) re.compile() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.findall() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
3) <b>Sets</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Integer Ranges <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Character Ranges <br>
4) <b>Counting Occurrences</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) {x} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) {x,y} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) ? <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) * <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) + <br>
5) <b>In-Class Exercise #1</b> <br>
6) <b>Escaping Characters</b> <br>
7) <b>Grouping</b> <br>
8) <b>In-Class Exercise #2</b> <br>
9) <b>Opening a File</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) with open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Store the String in a Variable <br>
10) <b>Regex Project</b> <br>

### Importing <br>
<p>Regular expressions are available in most programming languages. In Python, they are provided by the built-in module 're', which must be imported before use.</p>

In [1]:
# import re
import re

### Using Regular Expressions <br>
<p>Regular expressions give us the ability to search for patterns within text, strings, files, etc. They have many uses, such as security measures, searching, filtering, and pattern recognition.</p>

#### RegEx Cheatsheet

In [2]:
########################
# DO NOT RUN THIS CELL #
########################

# a, X, 9, < -- ordinary characters just match themselves exactly.
# . (a period) -- matches any single character except newline '\n'
# \w -- matches a "word" character: a letter or digit or underscore [a-zA-Z0-9_].
# \W -- matches any non-word character.
# \b -- matches word boundary (in between a word character and a non word character)
# \s -- matches a single whitespace character -- space, newline, return, tab
# \S -- matches any non-whitespace character.
# \t, \n, \r -- tab, newline, return
# \d -- matches any numeric digit [0-9]
# \D -- matches any non-digit character.
# ^ -- matches the beginning of the string; as the first character inside a set, as in [^abc], it negates the set
# $ -- matches the end of the string
# \ -- escapes special character.
# (x|y|z) matches exactly one of x, y or z.
# (x) in general is a remembered group. We can get the value of what matched by using the groups() method of the object returned by re.search.
# x? matches an optional x character (in other words, it matches an x zero or one times).
# x* matches x zero or more times.
# x+ matches x one or more times.
# x{m,n} matches an x character at least m times, but not more than n times.
# (?:...) matches an expression but does not capture it. Non-capturing group.
# (?=...) matches only if the suffix follows, but excludes it from the match. Positive lookahead.
# a(?=b) will match the "a" in "ab", but not the "a" in "ac"
# In other words, a(?=b) matches the "a" which is followed by the string 'b', without consuming what follows the a.
# (?!...) matches if the suffix is absent. Negative lookahead.
# a(?!b) will match the "a" in "ac", but not the "a" in "ab"
# (?<=...) positive lookbehind
# (?<!...) negative lookbehind
# [] matches any single character from the set listed inside the brackets

########################
# DO NOT RUN THIS CELL #
########################
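The lookahead and lookbehind entries in the cheatsheet are easier to follow with a small runnable example. The following is only a sketch; the sample strings are made up for illustration.

```python
import re

# Positive lookahead: match "a" only when the next character is "b".
ahead = [m.span() for m in re.finditer(r'a(?=b)', 'ab ac')]

# Negative lookahead: match "a" only when the next character is NOT "b".
not_ahead = [m.span() for m in re.finditer(r'a(?!b)', 'ab ac')]

# Positive lookbehind: digits that come right after a dollar sign.
priced = re.findall(r'(?<=\$)\d+', 'cost $45 for 30 units')

# Negative lookbehind: whole numbers that do not follow a dollar sign.
unpriced = re.findall(r'(?<!\$)\b\d+', 'cost $45 for 30 units')

print(ahead)      # [(0, 1)]
print(not_ahead)  # [(3, 4)]
print(priced)     # ['45']
print(unpriced)   # ['30']
```

In every case the lookaround text itself is not part of the match; only the characters outside the parentheses are returned.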

##### re.compile()

In [3]:
# re.compile() turns a pattern string into a reusable pattern object
# Patterns built from str are Unicode-aware by default (hence the re.UNICODE flag in the output)

pattern = re.compile('123abcd')
pattern

re.compile(r'123abcd', re.UNICODE)
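As a quick illustration of what compiling buys you, the sketch below compares a compiled pattern with the equivalent module-level call. The sample text is an assumption chosen for the example.

```python
import re

pattern = re.compile('123abcd')
sample = '123abcd and 123abcd'

# The compiled object offers the same operations as the module-level functions.
compiled_result = pattern.findall(sample)
module_result = re.findall('123abcd', sample)

print(compiled_result)                   # ['123abcd', '123abcd']
print(compiled_result == module_result)  # True
print(pattern.pattern)                   # 123abcd (the original pattern string)
```

Compiling once is mainly a convenience when the same pattern is reused in several places, since the pattern string only has to be written (and corrected) in one spot.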

##### re.match()

In [4]:
# match() only checks at the beginning of the string; useful for validating a single value (an email or phone number in a form)

match = pattern.match('123abcd123')  # returns a Match object if the string starts with the pattern, otherwise None
print(match)

# Accessing the span of the match
print(match.span())     # span() returns the (start, end) positions of the match, here (0, 7)
'123abcd123'[0:7]

<re.Match object; span=(0, 7), match='123abcd'>
(0, 7)


'123abcd'

##### re.findall()

In [5]:
#returns all instances of the pattern as a list

finders = pattern.findall('123abcd abcd123 ab abc 123abcd RegEx is fun a')
print(finders)

['123abcd', '123abcd']


##### re.search()

In [6]:
# search() scans the whole string and returns the first match found anywhere in it

random_string = "123 123 234 123abcd abcd 123abcd abc"
searching = pattern.search(random_string)
print(searching)
span = searching.span()
print(span)

print(random_string[span[0]: span[1]])  # slice the string with the span to pull out the matched text
                        # span is a (start, end) tuple, here (12, 19)
random_string[12:19]

<re.Match object; span=(12, 19), match='123abcd'>
(12, 19)
123abcd


'123abcd'
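To make the difference between `match()` and `search()` concrete, here is a small sketch that also includes `fullmatch()`, which requires the whole string to fit the pattern. The sample string is made up for this example.

```python
import re

pattern = re.compile('123abcd')
text = 'abc 123abcd'

at_start = pattern.match(text)              # None: the text does not start with the pattern
anywhere = pattern.search(text)             # Match object covering positions 4 to 11
whole_ok = pattern.fullmatch('123abcd')     # Match object: the entire string is the pattern
whole_bad = pattern.fullmatch('123abcd123') # None: extra characters after the pattern

print(at_start)
print(anywhere.span())  # (4, 11)
print(whole_ok is not None, whole_bad is None)  # True True
```

Because all three methods return `None` when nothing matches, it is worth checking the result before calling `.span()` or `.group()` on it.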

### Sets <br>
<p>The following cells will allow you to use regular expressions to search for certain values within a range such as numbers 1 through 4.</p>

##### [a-z] or [A-Z] - any lowercase/uppercase letters from a to z<br/>[^2] - anything that's not 2

##### Integer Ranges

In [7]:
pattern_int = re.compile('[0-7][7-9][0-3]')  # each set matches exactly one character from its range
                        # 6 is in [0-7], 7 is in [7-9], 3 is in [0-3]
random_numbers = pattern_int.search('67383') 
print(random_numbers)
span = random_numbers.span()  # (start, end) tuple of the match, here (0, 3)

'67383'[span[0]:span[1]]  # will give us '673'

<re.Match object; span=(0, 3), match='673'>


'673'

##### Character Ranges

In [8]:
char_pattern = re.compile('[A-Z][a-z]')  # looking for one uppercase letter followed by one lowercase letter

found = char_pattern.findall('Hello There Mr.Anderson')
print(found)

['He', 'Th', 'Mr', 'An']
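The heading for this section mentions negated sets such as `[^2]`, so here is a short sketch of how they behave. The input strings are invented for the example.

```python
import re

# [^2] matches any single character except '2'.
not_two = re.findall('[^2]', '12321')

# A negated set can list several characters: here, anything that is not a vowel or a space.
consonant_runs = re.findall('[^aeiou ]+', 'hello there')

print(not_two)         # ['1', '3', '1']
print(consonant_runs)  # ['h', 'll', 'th', 'r']
```

Note that `^` only negates when it is the first character inside the brackets; outside a set it anchors the match to the start of the string.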


### Counting Occurrences

##### {x} - something that occurs {num_of_times}

In [9]:
char_pattern_count = re.compile('[A-Z][a-z][0-3]{2}') # one capital letter, one lowercase letter, then exactly two digits from [0-3]

found_count = char_pattern_count.findall('Hello Mr.An33derson')
print(found_count)  # you get ['An33']: one capital, one lowercase, and two digits that each fall in [0-3]

['An33']


##### {x,y} - something that occurs between x and y times

In [10]:
# the bounds inside {} are inclusive: {1,5} means 1 to 5 repetitions
# m OR mm OR mmm OR mmmm OR mmmmm
# each run of consecutive m's is matched as a whole, wherever it appears in a word
random_pattern = re.compile('m{1,5}') 
random_statement = random_pattern.findall('This is an example of a regular expression trying to find one m, more than one mmm or five mmmmms ')

print(random_statement)  # every run of 1 to 5 consecutive m's


['m', 'm', 'm', 'mmm', 'mmmmm']
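The bounds in braces can also be left open on one side. The sketch below shows that, along with one common pitfall: a space inside the braces stops Python from treating them as a quantifier. The sample text is an assumption for illustration.

```python
import re

text = 'm mm mmm mmmm'

# {2,} means "at least two", with no upper limit.
at_least_two = re.findall('m{2,}', text)

# {,2} means "at most two" (including zero), shown here with fullmatch.
two_ok = re.fullmatch('m{,2}', 'mm') is not None
three_ok = re.fullmatch('m{,2}', 'mmm') is not None

# With a space inside the braces, the braces are read as literal characters.
with_space = re.findall('m{1, 5}', text)

print(at_least_two)      # ['mm', 'mmm', 'mmmm']
print(two_ok, three_ok)  # True False
print(with_space)        # []
```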


##### ? - something that occurs 0 or 1 time

In [11]:
# ? makes the character before it optional: it may occur zero times or one time (handy for form checks such as phone numbers)

pattern = re.compile('Mrss?')

found_pattern = pattern.findall('Hello there Mr.Anderson, how is Mrs.Anderson, and Mrsss.Anderson?')
found_pattern

['Mrs', 'Mrss']
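Since the comment above mentions phone numbers, here is a rough sketch of how `?` can make the parentheses and the separator optional. The pattern and the sample text are assumptions for this example, not a complete phone number validator.

```python
import re

# \(? and \)? are optional literal parentheses; [ -]? is an optional space or dash.
phone_pattern = re.compile(r'\(?\d{3}\)?[ -]?\d{3}-\d{4}')

text = 'Call (555) 555-5555 or 555-555-5552, not 55-5555.'
phones = phone_pattern.findall(text)

print(phones)  # ['(555) 555-5555', '555-555-5552']
```

Both formats are accepted by a single pattern because each optional piece is allowed to be missing.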

##### * - something that occurs at least 0 times

In [12]:
pattern_m = re.compile('M*s') # zero or more M's, which must be followed by an s

found_m = pattern_m.findall('MMMs name is Ms.Smith. This is Msssss')
print(found_m)

['MMMs', 's', 'Ms', 's', 's', 'Ms', 's', 's', 's', 's']


##### + - something that occurs at least once

In [13]:
pattern = re.compile('M+s') # one or more M's followed by exactly one s
found_patt = pattern.findall('MMMs name is Ms.Smith This is MMMsssssssss')
print(found_patt)

['MMMs', 'Ms', 'MMMs']


##### In-class exercise 1: 

Use a regular expression to find every number in the given string

In [14]:
#my_string = "This string has 10909090 numbers, but it is only 1 string. I hope you solve this 2day."
#output = ['10909090', '1', '2']
                                # + means one or more of whatever directly precedes it
pattern = re.compile('[0-9]+') # a number is one or more digits in a row
my_string = "This string has 10909090 numbers, but it is only 1 string. I hope you solve this 2day."

found_nums = pattern.findall(my_string)
print(found_nums)


['10909090', '1', '2']


### Escaping Characters

##### \w - look for any word character (letter, digit, or underscore)<br/>\W - look for anything that isn't a word character

[History on Unicode](http://unicode.org/standard/WhatIsUnicode.html)

[More on Unicode Characters](https://en.wikipedia.org/wiki/List_of_Unicode_characters)

In [15]:
pattern_1 = re.compile(r'\w+')  # raw strings keep the backslash intact for the regex engine
pattern_2 = re.compile(r'\W+')

found_1 = pattern_1.findall('This is a sentence. With an, exclamation mark at the end!')
found_2 = pattern_2.findall('This is a sentence. With an, exclamation mark at the end!')

print(found_1)
print(found_2)  # grabbing every run of characters that are not word characters

['This', 'is', 'a', 'sentence', 'With', 'an', 'exclamation', 'mark', 'at', 'the', 'end']
[' ', ' ', ' ', '. ', ' ', ', ', ' ', ' ', ' ', ' ', '!']
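Because `\w` is Unicode-aware for `str` patterns, it also matches accented letters, and it always includes the underscore. The sketch below contrasts the default behavior with the `re.ASCII` flag; the sample text is made up for illustration.

```python
import re

# 'café_1 naïve', written with escapes so the source stays plain ASCII.
sample = 'caf\u00e9_1 na\u00efve'

# Default (Unicode) mode: accented letters count as word characters.
unicode_words = re.findall(r'\w+', sample)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r'\w+', sample, flags=re.ASCII)

print(unicode_words)  # ['café_1', 'naïve']
print(ascii_words)    # ['caf', '_1', 'na', 've']
```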


##### \d - look for any digit 0-9<br/>\D - look for anything that isn't a digit

In [16]:
# looking for the day number of a date followed by its two-letter suffix (st/nd/rd/th)
pattern_nums = re.compile(r'\d{1,2}[a-z]{2}') # \d{1,2} is one or two digits [0-9]; [a-z]{2} is exactly two lowercase letters

found_date = pattern_nums.findall('Today is the 19th, tomorrow is the 20th. My birthday is the 3rd')

pattern_no_num = re.compile(r'\D+')
found_no_num = pattern_no_num.findall('nate4569')  # \D+ matches runs of non-digits, so only 'nate' comes back
print(found_date)
print(found_no_num)

['19th', '20th', '3rd']
['nate']


##### \s - look for any white space<br/>\S - look for anything that isn't whitespace

In [17]:
pattern_no_space = re.compile(r'\S[a-z]+')  # one non-whitespace character followed by one or more lowercase letters
pattern_space = re.compile(r'\s+')   # one or more whitespace characters in a row

found_space = pattern_space.findall('Are you afraid of the dark?')
print(found_space)

found_no_space = pattern_no_space.findall('Are you :afriad of the dark?')
print(found_no_space)

[' ', ' ', ' ', ' ', ' ']
['Are', 'you', ':afriad', 'of', 'the', 'dark']


##### \b - look for boundaries or edges of a word<br/>\B - look for anything that isn't a boundary

In [18]:
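# \b matches at a word boundary: the position between a word character and a non-word character (or the start/end of the string)

pattern_bound = re.compile(r'\bTheCodingTemple\b')  # must not be glued to other letters, digits, or underscores
pattern_bound_none = re.compile(r'\BTheCodingTemple\B')  # must have word characters directly on both sides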

found_bound = pattern_bound.findall('TheCodingTemple    ')
print(found_bound)

no_found_bound = pattern_bound_none.findall('1234TheCodingTemple123')
print(no_found_bound)


th = re.compile(r'\bthe')
new_string = 'three then there thirty'
th.findall(new_string)

['TheCodingTemple']
['TheCodingTemple']


['the', 'the']

In [19]:
print(r'Hey my name is \nNate')  # r marks a raw string: backslashes are kept literally, so \n is printed instead of a newline

Hey my name is \nNate
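Raw strings matter most for escapes such as `\b`, which Python itself interprets as a backspace character in a normal string. The following sketch shows the difference; the sample text is an assumption for the example.

```python
import re

# In a normal string '\b' is a single backspace character; in a raw string it is a backslash plus 'b'.
plain_len = len('\b')
raw_len = len(r'\b')

# Without the r prefix the regex looks for a literal backspace followed by 'the'.
plain = re.findall('\bthe', 'then there')

# With the r prefix the regex engine receives \b and treats it as a word boundary.
raw = re.findall(r'\bthe', 'then there')

print(plain_len, raw_len)  # 1 2
print(plain)               # []
print(raw)                 # ['the', 'the']
```

Writing every pattern as a raw string is a simple habit that avoids this class of silent mismatch.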


### Grouping

In [20]:
my_string_again = "Max Smith, aaron rodgers, Sam Darnold, LeBron James, Micheal Jordan, Kevin Durant, Patrick McCormick"

# Pattern with 2 separate groups: one for the first name, one for the last name

pattern_name = re.compile('([A-Z][A-Za-z]+) ([A-Z][A-Za-z]+)') # first group, a space, second group

found_names = pattern_name.findall(my_string_again)
print(found_names)
# with groups, findall returns a list of (first, last) tuples; 'aaron rodgers' is skipped because it is not capitalized

for name in found_names:
    print(f'First name: {name[0]}\nLast Name: {name[1]}')
    
    
#splitting and using .search() syntax to get a match object
for name in my_string_again.split(', '):
    match = pattern_name.search(name)
    if match:
        print(name)
    else:
        print('not a name')


[('Max', 'Smith'), ('Sam', 'Darnold'), ('LeBron', 'James'), ('Micheal', 'Jordan'), ('Kevin', 'Durant'), ('Patrick', 'McCormick')]
First name: Max
Last Name: Smith
First name: Sam
Last Name: Darnold
First name: LeBron
Last Name: James
First name: Micheal
Last Name: Jordan
First name: Kevin
Last Name: Durant
First name: Patrick
Last Name: McCormick
Max Smith
not a name
Sam Darnold
LeBron James
Micheal Jordan
Kevin Durant
Patrick McCormick
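The cheatsheet mentions reading captured values from the object returned by `re.search`. Here is a small sketch of the accessors on a Match object, using optional named groups; the names `first` and `last` are chosen only for this example.

```python
import re

# (?P<name>...) gives a group a name in addition to its number.
pattern_name = re.compile(r'(?P<first>[A-Z][A-Za-z]+) (?P<last>[A-Z][A-Za-z]+)')

match = pattern_name.search('aaron rodgers, Sam Darnold')

whole = match.group(0)        # the full matched text
first = match.group(1)        # first group, by number
last = match.group('last')    # second group, by name
both = match.groups()         # all groups as a tuple
as_dict = match.groupdict()   # named groups as a dictionary

print(whole)    # Sam Darnold
print(first)    # Sam
print(last)     # Darnold
print(both)     # ('Sam', 'Darnold')
print(as_dict)  # {'first': 'Sam', 'last': 'Darnold'}
```

Named groups are optional, but they keep the code readable once a pattern has more than two or three groups.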


##### In-class Exercise 2:

Write a function using regular expressions to find the domain name in the given email addresses (and return None for the invalid email addresses)<br><b>HINT: Use '|' for either or</b>

In [21]:
my_emails = ["jordanw@codingtemple.orgcom", "pocohontas1776@gmail.com", "helloworld@aol..com",
             "yourfavoriteband@g6.org", "@codingtemple.com"]

# You can also use the $ at the end of your compile expression -- it anchors the match to the end of the string

# .com OR .org => com|org

#Expected output:
#None
#pocohontas1776@gmail.com
#None
#yourfavoriteband@g6.org
#None

pattern_email = re.compile(r'(\w+)@(\w+)\.(com|org)') # word characters, @, word characters, a literal dot (escaped as \.), then com or org
found = pattern_email.findall(my_emails[1])
print(found)

#Grouping example
print(f'Username is {found[0][0]} / domain is {found[0][1]} ')

def validateEmail(email_list):
    pattern_email = re.compile(r'(\w+)@(\w+)\.(com|org)$') # \. is a literal dot; $ requires com or org to be the very end of the string
    
    for email in email_list:
        if pattern_email.match(email):
            print(email)
        else:
            print('None')
validateEmail(my_emails)




[('pocohontas1776', 'gmail', 'com')]
Username is pocohontas1776 / domain is gmail 
None
pocohontas1776@gmail.com
None
yourfavoriteband@g6.org
None
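The exercise asks for the domain name and for None on invalid addresses. One possible way to return those values rather than print them is sketched below. The function name `extract_domain` is hypothetical, and the pattern only covers the simple `.com`/`.org` addresses used in this exercise. An unescaped `.` would match any character, so the dot is written as `\.` here.

```python
import re

# Assumed simple shape: word characters, @, word characters, a literal dot, then com or org.
EMAIL_PATTERN = re.compile(r'(\w+)@(\w+)\.(com|org)')

def extract_domain(email):
    """Return the domain (e.g. 'gmail.com') or None if the address is invalid."""
    # fullmatch requires the whole string to fit the pattern, so no ^ or $ is needed.
    match = EMAIL_PATTERN.fullmatch(email)
    if match is None:
        return None
    return f'{match.group(2)}.{match.group(3)}'

my_emails = ["jordanw@codingtemple.orgcom", "pocohontas1776@gmail.com", "helloworld@aol..com",
             "yourfavoriteband@g6.org", "@codingtemple.com"]

domains = [extract_domain(email) for email in my_emails]
print(domains)  # [None, 'gmail.com', None, 'g6.org', None]
```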


### Opening a File <br>
<p>Python gives us a couple of ways to open and read files; below are the two used most often.</p>

##### open()

In [22]:
f = open("files/names.txt")

data = f.read()
print(data)

f.close()

Hawkins, Derek	derek@codingtemple.com	(555) 555-5555	Teacher, Coding Temple	@derekhawkins
Zhai, Mo	mozhai@codingtemple.com	(555) 555-5554	Teacher, Coding Temple
Johnson, Joe	joejohnson@codingtemple.com		Johson, Joe
Osterberg, Sven-Erik	governor@norrbotten.co.se		Governor, Norrbotten	@sverik
, Tim	tim@killerrabbit.com		Enchanter, Killer Rabbit Cave
Butz, Ryan	ryanb@codingtemple.com	(555) 555-5543	CEO, Coding Temple	@ryanbutz
Doctor, The	doctor+companion@tardis.co.uk		Time Lord, Gallifrey
Exampleson, Example	me@example.com	555-555-5552	Example, Example Co.	@example
Pael, Ripal	ripalp@codingtemple.com	(555) 555-5553	Teacher, Coding Temple	@ripalp
Vader, Darth	darth-vader@empire.gov	(555) 555-4444	Sith Lord, Galactic Empire	@darthvader
Fernandez de la Vega Sanz, Maria Teresa	mtfvs@spain.gov		First Deputy Prime Minister, Spanish Gov



##### with open()

In [23]:
with open('files/names.txt') as f:
    data = f.readlines()
print(data[0])

Hawkins, Derek	derek@codingtemple.com	(555) 555-5555	Teacher, Coding Temple	@derekhawkins



##### re.match()

In [25]:
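# data came from readlines(), so it is a list of strings, not a single string.
# Passing the list to re.match() raises the TypeError shown below; match against one line (data[0]) or ''.join(data) instead.
print(re.match('Hawkins, Derek', data)) 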

TypeError: expected string or bytes-like object
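The TypeError above comes from the difference between `read()` and `readlines()`. The sketch below reproduces it with a small temporary file, so it runs without `files/names.txt`; the two sample lines are stand-ins written only for this example.

```python
import os
import re
import tempfile

# Write a tiny stand-in file (hypothetical sample data, not the real names.txt).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Hawkins, Derek\tderek@codingtemple.com\n'
              'Zhai, Mo\tmozhai@codingtemple.com\n')
    path = tmp.name

with open(path) as f:
    lines = f.readlines()   # a list with one string per line
with open(path) as f:
    text = f.read()         # the whole file as one string
os.remove(path)

print(type(lines).__name__, type(text).__name__)  # list str

first = re.match('Hawkins, Derek', lines[0])        # works: one line is a string
whole = re.search('mozhai@codingtemple.com', text)  # works: the full text is a string

try:
    re.match('Hawkins, Derek', lines)               # fails: a list is not a string
    error = None
except TypeError as exc:
    error = str(exc)

print(first.span())  # (0, 14)
print(error)
```

Either approach works with `re`: loop over the list and test each line, or read the file as a single string and search it once.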

##### re.search()

In [None]:
# data is a list of lines, so join it into one string before searching
print(re.search('ripalp@codingtemple.com', ''.join(data)))

##### Store the String in a Variable

In [None]:
answer = input('What do you want to look for? ')
found = re.findall(answer, ''.join(data))  # join the list of lines into one string; findall expects a string

if found:
    print(f'here is your answer...{found}')
else:
    print('Sorry, that is not in this data set')
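When the search text comes from a user, characters such as `+` or `.` are interpreted as regex syntax. A minimal sketch of how `re.escape` avoids that is shown below, using one email address from the data file as the assumed input.

```python
import re

text = 'Doctor, The\tdoctor+companion@tardis.co.uk\t\tTime Lord, Gallifrey'
query = 'doctor+companion@tardis.co.uk'

# Unescaped, the + means "one or more r", so the literal '+' in the text is never matched.
unescaped = re.findall(query, text)

# re.escape() backslash-escapes the special characters so the query is matched literally.
escaped = re.findall(re.escape(query), text)

print(unescaped)  # []
print(escaped)    # ['doctor+companion@tardis.co.uk']
```

Skipping `re.escape` is still reasonable when the user is expected to type an actual regular expression rather than plain text.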

### Homework Exercise #3 <br>
<p>Print each person's name and Twitter handle using groups. The output should look like:</p>
<p>==============<br>
   Full Name / Twitter<br>
   ==============</p>
Derek Hawkins / @derekhawkins

 Erik Sven-Osterberg / @sverik

 Ryan Butz / @ryanbutz

 Example Exampleson / @example

 Ripal Pael / @ripalp

 Darth Vader / @darthvader

In [26]:
import re


# groups: (1) optional leading word characters, (2) last name, (3) first name, (4) Twitter handle
pattern = re.compile(r"([\w]*)([A-Z][a-z]+), ([A-Z][a-z]+).*\s(@[A-Za-z]+)")

for t_handle in data:
    found = pattern.search(t_handle)
#     print(found)
    
    if found:
        print(f"\n{found.group(3)} {found.group(2)}{found.group(1)} / {found.group(4)}")



Derek Hawkins / @derekhawkins

Sven Osterberg / @sverik

Ryan Butz / @ryanbutz

Example Exampleson / @example

Ripal Pael / @ripalp

Darth Vader / @darthvader


### Regex project

Use Python to read the file regex_test.txt and print the full name on each line using regular expressions and groups (print None for lines that do not have both a first and last name, or for names that aren't properly capitalized)
##### Hint: use with open() and readlines()

In [None]:
"""
Expected Output:
Abraham Lincoln
Andrew P Garfield
Connor Milliken
Jordan Alexander Williams
None
None
"""

In [28]:
with open('files/regex_test.txt') as f:
    data = f.readlines()
    print(data)

['Abraham Lincoln\n', 'Andrew P Garfield\n', 'Connor Milliken\n', 'Jordan Alexander Williams\n', 'Madonna\n', 'programming is cool\n']


In [29]:
pattern = re.compile(r"([A-Z][a-z]+) ([\w ]*)")  # a capitalized first word, a space, then the rest of the name

    
for upNames in data:
    found = pattern.search(upNames)
    
    if found:
        print(f'{found.group(1)} {found.group(2)}')
    else:
        print("None")

    

Abraham Lincoln
Andrew P Garfield
Connor Milliken
Jordan Alexander Williams
None
None
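The pattern above only checks that the first word is capitalized. A stricter variant, sketched below, requires every word in the name to start with a capital letter. The helper name `full_name` is hypothetical, and the list of lines is copied from the `readlines()` output shown earlier so the example runs without the file.

```python
import re

lines = ['Abraham Lincoln\n', 'Andrew P Garfield\n', 'Connor Milliken\n',
         'Jordan Alexander Williams\n', 'Madonna\n', 'programming is cool\n']

# A capitalized word, followed by one or more further capitalized words.
# [a-z]* (rather than +) allows single-letter initials such as "P".
name_pattern = re.compile(r'[A-Z][a-z]*(?: [A-Z][a-z]*)+')

def full_name(line):
    """Return the name if every word is capitalized and there are at least two words, else None."""
    match = name_pattern.fullmatch(line.strip())
    return match.group(0) if match else None

results = [full_name(line) for line in lines]
for result in results:
    print(result)
```

With this version a line such as `Abraham lincoln` would also produce None, because `fullmatch` requires the entire stripped line to fit the pattern.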
