# Regular Expressions

# Tasks today:
1) <b>Importing</b> <br>
2) <b>Using Regular Expressions</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) re.compile() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.findall() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
3) <b>Sets</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Integer Ranges <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Character Ranges <br>
4) <b>Counting Occurrences</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) {x} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) {x,y} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) ? <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) * <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) + <br>
5) <b>In-Class Exercise #1</b> <br>
6) <b>Escaping Characters</b> <br>
7) <b>Grouping</b> <br>
8) <b>In-Class Exercise #2</b> <br>
9) <b>Opening a File</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) with open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Store the String in a Variable <br>
10) <b>Regex Project</b> <br>

### Importing <br>
<p>Regular expressions are supported in most programming languages. In Python they are provided by the built-in module 're', which has to be imported before use.</p>

In [1]:
# import re
import re

### Using Regular Expressions <br>
<p>Regular expressions give us the ability to search for patterns within text, strings, files, etc. They have many uses, such as security measures, searching, filtering, and pattern recognition.</p>

#### RegEx Cheatsheet

In [2]:
########################
# DO NOT RUN THIS CELL #
########################

a, X, 9, < -- ordinary characters just match themselves exactly.
. (a period) -- matches any single character except newline '\n'
\w -- matches a "word" character: a letter or digit or underscore [a-zA-Z0-9_].
\W -- matches any non-word character.
\b -- matches word boundary (in between a word character and a non word character)
\s -- matches a single whitespace character -- space, newline, return, tab
\S -- matches any non-whitespace character.
\t, \n, \r -- tab, newline, return
\d -- matches any numeric digit [0-9]
\D -- matches any non-numeric character.
^ -- matches the beginning of the string; as the first character inside [] it negates the set (e.g. [^2] matches anything except 2)
$ -- matches the end of the string
\ -- escapes special character.
(x|y|z) matches exactly one of x, y or z.
(x) in general is a remembered group. We can get the value of what matched by using the groups() method of the object returned by re.search.
x? matches an optional x character (in other words, it matches an x zero or one times).
x* matches x zero or more times.
x+ matches x one or more times.
x{m,n} matches an x character at least m times, but not more than n times.
(?:x) matches x but does not capture it. Non-capturing group.
(?=x) matches if x comes next, without consuming it. Positive lookahead.
a(?=b) will match the "a" in "ab", but not the "a" in "ac"
In other words, a(?=b) matches the "a" which is followed by the string 'b', without consuming what follows the a.
(?!x) matches if x does not come next. Negative lookahead.
a(?!b) will match the "a" in "ac", but not the "a" in "ab"
(?<=x) matches if x comes immediately before. Positive lookbehind.
(?<!x) matches if x does not come immediately before. Negative lookbehind.
[abc] matches any single character in the set; ranges such as [a-z] or [0-9] are allowed.

########################
# DO NOT RUN THIS CELL #
########################

SyntaxError: invalid syntax (<ipython-input-2-1d535c42b0ad>, line 5)
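
The lookaround and non-capturing entries in the cheatsheet are the least intuitive ones, so here is a small runnable sketch of each. The sample strings are made up for illustration and do not come from the lesson data.

```python
import re

# Positive lookahead: "a" only when the next character is "b"
ahead = re.findall(r"a(?=b)", "ab ac ab")
print(ahead)          # ['a', 'a']

# Negative lookahead: "a" only when the next character is NOT "b"
not_ahead = re.findall(r"a(?!b)", "ab ac ab")
print(not_ahead)      # ['a']

# Positive lookbehind: digits that come right after a dollar sign
behind = re.findall(r"(?<=\$)\d+", "cost $40, weight 12")
print(behind)         # ['40']

# Negative lookbehind: digits not preceded by a dollar sign or another digit
not_behind = re.findall(r"(?<![\$\d])\d+", "cost $40, weight 12")
print(not_behind)     # ['12']

# Non-capturing group: (?:Mr|Ms) groups the alternatives without capturing them,
# so findall only returns the one real capture group
titles = re.findall(r"(?:Mr|Ms)\. (\w+)", "Mr. Smith and Ms. Jones")
print(titles)         # ['Smith', 'Jones']
```

In every lookaround case the text inside the parentheses is checked but never becomes part of the match.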

##### re.compile()

In [4]:
# re.compile() builds a reusable pattern object, so the pattern string only has to be written once

pattern = re.compile('abcd') # --> matches the literal characters "abcd"
pattern

re.compile(r'abcd', re.UNICODE)
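
As a small aside, the sketch below shows that a compiled pattern behaves the same as the module-level functions, and that flags can be supplied at compile time. The variable name `pattern_ci` and the test string are invented for this example.

```python
import re

pattern = re.compile(r"abcd")

# The compiled pattern and the module-level function give the same result
same = pattern.findall("abcd ABCD") == re.findall(r"abcd", "abcd ABCD")
print(same)  # True

# Flags are passed as the second argument, e.g. case-insensitive matching
pattern_ci = re.compile(r"abcd", re.IGNORECASE)
found_ci = pattern_ci.findall("abcd ABCD AbCd")
print(found_ci)  # ['abcd', 'ABCD', 'AbCd']
```

Compiling is mostly a convenience when the same pattern is reused many times, as it is throughout this notebook.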

##### re.match()

In [5]:
match = pattern.match("abcd123")
print(match)

# Accessing the span of the match
print(match.span())
"abcd123"[0:]

match.span() # returns the (start, end) indexes of the match

<re.Match object; span=(0, 4), match='abcd'>
(0, 4)


(0, 4)
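
One detail worth seeing in code: `match()` only looks at the start of the string and returns `None` when nothing is found there, so calling `.span()` on the result without a check can raise an `AttributeError`. A minimal sketch, using made-up test strings:

```python
import re

pattern = re.compile(r"abcd")

at_start = pattern.match("abcd123")      # "abcd" is at index 0, so this matches
not_at_start = pattern.match("123abcd")  # "abcd" is not at the start
whole = pattern.fullmatch("abcd123")     # fullmatch needs the entire string to match

print(at_start.span())   # (0, 4)
print(not_at_start)      # None
print(whole)             # None

# Guard against None before using the match object
if not_at_start:
    print(not_at_start.span())
else:
    print("no match at the start of the string")
```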

##### re.findall()

In [6]:
finders = pattern.findall("123abcd abcd123 ab abc 123abcd RegEx is fun a")
print(finders)

['abcd', 'abcd', 'abcd']
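
`findall()` returns only the matched text. When the positions matter as well, `finditer()` yields a match object for each hit. A short sketch reusing the same string as above:

```python
import re

pattern = re.compile(r"abcd")
text = "123abcd abcd123 ab abc 123abcd RegEx is fun a"

# Each item from finditer() is a match object with its own span
spans = [m.span() for m in pattern.finditer(text)]
print(spans)  # [(3, 7), (8, 12), (26, 30)]
```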


##### re.search()

In [7]:
random_string = "Coding is fun!!! 123 123 234 123abcd abcd abc"

searching = pattern.search(random_string)
print(searching)
span = searching.span() 
print(span)
print(random_string[span[0]: span[1]])
print(span[0])
print(span[1])
# search is used to find a pattern match anywhere in the string (will pick up the first instance)


<re.Match object; span=(32, 36), match='abcd'>
(32, 36)
abcd
32
36


### Sets <br>
<p>The following cells will allow you to use regular expressions to search for certain values within a range such as numbers 1 through 4.</p>

##### [a-z] or [A-Z] - any lowercase/uppercase letters from a to z<br/>[^2] - anything that's not 2

##### Integer Ranges

In [8]:
pattern_int = re.compile("[0-7][7-9][0-3]") 
# each set matches one character of the string --> ("[first digit between 0-7][second digit between 7-9][third digit between 0-3]")
random_numbers = pattern_int.search("67383")
span = random_numbers.span()
print(random_numbers)
print(random_numbers[0])
print("67383"[span[0]: span[1]])



random_numbers_2 = pattern_int.search("18345678")
span = random_numbers_2.span()
print(random_numbers_2)
print(random_numbers_2[0])
print("18345678"[span[0]: span[1]])

<re.Match object; span=(0, 3), match='673'>
673
673
<re.Match object; span=(0, 3), match='183'>
183
183


##### Character Ranges

In [9]:
char_pattern = re.compile("[A-Z][a-z]")

# look for pattern within a letter --> ("[Uppercase letter][lowercase letter]")

found = char_pattern.findall("Hello There Mr. Anderson") # Because this is findall, we get a list returned
print(found)

['He', 'Th', 'Mr', 'An']
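
The heading of this section also mentions the negated set `[^2]`, which the cells above do not exercise. Here is a small sketch of it, together with several ranges combined in one set. The input strings are invented for the example.

```python
import re

# [^2] matches any single character that is not "2" (spaces included)
not_two = re.findall(r"[^2]", "12 23")
print(not_two)   # ['1', ' ', '3']

# Several ranges can share one set: letters of either case plus digits
alnum = re.findall(r"[A-Za-z0-9]+", "Room 42-B")
print(alnum)     # ['Room', '42', 'B']
```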


### Counting Occurrences

##### {x} - something that occurs {num_of_times}

In [10]:
char_pattern_count = re.compile('[A-Z][a-z][0-3]{2}') 

# the {2} means it should occur twice for what is written before it  --> [0-3]{2} means [0-3][0-3]

found_count = char_pattern_count.findall("Hello Mr. An17derson") # will return [] because 7 is not within the range
print(found_count)

found_count_2 = char_pattern_count.findall("Hello Mr. An23derson") # will return ['An23'] because 2 and 3 are both within the range
print(found_count_2)

[]
['An23']


##### {x,y} - something that occurs between x and y times

In [11]:
random_pattern = re.compile("m{1,5}") # m{1,5} means looking for m that occurs 1 to 5 times
random_statement = random_pattern.findall('This is an example of a regular expression trying to find one m, more than one mmm or five mmmmms ')

print(random_statement)

['m', 'm', 'm', 'mmm', 'mmmmm']
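
A few variations on the brace quantifier, sketched with a made-up string. The last case is a common gotcha: Python's `re` module does not accept a space inside the braces, and silently treats the whole thing as literal text instead.

```python
import re

text = "m mm mmm mmmm"

# {2,} leaves the upper bound open: two or more
two_or_more = re.findall(r"m{2,}", text)
print(two_or_more)     # ['mm', 'mmm', 'mmmm']

# {1,2} between word boundaries: whole runs of one or two m's only
one_or_two = re.findall(r"\bm{1,2}\b", text)
print(one_or_two)      # ['m', 'mm']

# With a space after the comma this is no longer a quantifier;
# the pattern now looks for the literal characters "m{1, 5}"
literal_braces = re.findall(r"m{1, 5}", "mmmmm m{1, 5}")
print(literal_braces)  # ['m{1, 5}']
```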


##### ? - something that occurs 0 or 1 time

In [12]:
pattern = re.compile("Mrss?")

found_pattern = pattern.findall("Hello there Mr. Anderson, how is Mrs. Anderson, and Mrss. Anderson?")
print(found_pattern)

['Mrs', 'Mrss']


##### * - something that occurs 0 or more times

In [13]:
pattern_m = re.compile("M*s")

found_m = pattern_m.findall("MMMs name is Ms. Smith. This is Msssss")
print(found_m)

['MMMs', 's', 'Ms', 's', 's', 'Ms', 's', 's', 's', 's']


##### + - something that occurs at least once

In [14]:
pattern = re.compile("M+s") # means capital M must occur one or more times, followed by one lowercase s

found_patt = pattern.findall("MMMs name is Ms. Smith Whis is MMMsssssssssss")

print(found_patt)

['MMMs', 'Ms', 'MMMs']


##### In-class exercise 1: 

Use a regular expression to find every number in the given string

In [15]:
# Output = ['10909090', '1', '2']

my_string = re.compile("[0-9]{1,8}")
found_p = my_string.findall("This string has 10909090 numbers, but it is only 1 string. I hope you solve this 2day.")
print(found_p)


# OR 
my_string = re.compile("[0-9]+")
found_p = my_string.findall("This string has 10909090 numbers, but it is only 1 string. I hope you solve this 2day.")
print(found_p)

['10909090', '1', '2']
['10909090', '1', '2']


### Escaping Characters

##### \w - look for any word character (letters, digits, and underscore, including Unicode letters)<br/>\W - look for anything that isn't a word character

[History on Unicode](http://unicode.org/standard/WhatIsUnicode.html)

[More on Unicode Characters](https://en.wikipedia.org/wiki/List_of_Unicode_characters)

In [16]:
pattern_1 = re.compile(r'\w+') # raw strings keep Python from treating the backslash as a string escape
pattern_2 = re.compile(r'\W+')

found_1 = pattern_1.findall("This is a sentence. With an, exclamation mark at the end!")
found_2 = pattern_2.findall("This is a sentence. With an, exclamation mark at the end!")

print(found_1)
print(found_2)

['This', 'is', 'a', 'sentence', 'With', 'an', 'exclamation', 'mark', 'at', 'the', 'end']
[' ', ' ', ' ', '. ', ' ', ', ', ' ', ' ', ' ', ' ', '!']
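
To make the meaning of `\w` concrete, the sketch below shows that it covers digits and the underscore as well as letters, and that for `str` patterns it also matches non-ASCII letters unless the `re.ASCII` flag is given. The sample text is invented.

```python
import re

# \w covers letters, digits, and underscore
words = re.findall(r"\w+", "snake_case var2 café")
print(words)        # ['snake_case', 'var2', 'café']

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the "é" is left out
ascii_words = re.findall(r"\w+", "café", re.ASCII)
print(ascii_words)  # ['caf']
```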


##### \d - look for any digit 0-9<br/>\D - look for anything that isn't a digit

In [17]:
pattern_nums = re.compile(r"\d{1,2}[a-z]{2}")
pattern_nums2 = re.compile(r"\D{1,2}[a-z]{2}")

found_date = pattern_nums.findall("Today is the 19th, tomorrow is the 20th. My birthday is the 3rd")
found_date2 = pattern_nums2.findall("Today is the 19th, tomorrow is the 20th. My birthday is the 3rd")

print(found_date)
print(found_date2)


pattern_no_num = re.compile(r"\D+")
found_no_num = pattern_no_num.findall("Eric...:4569")
print(found_no_num)

['19th', '20th', '3rd']
['Toda', 'y is', ' the', ', to', 'morr', 'w is', ' the', 'y bi', 'rthd', 'y is', ' the']
['Eric...:']


##### \s - look for any whitespace<br/>\S - look for anything that isn't whitespace

In [18]:
pattern_no_space = re.compile(r"\S[a-z]+")
pattern_space = re.compile(r"\s+")

found_space = pattern_space.findall("Are you afraid of the dark?")
print(found_space) # prints a list of 5 single spaces

found_no_space = pattern_no_space.findall("Are you :afraid of the dark?")
print(found_no_space)

[' ', ' ', ' ', ' ', ' ']
['Are', 'you', ':afraid', 'of', 'the', 'dark']


##### \b - look for boundaries or edges of a word<br/>\B - look for anything that isn't a boundary

In [19]:
pattern_bound = re.compile(r'\bTheCodingTemple\b') # r is important so python doesn't misinterpret the string
pattern_bound_none = re.compile(r'\BTheCodingTemple\B')


found_bound = pattern_bound.findall("         TheCodingTemple    ")
print(found_bound)

no_found_bound = pattern_bound_none.findall("TheCodingTemple")
print(no_found_bound)

th = re.compile(r'\bthe')
new_string = "three then there thirty"

th.findall(new_string)

['TheCodingTemple']
[]


['the', 'the']
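
The comment about the `r` prefix is easiest to understand by comparing the two forms directly. In a normal Python string `"\b"` is the single backspace character, while in a raw string it stays as backslash plus `b`, which is what the regex engine needs for a word boundary. A minimal sketch:

```python
import re

# One character (backspace) versus two characters (backslash and b)
lengths = (len("\b"), len(r"\b"))
print(lengths)        # (1, 2)

# Without r, the pattern looks for a backspace followed by "the"
without_raw = re.findall("\bthe", "then there")
print(without_raw)    # []

# With r, \b reaches the regex engine as a word boundary
with_raw = re.findall(r"\bthe", "then there")
print(with_raw)       # ['the', 'the']
```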

### Grouping

In [42]:
# print("Hey my name is \bEric") # b is backspace
# Cannot do findall with a list

my_string_again = "Max Smith, aaron rodgers, Sam Darnold, LeBron James, Micheal Jordan, Kevin Durant, Patrick McCormick"

# Compiled pattern that uses 2 separate groups for first/last names

pattern_name = re.compile("([A-Z][A-Za-z]+) ([A-Z][A-Za-z]+)") # [A-Za-z] --> take into account upper and lowercase letters
found_names = pattern_name.findall(my_string_again)
print(found_names)

for name in found_names:
    print(f'First name: {name[0]} \nLast Name: {name[1]}')
    
    
# # Splitting and using .search() syntax to get a matched object
# for name in my_string_again.split(", "):
#     match = pattern_name.search(name)
    
#     if match:
#         print(match.groups(1))
#     else:
#         print("not a name")

[('Max', 'Smith'), ('Sam', 'Darnold'), ('LeBron', 'James'), ('Micheal', 'Jordan'), ('Kevin', 'Durant'), ('Patrick', 'McCormick')]
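
Groups can also be read from a single match object instead of the tuples that `findall()` returns. The sketch below uses the same name pattern, but with named groups; the group names `first` and `last` are chosen here for illustration and are not part of the original lesson.

```python
import re

# (?P<name>...) gives a group a name in addition to its number
pattern_name = re.compile(r"(?P<first>[A-Z][A-Za-z]+) (?P<last>[A-Z][A-Za-z]+)")

# search() skips "aaron rodgers" because it is not capitalized
match = pattern_name.search("aaron rodgers, Sam Darnold")

print(match.groups())        # ('Sam', 'Darnold')
print(match.group(1))        # Sam
print(match.group("last"))   # Darnold
print(match.groupdict())     # {'first': 'Sam', 'last': 'Darnold'}
```

Numbered access still works on named groups, so existing code that uses `group(1)` and `group(2)` does not need to change.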


##### In-class Exercise 2:

Write a function using regular expressions to find the domain name in the given email addresses (and return None for the invalid email addresses)<br><b>HINT: Use '|' for either or</b>

In [38]:
my_emails = ["jordanw@codingtemple.orgcom", "pocohontas1776@gmail.com", "helloworld@aol..com",
             "yourfavoriteband@g6.org", "@codingtemple.com"]

# You can also use the $ at the end of your compile expression -- it anchors the match to the end of the string

#.com OR .org => com|org

#Expected output:
#None
#pocohontas1776@gmail.com
#None
#yourfavoriteband@g6.org
#None

def validateEmail(email_list):
    email_search = re.compile(r'(\w+)@(\w+)\.(com|org)$')  # \. is a literal dot; $ anchors the end of the string
    for email in email_list:
        if email_search.match(email):
            print(email)
        else:
            print("None")
            
validateEmail(my_emails)

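# The "$" anchors the pattern to the end of the string, so nothing may follow "com" or "org"


# Grouping example: each set of parentheses can be read back from the match by number
found = re.search(r'(\w+)@(\w+)\.(com|org)$', my_emails[1])
print(f'Username is {found[1]} / domain is {found[2]}')

None
pocohontas1776@gmail.com
None
yourfavoriteband@g6.org
None
Username is pocohontas1776 / domain is gmail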
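
The exercise asks for the domain name rather than the whole address, so here is one possible sketch that returns the domain (or `None`) instead of printing the email. The function name `get_domain` and the constant `EMAIL_PATTERN` are hypothetical names for this example, and the pattern is deliberately as narrow as the exercise: word characters only, ending in `.com` or `.org`.

```python
import re

# Escaped dot and end-of-string anchor; groups 2 and 3 hold the domain parts
EMAIL_PATTERN = re.compile(r"(\w+)@(\w+)\.(com|org)$")

def get_domain(email):
    """Return 'domain.tld' for a valid address, or None otherwise."""
    match = EMAIL_PATTERN.match(email)
    if match is None:
        return None
    return f"{match.group(2)}.{match.group(3)}"

my_emails = ["jordanw@codingtemple.orgcom", "pocohontas1776@gmail.com", "helloworld@aol..com",
             "yourfavoriteband@g6.org", "@codingtemple.com"]

domains = [get_domain(email) for email in my_emails]
print(domains)  # [None, 'gmail.com', None, 'g6.org', None]
```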


### Opening a File <br>
<p>Python gives us a couple of ways to open and read files. Below are the two used most often.</p>

##### open()

In [28]:
f = open("files/names.txt")

data = f.read()

f.close()
print(data)

Hawkins, Derek	derek@codingtemple.com	(555) 555-5555	Teacher, Coding Temple	@derekhawkins
Zhai, Mo	mozhai@codingtemple.com	(555) 555-5554	Teacher, Coding Temple
Johnson, Joe	joejohnson@codingtemple.com		Johson, Joe
Osterberg, Sven-Erik	governor@norrbotten.co.se		Governor, Norrbotten	@sverik
, Tim	tim@killerrabbit.com		Enchanter, Killer Rabbit Cave
Butz, Ryan	ryanb@codingtemple.com	(555) 555-5543	CEO, Coding Temple	@ryanbutz
Doctor, The	doctor+companion@tardis.co.uk		Time Lord, Gallifrey
Exampleson, Example	me@example.com	555-555-5552	Example, Example Co.	@example
Pael, Ripal	ripalp@codingtemple.com	(555) 555-5553	Teacher, Coding Temple	@ripalp
Vader, Darth	darth-vader@empire.gov	(555) 555-4444	Sith Lord, Galactic Empire	@darthvader
Fernandez de la Vega Sanz, Maria Teresa	mtfvs@spain.gov		First Deputy Prime Minister, Spanish Gov



##### with open()

In [30]:
with open('files/names.txt') as f:
    data = f.read()
print(data)

Hawkins, Derek	derek@codingtemple.com	(555) 555-5555	Teacher, Coding Temple	@derekhawkins
Zhai, Mo	mozhai@codingtemple.com	(555) 555-5554	Teacher, Coding Temple
Johnson, Joe	joejohnson@codingtemple.com		Johson, Joe
Osterberg, Sven-Erik	governor@norrbotten.co.se		Governor, Norrbotten	@sverik
, Tim	tim@killerrabbit.com		Enchanter, Killer Rabbit Cave
Butz, Ryan	ryanb@codingtemple.com	(555) 555-5543	CEO, Coding Temple	@ryanbutz
Doctor, The	doctor+companion@tardis.co.uk		Time Lord, Gallifrey
Exampleson, Example	me@example.com	555-555-5552	Example, Example Co.	@example
Pael, Ripal	ripalp@codingtemple.com	(555) 555-5553	Teacher, Coding Temple	@ripalp
Vader, Darth	darth-vader@empire.gov	(555) 555-4444	Sith Lord, Galactic Empire	@darthvader
Fernandez de la Vega Sanz, Maria Teresa	mtfvs@spain.gov		First Deputy Prime Minister, Spanish Gov



##### re.match()

In [31]:
print(re.match("Hawkins, Derek", data))

# remember to pass in the string. In this case, our string is data

<re.Match object; span=(0, 14), match='Hawkins, Derek'>


##### re.search()

In [32]:
print(re.search("ripalp@codingtemple.com", data))

<re.Match object; span=(582, 605), match='ripalp@codingtemple.com'>


##### Store the String to a Variable

In [37]:
answer = input("What do you want to look for? ")

found = re.findall(answer, data)
# Basically looking for the "answer" from the data string

if found:
    print(f'here is your answer...{found}')
else:
    print("Sorry -- that is not in this dataset")

What do you want to look for? Teacher
here is your answer...['Teacher', 'Teacher', 'Teacher']
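
Because the user's input is handed straight to `re.findall()`, any regex metacharacter in it is interpreted as pattern syntax. If the goal is a plain text search, `re.escape()` can neutralize those characters first. A minimal sketch, using one line from names.txt as the data and a hard-coded `answer` in place of `input()`:

```python
import re

data = "Doctor, The\tdoctor+companion@tardis.co.uk"
answer = "doctor+companion"   # stands in for what the user typed

# Unescaped, "+" means "one or more r", so the literal "+" in the data never matches
unescaped = re.findall(answer, data)
print(unescaped)   # []

# re.escape() turns the input into a pattern that matches it literally
escaped = re.findall(re.escape(answer), data)
print(escaped)     # ['doctor+companion']
```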


### Homework Exercise #3 <br>
<p>Print each person's name and Twitter handle using groups. The output should look like:</p>
<p>==============<br>
   Full Name / Twitter<br>
   ==============</p>
Derek Hawkins / @derekhawkins

 Erik Sven-Osterberg / @sverik

 Ryan Butz / @ryanbutz

 Example Exampleson / @example

 Ripal Pael / @ripalp

 Darth Vader / @darthvader

In [435]:
# Group them all and store them in a variable, since the first name needs to come first
# instead of read, you can use f.readlines() to read the lines individually

print("===================")
print("Full Name / Twitter")
print("===================")

# lines = f.readlines()
# print(lines) --> won't work because the data is being read from a closed file

# Use: --> turned everything into a list format
with open ('files/names.txt') as f:
    data = f.readlines()
#     print(data[0]) # --> prints out index 0 of the list
    
# names = re.compile("([A-Z][a-z]+, [A-Z][a-z]+)")
# found_full_names = names.findall(data)
# print(found_full_names) --> won't work because this method is for a string, not a list

# print(re.compile("([A-z][a-z]+)")) --> won't work because it's still testing a string not the list index

names_search = re.compile(r"([A-Z][a-z]+), ([\w -]*)([A-Z][a-z]+).*\s(@[a-zA-Z0-9]+$)") # --> the pattern is case sensitive, and a literal space in it matches a space
# ([A-Z][a-z]+) --> group 1 is the last name, group 3 is the final part of the first name
# ([\w -]*) --> group 2 is an optional first-name prefix made of word characters, spaces, or hyphens, e.g. "Sven-" in "Sven-Erik"
# .*\s(@[a-zA-Z0-9]+$) --> skips the rest of the line, then captures the twitter handle in group 4
# There are 4 capture groups in total here
# $ at the end asserts the end of the string (just before a trailing newline)
# * matches the previous token zero or more times
# . matches any character except for line terminators (\n)
# \w matches any word character, similar to [a-zA-Z0-9_]
# the - inside [\w -] matches a literal hyphen
# \s matches any whitespace character (space, tab, or \n)

# how I would've written it: names_search = re.compile(r"[A-Z][a-z]+, ([\w]*)([A-Z][a-z]+).*(@[\w]+$)")


for i in data:
    valid = names_search.search(i)
#     print(valid) # --> will successfully print the match description of what is used in the compile
    
    
    if valid:
        print(f'{valid.group(3)} {valid.group(2)}{valid.group(1)} / {valid.group(4)}')
    
        
# {valid.group(3)} --> outputs Derek
# {valid.group(2)} --> outputs for the edge case of Sven
# {valid.group(1)} --> outputs Hawkins
# {valid.group(4)} --> outputs twitter handle

Full Name / Twitter
Derek Hawkins / @derekhawkins
Erik Sven-Osterberg / @sverik
Ryan Butz / @ryanbutz
Example Exampleson / @example
Ripal Pael / @ripalp
Darth Vader / @darthvader
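
As an alternative to `readlines()` plus a loop, the whole file can be searched as one string with the `re.MULTILINE` flag, which lets `^` and `$` match at every line break. The sketch below is only an illustration: it assumes the columns are tab-separated, uses three lines copied from the names.txt output above instead of reading the file, and uses a simplified pattern (the name `handle_pattern` is made up) that does not handle hyphenated first names such as "Sven-Erik".

```python
import re

# Three sample lines from names.txt; the middle one has no twitter handle
data = (
    "Hawkins, Derek\tderek@codingtemple.com\t(555) 555-5555\tTeacher, Coding Temple\t@derekhawkins\n"
    "Zhai, Mo\tmozhai@codingtemple.com\t(555) 555-5554\tTeacher, Coding Temple\n"
    "Butz, Ryan\tryanb@codingtemple.com\t(555) 555-5543\tCEO, Coding Temple\t@ryanbutz\n"
)

# ^ and $ anchor to each line because of re.MULTILINE
handle_pattern = re.compile(r"^([A-Z][a-z]+), ([A-Z][a-z]+)\t.*\t(@\w+)$", re.MULTILINE)

results = [f"{m.group(2)} {m.group(1)} / {m.group(3)}" for m in handle_pattern.finditer(data)]
for line in results:
    print(line)
# Derek Hawkins / @derekhawkins
# Ryan Butz / @ryanbutz
```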


### Regex project

Use Python to read the file regex_test.txt and print the full name on each line using regular expressions and groups (print None for names with no first and last name, or names that aren't properly capitalized)
##### Hint: use with open() and readlines()

In [440]:
"""
Expected Output
Abraham Lincoln
Andrew P Garfield
Connor Milliken
Jordan Alexander Williams
None
None
"""

with open ('files/regex_test.txt') as f:
    data = f.readlines()
#     print(data)
    
    full_name = re.compile(r"([A-Z][a-z]+) ([\w ]*)")
    
    for i in data:
        valid = full_name.search(i)
#         print(valid) 
#         --> testing
#     found = full_name.findall(data[2])
#     print(found)
    
    
        if valid:
            print(f'{valid.group(1)} {valid.group(2)}')
        else:
            print("None")



Abraham Lincoln
Andrew P Garfield
Connor Milliken
Jordan Alexander Williams
None
None
