## Week 8: Regular Expressions ##

#### Task 1: Extract the names of each individual from the unformatted text string and store them in a vector of some sort. ####

**Unformatted Text:** 
"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

In [1]:
import re
import numpy as np

#initialize data
text_data = '''555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 
             8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert'''
text_data #prints initial unformatted text data

'555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 \n             8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert'

**Approach:** Split the text on the character class `[0-9-()\n]`, which matches any digit (0-9), a hyphen, a parenthesis, or a newline. This removes the phone-number characters and leaves the names of the Springfield residents, along with some empty or whitespace-only fragments. The filter in the list comprehension then keeps only the fragments that `re.match` accepts against `(\w+.).(\w+.)`. Because the dots are unescaped, each one matches any character, so a fragment passes only if it starts with a word character and contains a second run of word characters further on; empty and whitespace-only fragments are dropped. The result is shown below:

In [2]:
#splits the unformatted text on phone-number characters, then keeps only the fragments that match the name pattern
extract = [x for x in re.split(r'[0-9-()\n]',text_data) if re.match(r'(\w+.).(\w+.)',x)]
extract

['Moe Szyslak',
 'Burns, C. Montgomery',
 'Rev. Timothy Lovejoy',
 'Ned Flanders',
 'Simpson, Homer',
 'Dr. Julius Hibbert']
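As an alternative to splitting and filtering, the names can also be pulled out directly with `re.findall`. The following is only a sketch: it assumes a name is a run of letters, dots, commas and spaces that begins and ends with a letter, and the variable name `names_found` is made up for this illustration.

```python
import re

# Same unformatted text as above, written on one logical line
text_data = ("555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542"
             "Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer"
             "5553642Dr. Julius Hibbert")

# Assumption: a name starts with a letter, ends with a letter, and contains
# only letters, dots, commas and spaces in between
names_found = re.findall(r"[A-Za-z][A-Za-z., ]*[A-Za-z]", text_data)
print(names_found)
```

Because the pattern describes what a name looks like rather than what separates names, no empty fragments are produced and no second filtering step is needed.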

#### Task 2: Using your new vector containing only the names of the six individuals, complete the following tasks. ####

***A. Use your regex skills to rearrange the vector so that all elements conform to the standard "firstname lastname", preserving any titles (e.g. "Rev.", "Dr.", etc.) or middle/second names.***

**Approach:** Loop through the list `extract` created above and search each name for a comma followed by a space (`, `). A match means the name is written as "lastname, firstname" rather than the standard "firstname lastname". In that case the name is split on the same pattern, which yields a two-element list; the elements are joined in reverse order and appended to the new list. Names without a comma are appended unchanged. The result is shown below.

In [3]:
names = [] #stores the names in the standard "firstname lastname" order
pattern = re.compile(', ') #a comma followed by a space separates "lastname, firstname"
for n in extract: #loops through each element of the extract list
    if pattern.search(n): #a match means the name is written "lastname, firstname"
        full = pattern.split(n) #splits the name at the comma
        names.append(full[1] + ' ' + full[0]) #reverses the order to the standard "firstname lastname"
    else:
        names.append(n) #the name is already in the standard order
names

['Moe Szyslak',
 'C. Montgomery Burns',
 'Rev. Timothy Lovejoy',
 'Ned Flanders',
 'Homer Simpson',
 'Dr. Julius Hibbert']
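The same reordering can be expressed as a single substitution with capture groups. This is a sketch rather than a replacement for the loop above; the names `swap` and `names_sub` are hypothetical, and the pattern assumes each name contains at most one comma.

```python
import re

extract = ['Moe Szyslak', 'Burns, C. Montgomery', 'Rev. Timothy Lovejoy',
           'Ned Flanders', 'Simpson, Homer', 'Dr. Julius Hibbert']

# Group 1 captures everything before the comma (the last name),
# group 2 captures everything after it (first and second names)
swap = re.compile(r"^([^,]+),\s*(.+)$")

# Names without a comma do not match, so sub() returns them unchanged
names_sub = [swap.sub(r"\2 \1", n) for n in extract]
print(names_sub)
```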

***B. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).***

**Approach:** The pattern `[A-Z][a-z]{1,}[.]` matches a capital letter, followed by one or more lowercase letters, followed by a literal '.'. Since `re.match` only matches at the start of the string, a match means the name begins with an abbreviated title such as "Rev." or "Dr.". `re.match` returns a match object on success and `None` otherwise. The results are stored in a numpy array and compared against `None`, which gives False where there is no match and True where there is one.

In [4]:
#searches for names with title and returns an array of the elements that have a match
title = np.array([re.match('[A-Z][a-z]{1,}[.]',t) for t in names]) 
print(title)

[None None <re.Match object; span=(0, 4), match='Rev.'> None None
 <re.Match object; span=(0, 3), match='Dr.'>]


In [5]:
print(title!=None) #returns True if there is a match otherwise False

[False False  True False False  True]
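If numpy is not needed, the logical vector can also be built as a plain Python list by testing each result with `is not None`. A minimal sketch, with `title_pattern` and `has_title` as illustrative names:

```python
import re

names = ['Moe Szyslak', 'C. Montgomery Burns', 'Rev. Timothy Lovejoy',
         'Ned Flanders', 'Homer Simpson', 'Dr. Julius Hibbert']

# Capital letter, one or more lowercase letters, then a literal dot
title_pattern = re.compile(r"[A-Z][a-z]+\.")

# match() anchors at the start of the string, so only leading titles count
has_title = [title_pattern.match(n) is not None for n in names]
print(has_title)
```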


***C. Construct a logical vector indicating whether a character has a middle/second name.***

**Approach:** The pattern `[A-Z][.]` matches a capital letter followed by a literal '.', that is, an initial. Because `re.match` only matches at the start of the string, this finds names that begin with an initial, which tells us the person also goes by a second name (here "C. Montgomery Burns"). An initial appearing later in a name would not be detected by this approach. As before, the results are stored in a numpy array and compared against `None`, giving False where there is no match and True where there is one.

In [6]:
#searches for names with middle/second name and returns an array of the elements that have a match
sec_name = np.array([re.match('[A-Z][.]',t) for t in names ])
print(sec_name)

[None <re.Match object; span=(0, 2), match='C.'> None None None None]


In [7]:
print(sec_name!=None) #returns True if there is a match otherwise False

[False  True False False False False]
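Since `re.match` only looks at the start of the string, an initial that follows a title would be missed. One possible variation, sketched here under the assumption that a single capital letter followed by a dot always marks an initial, uses `re.search` with a word boundary. The name "Dr. John Q. Public" is a made-up example and is not part of the data set.

```python
import re

names = ['Moe Szyslak', 'C. Montgomery Burns', 'Rev. Timothy Lovejoy',
         'Ned Flanders', 'Homer Simpson', 'Dr. Julius Hibbert']

# \b ensures the capital letter starts a word, so "Rev." and "Dr." do not
# count: their capital letter is followed by a lowercase letter, not a dot
initial_pattern = re.compile(r"\b[A-Z]\.")

has_second = [initial_pattern.search(n) is not None for n in names]
print(has_second)

# Hypothetical name with the initial in the middle of the string
hypothetical = 'Dr. John Q. Public'
found_by_search = initial_pattern.search(hypothetical) is not None
found_by_match = re.match(r"[A-Z][.]", hypothetical) is not None
print(found_by_search, found_by_match)
```

For the six names in this exercise both approaches give the same logical vector; they differ only for names like the hypothetical one.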


#### Task 3: ####

**Question:** Consider the HTML string `<title>+++BREAKING NEWS+++<title>`. We would like to extract the first HTML tag (i.e., `<title>`). To do so, we write the regular expression `<.+>`. Explain why this fails and correct the expression.

In [8]:
html_text = '<title>+++BREAKING NEWS+++<title>' 
pattern_wrong = '<.+>' #incorrect pattern
html_wrong = re.match(pattern_wrong, html_text).group()
html_wrong

'<title>+++BREAKING NEWS+++<title>'

**Explanation:** The pattern `<.+>` fails because, as seen above, it extracts the entire string. The `+` quantifier is greedy: it matches as much text as possible, so `.+` runs to the end of the string and the pattern ends at the last `>`, which closes the final `<title>` tag. To fix this, add a **?** after the `+`, which makes the quantifier lazy so that it stops at the first `>`. The corrected pattern is `<.+?>`. The result of applying it is shown below.

In [9]:
pattern_h = '<.+?>' #correct pattern
html_right = re.match(pattern_h, html_text).group()
html_right

'<title>'
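The lazy quantifier is one way to stop at the first `>`. Another common option is a negated character class, `[^>]+`, which cannot run past a closing bracket at all. The comparison below is a small sketch; the variable names are chosen for illustration only.

```python
import re

html_text = '<title>+++BREAKING NEWS+++<title>'

# Greedy: .+ consumes as much as possible, so the match ends at the last '>'
greedy = re.match(r"<.+>", html_text).group()

# Lazy: .+? consumes as little as possible, so the match ends at the first '>'
lazy = re.match(r"<.+?>", html_text).group()

# Negated class: [^>]+ matches any character except '>', so it stops there
negated = re.match(r"<[^>]+>", html_text).group()

print(greedy)
print(lazy)
print(negated)
```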

#### Task 4: ####

**Question:** The string `(5-3)^2=5^2-2*5*3+3^2` conforms to the binomial theorem. We would like to extract the formula in the string. To do so, we write the regular expression `[^0-9=+*()]+`. Explain why this fails and correct the expression.

In [10]:
theorem = '(5-3)^2=5^2-2*5*3+3^2'
pattern_wrong = '[^0-9=+*()]+' #incorrect pattern
theorem_wrong = str(re.findall(pattern_wrong,theorem)).strip("['']")
theorem_wrong 

"-', '^', '^', '-', '^"

**Explanation:** The pattern `[^0-9=+*()]+` fails because, as seen above with `findall`, it extracts only the '-' and '^' characters. A '^' placed at the start of a character class negates it, so the class matches every character that is not listed. Since '-' and '^' are not in the set, they are matched and returned, while the digits (0-9) and the characters `=`, `+`, `*`, `(` and `)` are skipped. To fix this, remove the leading caret so that the class is no longer negated, and add '^' and '-' to the set so that the entire formula can be matched. The '^' must not come first (where it would negate the class), and the '-' is placed last so that it is read as a literal hyphen rather than as a range. The corrected pattern is `[0-9=+*()^-]+`. The result of applying it with the same `findall` function is shown below.

In [11]:
pattern = '[0-9=+*()^-]+' #correct pattern
theorem_text = str(re.findall(pattern,theorem)).strip("['']")
theorem_text 

'(5-3)^2=5^2-2*5*3+3^2'
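The position of `^` and `-` inside a character class changes their meaning, which is easy to overlook. The sketch below illustrates three cases: a leading `^` negates the class, a `-` between two characters forms a range, and a `-` at the end is a literal hyphen. The pattern `[0-9=+*()^-]+` and the test string "ABC" are illustrative choices, not part of the exercise.

```python
import re

theorem = '(5-3)^2=5^2-2*5*3+3^2'

# Leading ^ negates the class: only characters NOT listed are matched
negated = re.findall(r"[^0-9=+*()]+", theorem)

# ^ not in first position and - in last position are both literal characters
literal = re.findall(r"[0-9=+*()^-]+", theorem)

# A - between two characters is a range: ')' to '^' spans digits, operators
# and also the uppercase letters, which is broader than the formula needs
range_match = re.findall(r"[)-^]+", "ABC")
literal_on_letters = re.findall(r"[0-9=+*()^-]+", "ABC")

# fullmatch() confirms that the whole string consists of allowed characters
is_formula = re.fullmatch(r"[0-9=+*()^-]+", theorem) is not None

print(negated)
print(literal)
print(range_match, literal_on_letters, is_formula)
```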