## Homework: Regular Expressions


### University of Virginia
### Foundations of Computer Science
### Last Updated: November 13, 2021
---

### Objectives: 
- Practice writing and testing regular expressions

### Executive Summary


There are two short text documents in this notebook. You will write regular expressions to find certain patterns.  

Note: This website is a helpful resource for writing and testing regexes: [regex101](https://regex101.com/)

### Instructions

Answer the questions, showing all code and results.  
When the file is completed, submit the notebook through Collab.

**Notes:**  
1) When instructions ask for a case-insensitive match on a word or phrase, any mix of uppercase and lowercase characters is a match.  
2) The regexes do not need to be robust in general. They only need to find all the matches in the documents. For example, when matching dollar amounts,  
   the regex does not need to guard against invalid forms such as $61,0, as they do not appear in the documents. 
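
As a minimal sketch of note 1, Python's `re` module offers two equivalent ways to request case-insensitive matching: the `re.IGNORECASE` flag and the inline `(?i)` prefix. The `sample` string is made up for illustration and is not one of the homework documents.

```python
import re

# Hypothetical sample text, not one of the homework documents.
sample = "Family plans cost more; one FAMILY paid extra, another family did not."

# Option 1: pass the re.IGNORECASE flag to the function.
with_flag = re.findall(r"family", sample, flags=re.IGNORECASE)

# Option 2: put the inline flag (?i) at the very start of the pattern.
with_inline = re.findall(r"(?i)family", sample)

print(with_flag)    # ['Family', 'FAMILY', 'family']
print(with_inline)  # ['Family', 'FAMILY', 'family']
```

Either form is fine; the inline form keeps the flag inside the pattern string, which is convenient when pasting the pattern into a tester such as regex101.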

**TOTAL POINTS: 12**

---


In [574]:
import re

#### DOCUMENTS FOR SEARCH

In [4]:
doc1 = "(CNN) This is an article about America's Workers. Getting family health insurance on the job now costs workers and their employers more than $22,000 a year, on average. And companies have not been able to do much to make coverage more affordable, even though the coronavirus pandemic has reinforced the importance of health benefits.\
Employees foot about $6,000 of the tab, while companies pick up the rest, according to the 2021 Kaiser Family Foundation Employer Health Benefits Survey. The report, released Wednesday, found that the average annual premium rose 4% this year to $22,221.\
The average annual premium for a single staffer in 2021 hit $7,739, also up 4%. Workers pay about $1,300, and employers cover the remaining tab.\
About 155 million Americans rely on employer-sponsored coverage -- and they are paying a lot more for that benefit than they were a decade ago. The average family premium has increased 47%, more than wages or inflation, which rose 31% and 19%, respectively, Kaiser found.\
The average count is 21,000."

In [5]:
doc2 = "Curry reacts in the second half against the Chicago Bulls. (CNN)It seems every week NBA superstar Steph Curry is making history.\
Earlier this week, he overtook Wilt Chamberlain to become the oldest player to record 50 points and 10 assists in a game.\
And on Friday night, the 33-year-old passed basketball great Ray Allen for the most three-pointers scored in all NBA games, including playoffs, in NBA history.\
Curry connected with nine of his 17 three-point attempts in the Golden State Warriors' 119-93 win over the Chicago Bulls, taking his tally in regular season and playoff games to 3,366, surpassing Allen's total of 3,358.\
He had come into the game just one behind two-time NBA champion Allen and equaled his record within the first few minutes of the game.\
And he became the all-time lead just minutes later, drilling a long-range effort over Alex Caruso."

---

#### 1) (1 POINT) Search *doc1* for the word 'family', print the matches, and print the number of matches.

In [575]:
q1 = r"family"
matches = re.finditer(q1, doc1)

In [576]:
for match in enumerate(matches):
    print(match)

(0, <re.Match object; span=(58, 64), match='family'>)
(1, <re.Match object; span=(886, 892), match='family'>)


In [577]:
print("The number of matches is", len(list(re.finditer(q1, doc1))))

The number of matches is 2


#### 2) (2 POINTS) Search *doc1* for the first occurrence of the word "workers" (case insensitive).  
####    If it finds a match, use the start() and end() methods to extract the match from the document, printing the result.

In [578]:
q2 = r"(?i)workers"
matches = re.finditer(q2, doc1)
for match in enumerate(matches):
    print(match)

(0, <re.Match object; span=(41, 48), match='Workers'>)
(1, <re.Match object; span=(103, 110), match='workers'>)
(2, <re.Match object; span=(666, 673), match='Workers'>)


In [579]:
first_match = re.search(q2, doc1)

# re.search returns None when there is no match, so guard before calling start()/end().
if first_match:
    start = first_match.start()
    end = first_match.end()
    print("Found the first case insensitive match at position:", start, end, "-- found:", doc1[start:end])


Found the first case insensitive match at position: 41 48 -- found: Workers
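
For reference, slicing the document with `start()` and `end()` yields the same string as the match object's `group(0)`, and `span()` returns both positions at once. The sketch below illustrates this on a made-up `sample` string rather than on *doc1*.

```python
import re

# Hypothetical sample text, not one of the homework documents.
sample = "Most Workers said other workers agreed."

match = re.search(r"workers", sample, flags=re.IGNORECASE)

if match is not None:
    # Manual extraction using the match positions.
    sliced = sample[match.start():match.end()]
    # group(0) is the full matched text, the same string as the slice.
    whole = match.group(0)
    print(match.span(), sliced, whole)  # (5, 12) Workers Workers
```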


#### 3) (1 POINT) Search *doc1* for the word 'family' (case insensitive), print the matches, and print the number of matches.

In [580]:
q3 = r"(?i)family"
matches = re.finditer(q3, doc1)
for match in enumerate(matches):
    print(match)

(0, <re.Match object; span=(58, 64), match='family'>)
(1, <re.Match object; span=(436, 442), match='Family'>)
(2, <re.Match object; span=(886, 892), match='family'>)


In [581]:
print("The number of matches is", len(list(re.finditer(q3, doc1))))

The number of matches is 3


#### 4) (1 POINT) Search *doc1* for dollar amounts, print the matches, and print the number of matches. Dollar amounts start with "$" followed by digits and possibly commas.

Note: "$" will have different meanings in a regex, so take care to use it properly in this context.
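
To illustrate the note, here is a small sketch of how an unescaped `$` differs from an escaped one. The `sample` string is an assumption chosen for the example.

```python
import re

# Hypothetical sample text, not one of the homework documents.
sample = "Cost: $5"

# Unescaped, "$" is an anchor: it matches the empty string at the end of the text.
anchor = re.search(r"$", sample)

# Escaped, "\$" matches a literal dollar sign.
literal = re.search(r"\$", sample)

# re.escape produces the escaped form of a literal string.
escaped = re.escape("$")

print(anchor.span())   # (8, 8)
print(literal.span())  # (6, 7)
print(escaped)         # \$
```

Writing the pattern as a raw string (`r"..."`) keeps the backslash from being interpreted by Python before the regex engine sees it.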

In [582]:
q4 = r"\$\d+(?:,\d+)?"
matches = re.finditer(q4, doc1)
for match in enumerate(matches):
    print(match)

(0, <re.Match object; span=(141, 148), match='$22,000'>)
(1, <re.Match object; span=(354, 360), match='$6,000'>)
(2, <re.Match object; span=(578, 585), match='$22,221'>)
(3, <re.Match object; span=(646, 652), match='$7,739'>)
(4, <re.Match object; span=(684, 690), match='$1,300'>)


In [583]:
print("The number of matches is", len(list(re.finditer(q4, doc1))))

The number of matches is 5


#### 5) (2 POINTS) Search *doc1* for numbers that are neither percentages nor dollar amounts. Print the matches, and print the number of matches.


Examples:  
55 is a match, and 55,000 is a match, and 55. is a match (the last could occur at the end of a sentence, for example).  
$55,000 is not a match, and 55% is not a match
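
One subtlety when excluding percentages with a negative lookahead is backtracking: if the full number is rejected, the engine retries with fewer digits. The sketch below shows the effect on a made-up `sample` string; the patterns are illustrations, not the required solution.

```python
import re

# Hypothetical sample text, not one of the homework documents.
sample = "up 47% this year"

# "47" is followed by "%", so the engine backtracks to "4",
# which is followed by "7" rather than "%", and the lookahead passes.
naive = re.findall(r"\d+(?!%)", sample)

# Also forbidding a following digit stops a match from ending mid-number.
strict = re.findall(r"\d+(?![\d%])", sample)

print(naive)   # ['4']
print(strict)  # []
```

The same reasoning applies to a lookbehind for `$`: it only inspects the character immediately before the match, so a match can still begin at a later digit of a dollar amount unless preceding digits and commas are excluded as well.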


In [642]:
# The lookbehind rejects a preceding "$", digit, or comma, so no match starts inside a dollar amount.
# The lookahead rejects a following digit or "%", so backtracking cannot end a match mid-number.
q5 = r"(?<![\$\d,])\d+(?:,\d{3})*(?![\d%])"
matches = re.finditer(q5, doc1)
for match in enumerate(matches):
    print(match)

(0, <re.Match object; span=(424, 428), match='2021'>)
(1, <re.Match object; span=(637, 641), match='2021'>)
(2, <re.Match object; span=(736, 739), match='155'>)
(3, <re.Match object; span=(1022, 1028), match='21,000'>)


In [643]:
print("The number of matches is", len(list(re.finditer(q5, doc1))))

The number of matches is 4


#### The following questions ask you to search doc2.

#### 6) (2 POINTS) Search *doc2* for two or more words (consisting of only letters) joined by dashes. Print the matches, and print the number of matches.

Examples: "twenty-year-old" and "all-star"  
Non-examples: '22-year' and '110-90' are not matches as they contain numbers
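
Two details are easy to get wrong here: the character range `A-z` covers more than letters, and "two or more words" means the dash-plus-word part must be allowed to repeat. The following is a sketch under those assumptions, run against the examples and non-examples above; it is one possible pattern, not the only valid one.

```python
import re

# "A-z" spans ASCII 65 to 122, which also includes [ \ ] ^ _ and the backtick.
too_wide = re.findall(r"[A-z]", "_^[]")
letters_only = re.findall(r"[A-Za-z]", "_^[]")
print(too_wide)      # ['_', '^', '[', ']']
print(letters_only)  # []

# A letters-only word followed by one or more "-word" parts.
hyphenated = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)+")

samples = ["twenty-year-old", "all-star", "22-year", "110-90"]
found = {text: hyphenated.findall(text) for text in samples}
for text, hits in found.items():
    print(text, "->", hits)
```

Note that this pattern would still match the letters-only tail of a mixed token (for example `year-old` inside `33-year-old`); whether that should count depends on how the prompt is read.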


In [644]:
q6 = r"[a-zA-Z]+(?:-[a-zA-Z]+)+"
matches = re.finditer(q6, doc2)
for match in enumerate(matches):
    print(match)

(0, <re.Match object; span=(277, 285), match='year-old'>)
(1, <re.Match object; span=(333, 347), match='three-pointers'>)
(2, <re.Match object; span=(444, 455), match='three-point'>)
(3, <re.Match object; span=(669, 677), match='two-time'>)
(4, <re.Match object; span=(779, 787), match='all-time'>)
(5, <re.Match object; span=(824, 834), match='long-range'>)


In [645]:
print("The number of matches is", len(list(re.finditer(q6, doc2))))

The number of matches is 6


#### 7) (1 POINT) Search *doc2* for all words starting with an uppercase letter.  Print the matches, and print the number of matches. 

In [646]:
q7 = r"\b[A-Z]+[a-z]+"
matches = re.finditer(q7, doc2)
for match in enumerate(matches):
    print(match)

(0, <re.Match object; span=(0, 5), match='Curry'>)
(1, <re.Match object; span=(44, 51), match='Chicago'>)
(2, <re.Match object; span=(52, 57), match='Bulls'>)
(3, <re.Match object; span=(64, 66), match='It'>)
(4, <re.Match object; span=(98, 103), match='Steph'>)
(5, <re.Match object; span=(104, 109), match='Curry'>)
(6, <re.Match object; span=(128, 135), match='Earlier'>)
(7, <re.Match object; span=(159, 163), match='Wilt'>)
(8, <re.Match object; span=(164, 175), match='Chamberlain'>)
(9, <re.Match object; span=(249, 252), match='And'>)
(10, <re.Match object; span=(256, 262), match='Friday'>)
(11, <re.Match object; span=(310, 313), match='Ray'>)
(12, <re.Match object; span=(314, 319), match='Allen'>)
(13, <re.Match object; span=(408, 413), match='Curry'>)
(14, <re.Match object; span=(472, 478), match='Golden'>)
(15, <re.Match object; span=(479, 484), match='State'>)
(16, <re.Match object; span=(485, 493), match='Warriors'>)
(17, <re.Match object; span=(515, 522), match='Chicago'>)
(18,

In [647]:
print("The number of matches is", len(list(re.finditer(q7, doc2))))

The number of matches is 25
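
Whether all-caps words such as acronyms count as "starting with an uppercase letter" is a matter of how the prompt is read. The sketch below compares two patterns on a made-up sentence: one that requires at least one lowercase letter after the capitals, and one that accepts any word whose first letter is uppercase.

```python
import re

# Hypothetical sample text, not one of the homework documents.
sample = "NBA star Steph Curry scored"

# Requires at least one lowercase letter, so an all-caps word is skipped.
lower_required = re.findall(r"\b[A-Z]+[a-z]+", sample)

# Accepts any word whose first letter is uppercase, including acronyms.
any_caps = re.findall(r"\b[A-Z][A-Za-z]*", sample)

print(lower_required)  # ['Steph', 'Curry']
print(any_caps)        # ['NBA', 'Steph', 'Curry']
```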


#### 8) (1 POINT) Search *doc2* for the word "in." Print the matches, and print the number of matches. 

Example: "Jordan is *in* the house"  
Non-example: Jordan is ready to win (careful not to match on the substring "in" in "win")
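
The word-boundary assertion `\b` is what separates the standalone word from the substring. Here is a short sketch using a made-up sentence that combines the example and the non-example above.

```python
import re

# Hypothetical sentence built from the example and non-example above.
sample = "Jordan is in the house and ready to win inside."

# Without boundaries, "in" is also found inside "win" and "inside".
loose = [m.span() for m in re.finditer(r"in", sample)]

# \b asserts a word boundary on each side, so only the standalone word matches.
strict = [m.span() for m in re.finditer(r"\bin\b", sample)]

print(len(loose))  # 3
print(strict)      # [(10, 12)]
```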

In [648]:
q8 = r"\bin\b"
matches = re.finditer(q8, doc2)
for match in enumerate(matches):
    print(match)

(0, <re.Match object; span=(13, 15), match='in'>)
(1, <re.Match object; span=(239, 241), match='in'>)
(2, <re.Match object; span=(355, 357), match='in'>)
(3, <re.Match object; span=(393, 395), match='in'>)
(4, <re.Match object; span=(465, 467), match='in'>)
(5, <re.Match object; span=(547, 549), match='in'>)


In [649]:
print("The number of matches is", len(list(re.finditer(q8, doc2))))

The number of matches is 6


#### 9) (1 POINT) Search *doc2* for a number followed by the word "points."  
####    Include capture groups in the regex to extract the number of points, and print the number.  
####    Credit is only given if you use capture groups in this exercise.
Hint: use the search() function.
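
As a sketch of how capture groups behave with `search()`: `group(0)` is the whole match, while `group(1)` is only what the first pair of parentheses captured. Keeping the whitespace outside the parentheses keeps it out of the captured number. The `sample` string is made up for illustration.

```python
import re

# Hypothetical sample text, not one of the homework documents.
sample = "She scored 41 points and 9 rebounds."

match = re.search(r"\b(\d+)\s+(points)\b", sample)

if match:
    print(match.group(0))  # 41 points
    print(match.group(1))  # 41
    print(match.group(2))  # points
    # The captured text is a string; convert it if a number is needed.
    points = int(match.group(1))
```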


In [650]:
q9 = r"\b([0-9]+)\s+(points)"
match = re.search(q9, doc2)
if match:
    # Group 1 holds only the digits; the whitespace stays outside the group.
    print("Number of points is", match.group(1))

Number of points is 50
