# **In-Class Assignment: Basics in Regular Expressions**

## *IS 5150*
## Name: KEY

In this in-class assignment we will cover some basic functions of regex to give you an idea of how you might use it for pattern matching and text extraction. Regex is a pain, but it's also a useful tool that will come up repeatedly in this course, so it's good to understand the basics well enough that you feel confident Googling solutions when you need them (because most people don't memorize regex...).

Let's begin by importing `re`, which is Python's regex library.

In [None]:
import re  # Python's built-in regular expression module

## **Basic Applications of `re`**

Let's explore some of the basic functions in the `re` library for python:

### **Searching Strings**
Using the `search` function from `re`, write a regex pattern and search the provided strings for any alphanumeric character (i.e., a-z, A-Z, 0-9); use a print statement to show whether each string has a match or not.

In [None]:
string1 = "&^%$#@"
string2 = "Other bookmarks"

pattern = r'[a-zA-Z0-9]'
print(re.search(pattern, string1))
print(re.search(pattern, string2))

None
<re.Match object; span=(0, 1), match='O'>
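
The cell above prints the raw return value of `re.search`, which is either a match object or `None`. As a minimal sketch, that result can be turned into an explicit match/no-match message; the helper name `has_alnum` is hypothetical and not part of `re`.

```python
import re

def has_alnum(text):
    """Return True if text contains at least one ASCII letter or digit."""
    # re.search returns a Match object on success and None otherwise
    return re.search(r'[a-zA-Z0-9]', text) is not None

for s in ["&^%$#@", "Other bookmarks"]:
    print(repr(s), "-> match" if has_alnum(s) else "-> no match")
```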


### **Findall**

Now, using the `findall` function from `re`, print all matches of the same regex pattern in the provided string.

In [None]:
str = 'it is the way & not'
pattern = r'[a-zA-Z0-9]'

re.findall(pattern, str)

['i', 't', 'i', 's', 't', 'h', 'e', 'w', 'a', 'y', 'n', 'o', 't']

### **Kleenes**
Add a *Kleene* operator to your regex pattern to return the matches as whole words (e.g., "it", "is", "the", "way") rather than as individual character matches.

In [None]:
str = 'it is the way & not'
pattern = r'[a-zA-Z0-9]+'

re.findall(pattern, str)

['it', 'is', 'the', 'way', 'not']
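
The answer above uses `+` (one or more) rather than `*` (zero or more). A short sketch of why that choice matters with `findall`, using the same string; the variable names here are only for illustration.

```python
import re

text = 'it is the way & not'

# '+' requires at least one character, so every result is a non-empty word
plus_matches = re.findall(r'[a-zA-Z0-9]+', text)

# '*' also accepts zero characters, so empty strings show up
# at positions where the character class does not match
star_matches = re.findall(r'[a-zA-Z0-9]*', text)

print(plus_matches)
print(star_matches)
```

Filtering the empty strings out of the `*` result gives the same list of words as the `+` pattern.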

### **Substitutions**
Use the `sub` function from `re` to remove all leading zeros from the provided IP address:

In [None]:
str = '100.001.055.255'
print(re.sub(r'\.0*', '.', str))  # raw string so that \. is passed to re unchanged

100.1.55.255
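
Note that the pattern above only strips zeros that directly follow a dot, so a leading zero in the first group would be left alone and an all-zero group such as `000` would be erased entirely. One possible way to handle those cases, sketched here with a word boundary and a capture group; the helper name `strip_leading_zeros` and the second test address are hypothetical.

```python
import re

def strip_leading_zeros(ip):
    """Remove leading zeros from each dotted group, keeping at least one digit."""
    # \b0+ matches zeros at the start of a group; (\d) captures the digit
    # that follows so it can be written back with \1
    return re.sub(r'\b0+(\d)', r'\1', ip)

print(strip_leading_zeros('100.001.055.255'))  # 100.1.55.255
print(strip_leading_zeros('010.000.001.001'))  # 10.0.1.1
```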


## **Regex Operators**

Next we'll explore some regex operators we learned about in class, given the following string: "https://www.youtube.com/watch?v=ewgCqJDI_Nk"

In [None]:
str = "https://www.youtube.com/watch?v=ewgCqJDI_Nk"

### **Disjunction**
Using *disjunction*, search for the letters `u`, `o`, or `f` in the provided string:

In [None]:
pattern = r'[uof]'

print(re.search(pattern, str))

<re.Match object; span=(13, 14), match='o'>


### **Range**

Using *Range*, find all **sequences** of lowercase letters in the provided string:

In [None]:
pattern = r'[a-z]+'

print(re.findall(pattern, str))

['https', 'www', 'youtube', 'com', 'watch', 'v', 'ewg', 'q', 'k']


### **Negation**

Using *Negation*, search for the first instance of a non-alphanumeric character (e.g., &, ^, $) in the provided string:

In [None]:
pattern = r'[^A-Za-z0-9]'
print(re.search(pattern, str))

<re.Match object; span=(5, 6), match=':'>


### **Optionality**

Sometimes URLs use 'http', while other times they use 'https' in their web addresses. Use *Optionality* to search for a match of http *or* https in the provided string:

*hint: also use concatenation to search for the entire string of http and https*

In [None]:
pattern = r'https?'
print(re.search(pattern, str))

<re.Match object; span=(0, 5), match='https'>
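
Because `?` applies only to the character immediately before it, `https?` reads as "http, optionally followed by s". A quick sketch checking the same pattern against both schemes; the two `example.com` URLs are placeholders, not part of the assignment.

```python
import re

pattern = r'https?'

# Placeholder URLs, one per scheme
urls = ['http://example.com', 'https://example.com']

schemes = [re.search(pattern, u).group() for u in urls]
print(schemes)  # ['http', 'https']
```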


### **Wildcard**


Use the *wildcard* operator in combination with a *Kleene* operator to identify the string following 'watch' in the YouTube URL that contains the unique video ID sequence (*hint: remember that ? is a special character.*):

In [None]:
str

'https://www.youtube.com/watch?v=ewgCqJDI_Nk'

In [None]:
pattern = r'\?.+'
print(re.findall(pattern, str))

['?v=ewgCqJDI_Nk']
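
The match above still includes the `?v=` prefix. If only the ID itself is wanted, one option is a capture group around the wildcard portion, as in this sketch; it assumes the ID is everything after `?v=`, which holds for this URL but not for URLs with additional query parameters.

```python
import re

url = "https://www.youtube.com/watch?v=ewgCqJDI_Nk"

# \?v= matches the literal prefix; (.+) captures whatever follows it
match = re.search(r'\?v=(.+)', url)
video_id = match.group(1) if match else None

print(video_id)  # ewgCqJDI_Nk
```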


## **Shorthand Operators**

Next let's try out some of the additional shorthand operators, using this Wikipedia paragraph:

>> *Pennatomys nivalis is an extinct oryzomyine rodent from the islands of Sint Eustatius, Saint Kitts, and Nevis in the Lesser Antilles (range pictured). It is known from skeletal remains found in Amerindian archeological sites on all three islands, with dates ranging from 790–520 BCE to 900–1200 CE.*

In [None]:
str = "Pennatomys nivalis is an extinct oryzomyine rodent from the islands of Sint Eustatius, Saint Kitts, and Nevis in the Lesser Antilles (range pictured). It is known from skeletal remains found in Amerindian archeological sites on all three islands, with dates ranging from 790–520 BCE to 900–1200 CE."

### **ID numeric characters**
Use a regex shorthand operator to identify all numeric sequences in the given paragraph:

In [None]:
pattern = r'\d+'
print(re.findall(pattern, str))

['790', '520', '900', '1200']


### **ID numeric sequences of length x,y**
Use regex shorthand operators to identify all numeric sequences that are 1 to 4 digits long:

In [None]:
pattern = r'\d{1,4}'
print(re.findall(pattern, str))

['790', '520', '900', '1200']
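
In this paragraph no number is longer than four digits, so `\d{1,4}` and `\d+` return the same list. The difference shows up on longer runs, where `\d{1,4}` splits the run into chunks. A small sketch with a made-up sample string; adding `\b` on both sides is one way to skip longer runs instead of splitting them.

```python
import re

sample = 'codes: 7, 1200, 123456'  # made-up string for illustration

# Without boundaries, the six-digit run is split into '1234' and '56'
chunks = re.findall(r'\d{1,4}', sample)

# With word boundaries, only runs of at most four digits match
bounded = re.findall(r'\b\d{1,4}\b', sample)

print(chunks)   # ['7', '1200', '1234', '56']
print(bounded)  # ['7', '1200']
```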


### **ID words at sentence end**
Use regex shorthand operators to search for a word of any length, followed by punctuation, at the end of a string.

In [None]:
pattern = r'\w+\S$'  # raw string so that \w and \S are passed to re unchanged
print(re.search(pattern, str))

<re.Match object; span=(295, 298), match='CE.'>
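
One caveat: `\S` matches any non-whitespace character, letters included, so the pattern above also matches a string that ends without punctuation. A stricter variant, sketched below, replaces `\S` with `[^\w\s]` (a character that is neither a word character nor whitespace); the two test strings are made up.

```python
import re

loose = r'\w+\S$'        # final character may be a letter
strict = r'\w+[^\w\s]$'  # final character must be punctuation or a symbol

results = {}
for s in ['the end.', 'the end']:
    results[s] = (bool(re.search(loose, s)), bool(re.search(strict, s)))
    print(repr(s), results[s])
```

Only the string ending in a period satisfies the strict pattern, while the loose pattern accepts both.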


## **Regex for search and substitution**

Finally, let's use regex to search and substitute substrings within a larger string.

Search for a match of each of five greeting words within the provided string (hint: remember to search for both upper and lowercase):

In [None]:
str = 'You say, Goodbye and I say, Hello, hello, hello. I do not know why you say, Goodbye, I say, Hello, hello, hello.'

In [None]:
patterns = ['[Hh]ello', '[Gg]oodbye', '[Gg]ood day', '[Gg]reetings', '[Ss]alutations']
for pattern in patterns:
    if re.search(pattern, str):
        print('Matched!')
    else:
        print('Not Matched!')

Matched!
Matched!
Not Matched!
Not Matched!
Not Matched!
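
The loop above reports only whether each greeting occurs. As an alternative sketch, the five words can be joined with `|` and passed to `findall` with the `re.IGNORECASE` flag to list every occurrence. Note that this flag is broader than `[Hh]ello`: it would also accept forms such as `HELLO`.

```python
import re

lyrics = ('You say, Goodbye and I say, Hello, hello, hello. '
          'I do not know why you say, Goodbye, I say, Hello, hello, hello.')

# One alternation covering all five greetings, matched case-insensitively
greetings = r'hello|goodbye|good day|greetings|salutations'
found = re.findall(greetings, lyrics, flags=re.IGNORECASE)

print(found)
```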



Now use the `sub` function from `re` to replace spaces with underscores and vice versa.

In [None]:
str = 'ID Number Column'
str2 = 'First_Name'

In [None]:
print(re.sub(" ", "_", str))   # spaces -> underscores
print(re.sub("_", " ", str2))  # underscores -> spaces

ID_Number_Column
First Name

### **Extra Credit (1 pt):** 

Use regex to search for the year, month, and day in an article URL (*hint: yyyy/mm/dd*).

In [None]:
str = 'https://www.cnbc.com/2022/07/15/millennials-are-to-blame-for-sky-high-inflation-strategist-says.html'

In [None]:
pattern = r'/(\d{4})/(\d{1,2})/(\d{1,2})/'
re.search(pattern, str)

<re.Match object; span=(20, 32), match='/2022/07/15/'>
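
The parentheses in the pattern above are capture groups, so the year, month, and day can be pulled out of the match individually. A minimal sketch using `groups()` on the same URL; the variable names are only for illustration.

```python
import re

url = 'https://www.cnbc.com/2022/07/15/millennials-are-to-blame-for-sky-high-inflation-strategist-says.html'

match = re.search(r'/(\d{4})/(\d{1,2})/(\d{1,2})/', url)
if match:
    # groups() returns the captured substrings in order, as strings
    year, month, day = match.groups()
    print(year, month, day)  # 2022 07 15
```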