# CS 124 Tutorial: Regular Expressions

Based on the `CS 124: Jupyter and Python Tutorial` created by 
`Krishna Patel (Winter 2020)`, and updated by `Bryan Kim (Winter 2021)` and 
`Dilara Soylu (Winter 2022)`.

<a id='overview'></a>
## Overview

In this tutorial, we will walk you through some `Regular Expression` examples as
preparation for our first assignment, `PA 1`. 
This tutorial assumes that you have completed the following from the 
[PA 0 repository](https://github.com/cs124/pa0-python-jupyter-tutorial) and you 
are familiar with `Python`:
* Setup instructions for your machine
* [Jupyter Notebook Tutorial](https://github.com/cs124/pa0-python-jupyter-tutorial/blob/main/jupyter_tutorial.ipynb)

<a id='contents'></a>
## Contents

1. [Environment Check](#environment_check)
2. [`Regular Expressions` Exercises](#regular_expressions_exercises)
3. [Answers](#answers)
4. [Next Steps](#next_steps)

<a id='environment_check'></a>
### Environment Check

Let's ensure that we are running our notebook in the correct environment.

In [1]:
# Check the name of the conda environment
import os
assert os.environ['CONDA_DEFAULT_ENV'] == "cs124"

# Check that the Python version is 3.8
import sys
assert sys.version_info.major == 3 and sys.version_info.minor == 8

If the above cell causes an error, it means that you are using the wrong 
environment or `Python` version!
If this is the case, please follow the troubleshooting steps shared in the 
[Jupyter Notebook Tutorial](https://github.com/cs124/pa0-python-jupyter-tutorial/blob/main/jupyter_tutorial.ipynb).

<a id='regular_expressions_exercises'></a>
## `Regular Expressions` Exercises

`Regular Expressions` (`RegEx`) are usually used to search for patterns in 
strings, or check if a string matches a pattern. 
`Python` has a regular expression module that helps us execute regular 
expressions on bodies of text. 

In [2]:
# Import the standard Python RegEx library
import re

In [None]:
# As an example, we can use a comma as our RegEx pattern and use this pattern
# to split a string

input_str = "a::b,c.d,e;:f,g"

# It is a good habit to mark RegEx patterns with the "r" prefix. In this
# case it doesn't matter, but "r" is needed for the RegEx to be read
# correctly when using special RegEx characters like \b, \w, etc. 

# This pattern matches a single comma
pattern = r","

# re.split splits the input string at the matching patterns
tokens = re.split(pattern, input_str)

tokens
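To see why the "r" prefix matters, here is a minimal sketch comparing a plain string with a raw string for the same pattern. The variable names (`plain`, `raw`, `sentence`) are made up for this illustration and are not part of the assignment.

```python
import re

# Without the "r" prefix, Python turns "\b" into a single backspace character
# before the RegEx engine ever sees it.
plain = "\bdog\b"
# With the "r" prefix, the backslash and the "b" both reach the RegEx engine,
# which reads them as a word boundary.
raw = r"\bdog\b"

sentence = "my dog barks"  # hypothetical example sentence

plain_hits = re.findall(plain, sentence)
raw_hits = re.findall(raw, sentence)

print(len(plain), len(raw))   # 5 7
print(plain_hits, raw_hits)   # [] ['dog']
```

The plain version silently finds nothing, which is why it is safer to always write patterns as raw strings.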

In [None]:
# We could be a bit fancier, and allow our pattern to be any character in
# a set. Bracket notation [] indicates that we can match any of the characters
# in the brackets.

# This matches any ONE character in the set (a period, comma, semicolon, or
# colon). Note that although the period (".") is a special character in RegExes,
# we do not need to escape it with a backslash here because it is inside a
# character class, which is denoted by [ ]. More generally, any character other
# than ^, -, \, or ] is interpreted as a literal in a character class and does
# not need to be escaped.
pattern = r"[.,;:]"

tokens = re.split(pattern, input_str)

tokens

In [None]:
# We could even use special operators to describe more specific patterns.
# For example, the "+" operator means that it will match the object to its left
# at least once, but possibly multiple times. Note that if the object to the left
# is a set, it could be a different character from that set each time.

# This matches any sequence of one or more characters from the set
# [.,;:].
pattern = r"[.,;:]+"

tokens = re.split(pattern, input_str)

tokens
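For reference, the three split patterns above can be compared side by side. This is only a sketch; the names `sample`, `by_comma`, `by_any`, and `by_run` are chosen for this illustration.

```python
import re

sample = "a::b,c.d,e;:f,g"  # same string as input_str above

by_comma = re.split(r",", sample)       # split at each comma
by_any = re.split(r"[.,;:]", sample)    # split at each single delimiter
by_run = re.split(r"[.,;:]+", sample)   # split at each run of delimiters

print(by_comma)  # ['a::b', 'c.d', 'e;:f', 'g']
print(by_any)    # ['a', '', 'b', 'c', 'd', 'e', '', 'f', 'g']
print(by_run)    # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
```

Note the empty strings in the second result: two adjacent delimiters (such as "::") produce an empty token between them unless the pattern uses "+" to treat the run as one separator.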

In [None]:
# We can also use RegExes to find all the times a specific pattern appears
# in a string. For example, what if we wanted to find all the instances of
# the word "dog" in a text?

text = """
I love my dog Spot! Spot is the best dog in the world. He likes playing
with other dogs at the park. But he doesn't like cats, he is scared of them.
Today I will take him to the dog park. One time he saw a cat there and got so
scared he wanted to go home.
"""

# This matches just the 3-character sequence "dog"
pattern = r"dog"

tokens = re.findall(pattern, text)

tokens

In [None]:
# There are also other operators we can use, like ? to match the object to its
# left 0 or 1 times (in other words, it is optional).

# This matches "dog" followed by an optional s.
pattern = r"dog[s]?"

tokens = re.findall(pattern, text)

tokens

In [None]:
# Or we can match multiple possibilities (i.e. A or B)

# This matches "dog" or "cat".
pattern = r"(dog|cat)"

tokens = re.findall(pattern, text)

tokens
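One detail worth knowing about parentheses: when a pattern contains a capturing group, `re.findall` returns what the group captured rather than the whole match. The sketch below illustrates this with a made-up string (`pets`); `(?:...)` is the standard non-capturing group syntax.

```python
import re

pets = "dogs chase cats"  # hypothetical example string

# With a capturing group, findall returns only the text inside the group,
# so the optional trailing "s" is dropped from the results.
captured = re.findall(r"(dog|cat)s?", pets)

# With a non-capturing group (?:...), findall returns the whole match.
whole = re.findall(r"(?:dog|cat)s?", pets)

print(captured)  # ['dog', 'cat']
print(whole)     # ['dogs', 'cats']
```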

In [None]:
# A particularly common operator is the star "*" operator. It matches 0 or more
# of the object to its left. For example, we can use it to match any word that
# starts with an a (an a or A at the start of a word, followed by 0 or more of
# any letter).

# We can use the \b symbol to match a word boundary (here, the start of a word)
# and the \w symbol to match a word character (a letter, digit, or underscore).
# See https://docs.python.org/3/library/re.html for details and other
# special symbols.

# This matches any word starting with an a or A
pattern = r"\b[aA]\w*"

tokens = re.findall(pattern, text)

tokens
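As a quick check of what `\b` and `\w` actually match, here is a small sketch on two made-up strings. It assumes Python's default Unicode matching for `str` patterns.

```python
import re

# \w matches letters, digits, and the underscore, but not the hyphen.
word_chars = re.findall(r"\w+", "a_1 b-2")

# \b matches at either edge of a word, so \bcat\b finds only the standalone word.
whole_word = re.findall(r"\bcat\b", "cat concat cats")

# A trailing \b alone anchors the END of a word.
ends_in_at = re.findall(r"\w+at\b", "cat concat cats")

print(word_chars)  # ['a_1', 'b', '2']
print(whole_word)  # ['cat']
print(ends_in_at)  # ['cat', 'concat']
```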

In [None]:
# Beyond the re.findall() function, we also often use re.search() if we want
# more flexible control over how we match/search
# (example from https://www.w3schools.com/python/python_regex.asp)

# re.search() is slightly more complicated than re.findall(), because it
# doesn't just return a list of matches as strings. It instead returns a Python
# match object that contains more detailed information about the match.

# NOTE: period (".") is a special character indicating any character (except
# a newline). "^" matches the start of a string and "$" matches the end of a
# string.

text = r"The rain in Spain"
match = re.search(r"^The.*Spain$", text)

# Print the match object
print(match)

# Get the full matched text (same as match.group(0))
print(match[0])

# Get the original string back
print(match.string)

# Get the group (the entire part of the string where the match happened)
print(match.group())

# Find the span, the tuple (start_index, end_index).
# In other words the positions (from the start of the string, counting from
# 0), where the match starts and ends.
print("span: " + str(match.span()))
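One thing to watch for: `re.search()` returns `None` when nothing matches, so calling `.group()` on the result directly can raise an `AttributeError`. Below is a minimal sketch of a guard; the helper name `first_match` is hypothetical and only used for this illustration.

```python
import re

def first_match(pattern, s):
    """Return the matched text, or None if the pattern does not match."""
    m = re.search(pattern, s)
    return m.group() if m is not None else None

hit = first_match(r"^The.*Spain$", "The rain in Spain")
miss = first_match(r"^The.*Spain$", "The rain in Portugal")

# The end index of a span is exclusive, so it can be used directly for slicing.
start, end = re.search(r"ai", "The rain in Spain").span()

print(hit)         # The rain in Spain
print(miss)        # None
print(start, end)  # 5 7
```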

In [11]:
# We can also use capture groups (in parentheses) to indicate
# parts of the match that we want to save so that we can use them separately

# Using your knowledge of RegExes from the previous examples, what does this
# pattern mean? What do you expect it to do/match from text? What parts of text
# will fall in each of the 3 capture groups?
ai_match = re.search(r"(.*?)ai(.*?)ai(.*)", text)

In [None]:
# Let's check if your guesses were correct:

# Entire match
ai_match.group(0)

In [None]:
# First capturing group 
ai_match.group(1)

In [None]:
# Second capturing group
ai_match.group(2)

In [None]:
# Third capturing group
ai_match.group(3)
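The difference between `.*?` (lazy) and `.*` (greedy) is what decides where each capture group ends. The sketch below shows both on the same sentence; the variable names are chosen for this illustration.

```python
import re

sentence = "The rain in Spain"

# Lazy: .*? grabs as little as possible, so it stops before the FIRST "ai".
lazy = re.search(r"(.*?)ai", sentence).group(1)

# Greedy: .* grabs as much as possible, so it stops before the LAST "ai".
greedy = re.search(r"(.*)ai", sentence).group(1)

# groups() returns all capture groups of a match at once, as a tuple.
all_groups = re.search(r"(.*?)ai(.*?)ai(.*)", sentence).groups()

print(repr(lazy))    # 'The r'
print(repr(greedy))  # 'The rain in Sp'
print(all_groups)    # ('The r', 'n in Sp', 'n')
```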

Let's try a more difficult example: 

__HTML Content Extraction__ Try to extract the inside of an HTML Tag that 
follows the following rules:

*   Start tags start with "<", have at least 1 alphanumeric character,
and then end with ">"
*   End tags start with "</", have at least 1 alphanumeric character,
and then end with ">"
*   There can be any character between the two tags

Here are some examples of the text:

`<html>`this is what we want to extract`</html>`

don't want`<h1>`what we want `</h1>` don't want

---

If you want to make it harder, make sure you pass this test case, since
technically HTML tags must match (the contents of the start tag must match
the contents of the end tag):

`<html>`this is what we want to extract `</h1>` `</html>` 

You should extract: this is what we want to extract `</h1>`

This problem was inspired by the `Regex Examples` in this 
[`link`](https://www.sitepoint.com/demystifying-regex-with-practical-examples/).

In [16]:
# Try to solve this here

test_str1 = "<html>this is what we want to extract</html>"
test_str2 = "don't want<h1>what we want </h1> don't want"

hard_case = "<html>this is what we want to extract </h1> </html>"

Solved it? You can find the answer in the [Answers](#answers) section.

Let's try a longer example: 

__URL Matching__ Try to match a URL that follows the following rules:

* the URL must start with http or https followed by ://
* the domain name can only be alphanumeric or contain "." or "-"
* can contain a port specification (http://abc.com:80) (you can assume ports
go from 0 to 99)
* after the port, the URL can contain any number of alphanumeric characters,
dots and hyphens

This problem was inspired by the `Regex Examples` in this 
[`link`](https://www.sitepoint.com/demystifying-regex-with-practical-examples/).

In [None]:
# Try to solve this here

test_str1 = "http://www.google.com"
test_str2 = "https://www.gmail.com:88/hello-hi"
test_str3 = "http://abd-fh.8rhgyt.org:90/h-"


<a id='answers'></a>
## Answers

You can find the answers to the last two exercises below.

__HTML Content Extraction__

In [None]:
import re
test_str1 = "<html>this is what we want to extract</html>"
test_str2 = "don't want<h1>what we want </h1> don't want"

# Answer 1 -- works on first 2 test strings
match = re.search(r"<[\w]+>(.*?)</[\w]+>", test_str1)

match.group(1)

In [None]:
hard_test = "<html>this is what we want to extract </h1> </html>"

# Answer 2 -- works on hard case as well
match = re.search(r"<([\w]+)>(.*?)<\/\1>", hard_test)

match.group(2)

# NB: \1 will match the same text matched by the first capturing group
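To check the backreference version against all three test strings at once, one option is a small loop like the sketch below. The names `TAG_CONTENT`, `cases`, and `extracted` are made up for this illustration, and the pattern drops the redundant brackets and backslash from Answer 2 without changing what it matches.

```python
import re

# Same idea as Answer 2: \1 must repeat whatever group 1 (the tag name) matched.
TAG_CONTENT = re.compile(r"<(\w+)>(.*?)</\1>")

cases = [
    "<html>this is what we want to extract</html>",
    "don't want<h1>what we want </h1> don't want",
    "<html>this is what we want to extract </h1> </html>",
]

# Group 2 holds the text between the matching start and end tags.
extracted = [TAG_CONTENT.search(s).group(2) for s in cases]

for content in extracted:
    print(repr(content))
# 'this is what we want to extract'
# 'what we want '
# 'this is what we want to extract </h1> '
```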

__URL Matching__

In [None]:
import re
test_str1 = "http://www.google.com"
test_str2 = "https://www.gmail.com:88/hello-hi"
test_str3 = "http://abd-fh.8rhgyt.org:90/h-"

match = re.search(r"https?://([a-zA-Z0-9-\.]+)(:[0-9][0-9]?)?/?([a-zA-Z0-9-\.]*)", test_str3)

match.group(0)
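Note that `re.search()` succeeds if the pattern matches anywhere in the string. If the goal is to validate that the whole string is a URL (an assumption that goes slightly beyond the exercise as stated), `re.fullmatch()` is one way to do it. The sketch below uses hypothetical names (`URL_PATTERN`, `is_url`) and writes the character class as `[a-zA-Z0-9.-]`, which matches the same characters as the version above.

```python
import re

# Same pattern as the answer above, with the hyphen placed last in each
# character class so it is clearly a literal.
URL_PATTERN = re.compile(
    r"https?://([a-zA-Z0-9.-]+)(:[0-9][0-9]?)?/?([a-zA-Z0-9.-]*)"
)

def is_url(s):
    """Return True only if the ENTIRE string matches the pattern."""
    return URL_PATTERN.fullmatch(s) is not None

valid = [
    "http://www.google.com",
    "https://www.gmail.com:88/hello-hi",
    "http://abd-fh.8rhgyt.org:90/h-",
]
invalid = [
    "ftp://www.google.com",        # wrong scheme
    "http://www.google.com/a b",   # space is not an allowed character
]

valid_results = [is_url(u) for u in valid]
invalid_results = [is_url(u) for u in invalid]

# The capture groups expose the domain, the optional port, and the trailing part.
parts = URL_PATTERN.fullmatch(valid[1]).groups()

print(valid_results)    # [True, True, True]
print(invalid_results)  # [False, False]
print(parts)            # ('www.gmail.com', ':88', 'hello-hi')
```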

<a id='next_steps'></a>
## Next Steps

You can now get started on the first assignment!
If you want more practice with regular expressions, you can try the following:
* [learnpython problem](https://www.learnpython.org/en/Regular_Expressions)