<a href="https://colab.research.google.com/github/emmar2000/DSCI330/blob/main/module2_lectures/1_7_introduction_to_regular_expressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Regular Expressions

<img src="https://imgs.xkcd.com/comics/regular_expressions.png" width=400>

## RegEx Golf

<img src="https://imgs.xkcd.com/comics/regex_golf.png" width=600>

## Perl Problems

> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski

<img src="https://imgs.xkcd.com/comics/perl_problems.png" width=600>

## <font color="red"> Exercise 1 </font>

**Question:** What do all of these strings have in common?

* `"ab"`
* `"abab"`
* `"ababcde"`
* `"abcde"`

They all start with `"ab"`.

## <font color="red"> Exercise 2 </font>

**Task:** Predict the next three elements of the sequence.

* `"0"`
* `"1ab"`
* `"2abab"`
* `"3ababab"`


"4abababab", "5ababababab", "6abababababab"

In Python: `[str(i) + i*"ab" for i in range(4, 7)]`

## What is a regular expression

* Language for matching regular patterns
    * `"ab"`
    * `"abab"`
    * `"ababcde"`
    * `"abcde"`
* Can't match other, context-sensitive patterns, such as a count that must agree with the rest of the string
    * `"0"`
    * `"1ab"`
    * `"2abab"`
    * `"3ababab"`
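The distinction above can be sketched in code. The first family is described by a single pattern, while the second needs the number of `"ab"` copies to agree with the leading number, which this sketch checks with ordinary Python after the regex has split the string. The helper name `follows_counting_pattern` is hypothetical and is only here for illustration.

```python
import re

# Regular pattern: every string in the first family begins with "ab".
starts_with_ab = re.compile(r"ab")
regular_examples = ["ab", "abab", "ababcde", "abcde"]
all_regular_match = all(starts_with_ab.match(s) for s in regular_examples)


def follows_counting_pattern(s: str) -> bool:
    """Check for a number n followed by exactly n copies of 'ab'.

    Hypothetical helper: the regex only splits the string into its two
    parts; the counting itself is done in plain Python.
    """
    m = re.match(r"(\d+)((?:ab)*)$", s)
    if m is None:
        return False
    n = int(m.group(1))
    return m.group(2) == "ab" * n


counting_examples = ["0", "1ab", "2abab", "3ababab"]
all_counting_match = all(follows_counting_pattern(s) for s in counting_examples)
```

Here `follows_counting_pattern("2ab")` returns `False`, because the count and the repetitions disagree.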

## <font color="red"> Exercise 3 </font>

**Tasks:** 

1. Go to [RegexOne](https://regexone.com/)
2. Read the first section
3. Define a pattern that matches the first three strings.

In [None]:
pattern1 = "^abc"

## Regular Expressions Workflow in Python

* `import re`
* Compile a pattern `regex = re.compile(pattern)`
    * Use `r"pattern"` to avoid escaping symbols
* `m = regex.match(some_string)`

`regex.match` matches only at the beginning of the string

`regex.search` scans the whole string for the first match
> Both accept optional start/end positions (`pos`, `endpos`) that limit the region considered
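A small sketch of the difference between the two methods, including the optional position arguments. The example strings are made up for illustration.

```python
import re

regex = re.compile(r"ab")

# match succeeds only if the pattern occurs at the start of the string.
at_start = regex.match("abc")        # matches "ab" at span (0, 2)
not_at_start = regex.match("cab")    # None: the string starts with "c"

# search scans the string and returns the first occurrence.
anywhere = regex.search("cab")       # matches "ab" at span (1, 3)

# Both methods accept optional pos/endpos arguments.
from_pos_1 = regex.match("cab", 1)   # matching begins at index 1, so this succeeds
limited = regex.search("cab", 0, 2)  # only "ca" is considered, so this is None
```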

In [None]:
import re

In [None]:
pattern = r"ab" # r is for raw
regex = re.compile(pattern)
m1 = regex.match('abc')
m1

<_sre.SRE_Match object; span=(0, 2), match='ab'>

In [None]:
help(regex.search)

Help on built-in function search:

search(string=None, pos=0, endpos=9223372036854775807, *, pattern=None) method of _sre.SRE_Pattern instance
    Scan through string looking for a match, and return a corresponding match object instance.
    
    Return None if no position in the string matches.



## `match` returns `None` when there is no match

In [None]:
m2 = regex.match('acb')
m2

In [None]:
m2 is None

True

In [None]:
not m2 

True

## The result of a match can be used in Boolean expressions

In [None]:
"Yes" if m1 else "No"

'Yes'

In [None]:
"Yes" if m2 else "No"

'No'

## <font color="red"> Exercise 4 </font>

**Tasks:** Make sure your pattern from <font color="red"> Exercise 3</font> passes the following `assert` statements.

In [None]:
pattern1 = "^abc"
my_regex = re.compile(pattern1)
assert my_regex.match('abcdefg')
assert my_regex.match('abcde')
assert my_regex.match('abc')
assert not my_regex.match('acb')

## Whitespace and escaped characters

* Whitespace includes spaces, tabs, and newlines
* Python uses escape characters: `"\t"`, `"\n"`
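As a brief illustration of how this connects to regular expressions, the sketch below uses the `\s` class, which matches a single whitespace character. The sample text is made up.

```python
import re

text = "a b\tc\nd"

# \s matches one whitespace character: here a space, a tab, and a newline.
whitespace_found = re.findall(r"\s", text)

# Replacing each run of whitespace with a single space normalizes the text.
normalized = re.sub(r"\s+", " ", text)
```

After running this, `whitespace_found` is `[' ', '\t', '\n']` and `normalized` is `'a b c d'`.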

#### Use `"\n"` for newlines

In [None]:
"\n"

'\n'

In [None]:
print('\n')





In [None]:
print('a\nb')

a
b


#### Use `"\t"` for tab

In [None]:
"\t"

'\t'

In [None]:
print('\t')

	


In [None]:
print('a\tb')

a	b


## Why use `r"raw strings"` in `regex`

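* In regular Python strings, `\` introduces special characters: `'\n'`, `'\t'`
* In regular expressions, `\` is also used for
    * Escaping, e.g. `\.` (a literal dot) vs. `.` (any character)
    * Special sequences, e.g. `\d`, `\s`, `\b`
* Without a raw string, each regex backslash must be doubled
    * `'\\n'` to pass the regex escape `\n` (newline)
    * `'\\t'` to pass the regex escape `\t` (tab)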

In [None]:
r"\n" # A raw string lets us write the regex escape \n without doubling the backslash

'\\n'
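One case where forgetting the `r` prefix silently changes the meaning is `\b`. The sketch below compares the two spellings on a made-up sentence.

```python
import re

# A raw string keeps the backslash as an ordinary character.
assert len("\n") == 1     # one character: the newline itself
assert len(r"\n") == 2    # two characters: a backslash and the letter n
assert r"\n" == "\\n"

# "\b" in a regular string is a backspace character;
# r"\b" reaches the regex engine as the word-boundary escape.
with_raw = re.search(r"\bcat\b", "a cat sat")     # finds "cat" as a whole word
without_raw = re.search("\bcat\b", "a cat sat")   # looks for backspaces, finds nothing
```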

## Important `match object` methods

<table class="docutils" border="1">
<colgroup>
<col width="29%">
<col width="71%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Method/Attribute</th>
<th class="head">Purpose</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">group()</span></code></td>
<td>Return the string matched by the RE</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">start()</span></code></td>
<td>Return the starting position of the match</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">end()</span></code></td>
<td>Return the ending position of the match</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">span()</span></code></td>
<td>Return a tuple containing the (start, end)
positions  of the match</td>
</tr>
</tbody>
</table>

In [None]:
m1.group()

'ab'

In [None]:
m1.start(), m1.end()

(0, 2)

In [None]:
m1.span()

(0, 2)
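The positions are more interesting when the match is not at the start of the string. A short sketch with `search` on a made-up string:

```python
import re

text = "xxabxx"
m = re.compile(r"ab").search(text)

# start() and end() are ordinary string indices,
# so slicing the original string with them recovers group().
span = m.span()
sliced = text[m.start():m.end()]
```

Here `span` is `(2, 4)` and `sliced` equals `m.group()`, which is `'ab'`.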

## Be sure to check for `None`

In [None]:
m2 # example that DIDN'T match 

In [None]:
m2 is None # non-matches return None

True

In [None]:
m2.group() # AttributeError: 'NoneType' object has no attribute 'group'

## Solution 1

Always check for `None`

In [None]:
m2.group() if m2 else None
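The same check can be wrapped in a small function so it is written only once. This is a minimal sketch assuming Python 3.8 or later (for the `re.Pattern` annotation); the name `matched_text` is hypothetical.

```python
import re
from typing import Optional

regex = re.compile(r"ab")


def matched_text(pattern: re.Pattern, s: str) -> Optional[str]:
    """Return the text matched at the start of s, or None if there is no match.

    Hypothetical helper for illustration only.
    """
    m = pattern.match(s)
    return m.group() if m else None


found = matched_text(regex, "abc")    # 'ab'
missing = matched_text(regex, "acb")  # None, and no AttributeError is raised
```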

## Solution 2

Learn about the [`Maybe` monad](https://en.wikipedia.org/wiki/Monad_(functional_programming)#An_example:_Maybe). (We will tackle this in a later chapter.)

## Python regex methods

<table class="docutils" border="1">
<colgroup>
<col width="28%">
<col width="72%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Method/Attribute</th>
<th class="head">Purpose</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">match()</span></code></td>
<td>Determine if the RE matches at the beginning
of the string.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">search()</span></code></td>
<td>Scan through a string, looking for any
location where this RE matches.</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">findall()</span></code></td>
<td>Find all substrings where the RE matches, and
returns them as a list.</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">finditer()</span></code></td>
<td>Find all substrings where the RE matches, and
returns them as an <a class="reference internal" href="../glossary.html#term-iterator"><span class="xref std std-term">iterator</span></a>.</td>
</tr>
</tbody>
</table>

**Source:** [Python documentation](https://docs.python.org/3/howto/regex.html)
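`findall` and `finditer` are not shown elsewhere in this lecture, so here is a short sketch of both on a made-up sentence. Note that when a pattern contains more than one group, `findall` returns one tuple of groups per match.

```python
import re

digits = re.compile(r"\d+")
text = "12 drummers, 11 pipers, 10 lords"

# findall returns the matched substrings as a list of strings.
all_numbers = digits.findall(text)

# finditer yields match objects, so the positions are available as well.
spans = [m.span() for m in digits.finditer(text)]

# With several groups in the pattern, each result is a tuple of the groups.
sizes = re.compile(r"(\d+)x(\d+)").findall("1280x720 and 1920x1600")
```

This gives `all_numbers == ['12', '11', '10']`, `spans == [(0, 2), (13, 15), (24, 26)]`, and `sizes == [('1280', '720'), ('1920', '1600')]`.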

## A better `str.replace`

* We often chain many `replace` calls
* Example: `s.replace('(', '').replace(')','').replace(':', '')`
* We can use `re.sub` to simplify.

In [None]:
s = "The string (has) some: things in (it)"
s.replace('(', '').replace(')','').replace(':', '')

'The string has some things in it'

In [None]:
re.sub(r"[():]", '', s)

'The string has some things in it'

## Substitutions with a compiled RegEx

1. Compile a pattern
2. Use `pat.sub(new_substr, s)`

In [None]:
paren_or_colon = re.compile(r"[():]")
paren_or_colon.sub('', s)

'The string has some things in it'

## <font color="red"> Exercise 5 </font>

**Task:** Write and test a function that uses `re.sub` to remove all punctuation from a string.  **Hint:** Use the `punctuation` variable from the `string` module.

In [None]:
import string

def strip_punctuation(s: str) -> str:
  """Removes punctuation from a string.

  Args:
    s: A string

  Returns:
    String without punctuation
  """
  # Escape the punctuation characters so they are literal inside the character class
  patt = re.compile(rf"[{re.escape(string.punctuation)}]")
  return patt.sub("", s)

In [None]:
def test_strip_punctuation():
  assert strip_punctuation("This???@@##$ is#$% a%$^ test$% ^&string") == "This is a test string"

test_strip_punctuation()

## Next Up

Next, read through the rest of [RegexOne](https://regexone.com/) and put your work in [Lab 3](./lab_3_regexone.ipynb).

**Note:** The solutions below use `my_regex.search()` instead of `my_regex.match()`, because the website counts a match at any position in the string.

In [None]:
# Lesson 1 1/2
pattern = r"\d+"
my_regex = re.compile(pattern)
assert my_regex.search('abc123xyz')
assert my_regex.search('define "123"')
assert my_regex.search('var g = 123')

In [None]:
# Lesson 2
pattern = r"\."
my_regex = re.compile(pattern)
assert my_regex.search('cat.')
assert my_regex.search('896.')
assert my_regex.search('?=+.')
assert not my_regex.search('abc1')

In [None]:
# Lesson 3
pattern = r"[cmf]an"
my_regex = re.compile(pattern)
assert my_regex.search('can')
assert my_regex.search('man')
assert my_regex.search('fan')
assert not my_regex.search('dan')
assert not my_regex.search('ran')
assert not my_regex.search('pan')

In [None]:
# Lesson 4
pattern = r"[^b]og"
my_regex = re.compile(pattern)
assert my_regex.search('hog')
assert my_regex.search('dog')
assert not my_regex.search('bog')

In [None]:
# Lesson 5
pattern = r"[A-C][n-p][a-c]"
my_regex = re.compile(pattern)
assert my_regex.search('Ana')
assert my_regex.search('Bob')
assert my_regex.search('Cpc')
assert not my_regex.search('aax')
assert not my_regex.search('bby')
assert not my_regex.search('ccz')

In [None]:
# Lesson 6
pattern = r"waz{3,}up"
my_regex = re.compile(pattern)
assert my_regex.search('wazzzzzup')
assert my_regex.search('wazzzup')
assert not my_regex.search('wazup')

In [None]:
# Lesson 7
pattern = r"aa+b*c+"
my_regex = re.compile(pattern)
assert my_regex.search('aaaabcc')
assert my_regex.search('aabbbbc')
assert my_regex.search('aacc')
assert not my_regex.search('a')

In [None]:
# Lesson 8
pattern = r"\d+ files? found\?"
my_regex = re.compile(pattern)
assert my_regex.search('1 file found?')
assert my_regex.search('2 files found?')
assert my_regex.search('24 files found?')
assert not my_regex.search('No files found.')

In [None]:
# Lesson 9
pattern = r"\d\.\s+abc"
my_regex = re.compile(pattern)
assert my_regex.search('1. abc')
assert my_regex.search('2.  abc')
assert my_regex.search('3.   abc')
assert not my_regex.search('4.abc')

In [None]:
# Lesson 10
pattern = r"^Mission: successful$"
my_regex = re.compile(pattern)
assert my_regex.search('Mission: successful')
assert not my_regex.search('Last Mission: unsuccessful')
assert not my_regex.search('Next Mission: successful upon capture of target')

In [None]:
# Lesson 11
pattern = r"^(file_.+)\.pdf$"
my_regex = re.compile(pattern)
assert my_regex.search('file_record_transcript.pdf').group(1) == "file_record_transcript"
assert my_regex.search('file_07241999.pdf').group(1) == "file_07241999"
assert not my_regex.search('testfile_fake.pdf.tmp')

In [None]:
# Lesson 12
pattern = r"^(\w{3}\s+(\d{4}))$"
my_regex = re.compile(pattern)
assert my_regex.search('Jan 1987').group(1) == "Jan 1987"
assert my_regex.search('Jan 1987').group(2) == "1987"
assert my_regex.search('May 1969').group(1) == "May 1969"
assert my_regex.search('May 1969').group(2) == "1969"
assert my_regex.search('Aug 2011').group(1) == "Aug 2011"
assert my_regex.search('Aug 2011').group(2) == "2011"

In [None]:
# Lesson 13
pattern = r"(\d+)x(\d+)"
my_regex = re.compile(pattern)
assert my_regex.search('1280x720').group(1) == "1280"
assert my_regex.search('1280x720').group(2) == "720"
assert my_regex.search('1920x1600').group(1) == "1920"
assert my_regex.search('1920x1600').group(2) == "1600"
assert my_regex.search('1024x786').group(1) == "1024"
assert my_regex.search('1024x786').group(2) == "786"

In [None]:
# Lesson 14
pattern = r"I love (cats|dogs)"
my_regex = re.compile(pattern)
assert my_regex.search('I love cats')
assert my_regex.search('I love dogs')
assert not my_regex.search('I love logs')
assert not my_regex.search('I love cogs')

In [None]:
# Lesson 15
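# \b is dropped: inside [...] it means a backspace character, not a word boundary.
# The anchors and + make the asserts meaningful; with * the pattern also matches the empty string.
pattern = r"^[\w\s.&$#*@!%]+$"
my_regex = re.compile(pattern)
assert my_regex.search('The quick brown fox jumps over the lazy dog.')
assert my_regex.search('There were 614 instances of students getting 90.0% or above.')
assert my_regex.search('The FCC had to censor the network for saying &$#*@!')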

In [None]:
# Exercise 1
pattern = r"^-?[\d\.?,e]*$"
my_regex = re.compile(pattern)
assert my_regex.search('3.14529')
assert my_regex.search('-255.34')
assert my_regex.search('128')
assert my_regex.search('1.9e10')
assert my_regex.search('123,340.00')
assert not my_regex.search('720p')

In [None]:
# Exercise 2
pattern = r"^(1 )?\(?(\d{3})\)?\s?[\d -]*$"
my_regex = re.compile(pattern)
assert '415' in my_regex.findall('415-555-1234')[0]
assert '650' in my_regex.findall('650-555-2345')[0]
assert '416' in my_regex.findall('(416)555-3456')[0]
assert '202' in my_regex.findall('202 555 3456')[0]
assert '403' in my_regex.findall('4035555678')[0]
assert '416' in my_regex.findall('1 416 555 9292')[0]

In [None]:
# Exercise 3
pattern = r"^(\w+(\.?\w+))"
my_regex = re.compile(pattern)
assert 'tom' in my_regex.findall('tom@hogwarts.com')[0]
assert 'tom.riddle' in my_regex.findall('tom.riddle@hogwarts.com')[0]
assert 'tom.riddle' in my_regex.findall('tom.riddle+regexone@hogwarts.com')[0]
assert 'tom' in my_regex.findall('tom@hogwarts.eu.com')[0]
assert 'potter' in my_regex.findall('potter@hogwarts.com')[0]
assert 'harry' in my_regex.findall('harry@hogwarts.com')[0]
assert 'hermione' in my_regex.findall('hermione+regexone@hogwarts.com')[0]

In [None]:
# Exercise 4
pattern = r"^<(\w+)"
my_regex = re.compile(pattern)
assert 'a' in my_regex.findall('<a>This is a link</a>')[0]
assert 'a' in my_regex.findall("<a href='https://regexone.com'>Link</a>")[0]
assert 'div' in my_regex.findall("<div class='test_style'>Test</div>")[0]
assert 'div' in my_regex.findall('<div>Hello <span>world</span></div>')[0]

In [None]:
# Exercise 5
pattern = r"(\w+)\.(jpg|png|gif)$"  # escape the dot so it matches only a literal "."
my_regex = re.compile(pattern)
assert not my_regex.findall('.bash_profile')
assert not my_regex.findall('workspace.doc')
assert 'img0912' in my_regex.findall('img0912.jpg')[0]
assert 'jpg' in my_regex.findall("img0912.jpg")[0]
assert 'updated_img0912' in my_regex.findall("updated_img0912.png")[0]
assert 'png' in my_regex.findall('updated_img0912.png')[0]
assert not my_regex.findall('documentation.html')
assert 'favicon' in my_regex.findall('favicon.gif')[0]
assert 'gif' in my_regex.findall('favicon.gif')[0]
assert not my_regex.findall('img0912.jpg.tmp')
assert not my_regex.findall('access.lock')

In [None]:
# Exercise 6
pattern = r"^\s*(.*)\s*$"
my_regex = re.compile(pattern)
assert 'The quick brown fox...' in my_regex.findall('				The quick brown fox...')[0]
assert 'jumps over the lazy dog.' in my_regex.findall(" jumps over the lazy dog.")[0]

In [None]:
# Exercise 7
pattern = r"(\w+)\(([\w\.]+):(\d+)"
my_regex = re.compile(pattern)
assert not my_regex.findall('	W/dalvikvm( 1553): threadid=1: uncaught exception')
assert not my_regex.findall('	E/( 1553): FATAL EXCEPTION: main')
assert not my_regex.findall('E/( 1553): java.lang.StringIndexOutOfBoundsException')
assert all (x in my_regex.findall('E/( 1553):   at widget.List.makeView(ListView.java:1727)')[0] for x in ('makeView', 'ListView.java', '1727'))
assert all (x in my_regex.findall('E/( 1553):   at widget.List.fillDown(ListView.java:652)')[0] for x in ('fillDown', 'ListView.java', '652'))
assert all (x in my_regex.findall('E/( 1553):   at widget.List.fillFrom(ListView.java:709)')[0] for x in ('fillFrom', 'ListView.java', '709'))

In [None]:
# Exercise 8
pattern = r"(\w+)://([\w+\.-]*)(:(\d+))?"
my_regex = re.compile(pattern)
assert all (x in my_regex.findall('ftp://file_server.com:21/top_secret/life_changing_plans.pdf')[0] for x in ('ftp', 'file_server.com', '21'))
assert all (x in my_regex.findall('https://regexone.com/lesson/introduction#section')[0] for x in ('https', 'regexone.com'))
assert all (x in my_regex.findall('	file://localhost:4040/zip_file')[0] for x in ('file', 'localhost', '4040'))
assert all (x in my_regex.findall('https://s3cur3-server.com:9999/')[0] for x in ('https', 's3cur3-server.com', '9999'))
assert all (x in my_regex.findall('market://search/angry%20birds')[0] for x in ('market', 'search'))