
# Lab 8: Regular Expressions and Strings
## March 22nd, 2022

In [1]:
require(tidyverse)
require(stringr)

Loading required package: tidyverse

“running command 'timedatectl' had status 1”
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.8
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



# 1. Regular Expressions

Regular expressions (regex) are a way of describing **patterns** in text. In practice, they are used to search for (and/or replace) substrings. They can be tricky:

`Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.` - Jamie Zawinski

As a concrete example, suppose that we want to find and extract all the email addresses in a document. Do we manually search for all email addresses? Or do email addresses follow an abstract pattern that we can capture automatically?
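
As a rough preview of where this is going, here is a minimal sketch of the email example. The sample text and addresses are made up, and the pattern is a deliberate simplification (real email addresses allow more than it covers); the individual pieces are defined in the sections that follow.

```r
library(stringr)

# Hypothetical sample text; the addresses are invented for illustration.
doc <- "Contact alice@example.com or bob.smith@stats.example.edu for details."

# Simplified pattern: a run of word characters, dots, or hyphens, then "@",
# then a domain made of one or more dot-separated parts.
email_pattern <- "[\\w.-]+@[\\w-]+(\\.[\\w-]+)+"

emails <- str_extract_all(doc, email_pattern)[[1]]
emails
# "alice@example.com" "bob.smith@stats.example.edu"
```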

## 1.1 Special characters

Regex takes advantage of several reserved characters that are used for special functions. 

`. \ | ( ) [ ] ^ $ { } * + ?`

We'll go through a bunch of definitions. These will make more sense with some concrete examples later on. For more practice, you can check out https://regexone.com/.

### Character classes

* `.` matches anything (wildcard)
* `[aeiou]` matches a single character in the set provided
* `[^aeiou]` matches a single character NOT in the set
* `[a-e]` matches a single character in a range, equivalent to `[abcde]`
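
The character classes above can be tried out with `str_detect`, which is covered in more detail later in this lab. The test vector `x` is a made-up example.

```r
library(stringr)

x <- c("cat", "cot", "cut", "c9t", "ct")

str_detect(x, "c.t")         # any single character between c and t
# TRUE TRUE TRUE TRUE FALSE

str_detect(x, "c[aeiou]t")   # a vowel between c and t
# TRUE TRUE TRUE FALSE FALSE

str_detect(x, "c[^aeiou]t")  # any single character except a vowel
# FALSE FALSE FALSE TRUE FALSE

str_detect(x, "c[a-e]t")     # one of a, b, c, d, e
# TRUE FALSE FALSE FALSE FALSE
```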

#### Shorthand

* `\w` matches a "word" character: a letter, a digit, or an underscore. For ASCII text this is equivalent to `[a-zA-Z0-9_]`
* `\s` matches any whitespace, including tabs and newlines
* `\d` matches digits, equivalent to `[0-9]`
* `\W`, `\S`, and `\D` match the opposite of the lower-case versions
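
A small sketch of the shorthand classes in action, using a made-up string `s`. Backslashes are doubled inside R strings, which is explained later in this lab.

```r
library(stringr)

s <- "Lab 8\tRoom_B"   # contains one space and one tab

str_extract_all(s, "\\d")[[1]]   # digits
# "8"

str_extract_all(s, "\\s")[[1]]   # whitespace: the space and the tab
# " "  "\t"

str_extract_all(s, "\\W")[[1]]   # non-word characters: only the whitespace here
# " "  "\t"

str_detect(c("_", "-"), "\\w")   # the underscore counts as a word character, the hyphen does not
# TRUE FALSE
```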

#### Special characters

* Note that `\t` and `\n` match the tab and newline characters. 
* If you want the literal version of a reserved character (e.g., to match the period `.`), escape it with a backslash `\`. For example, `\.` matches a period, and `[\.\\\|]` matches a period, a backslash, or a pipe.


### Grouping

* `()` are used to group patterns together. A group can be combined with any of the operators below. Groups also let you extract individual portions of a match, which we will see later.
* `\1`, `\2`, etc. refer to the text matched by the first, second, etc. group.
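
A minimal sketch of groups and backreferences, using made-up inputs. `str_match` is a stringr function not otherwise used in this lab; it returns the full match in the first column followed by one column per group.

```r
library(stringr)

x <- c("banana", "coconut", "apple")

# (..) captures any two characters; \\1 then requires the same two characters again
repeated_pair <- str_detect(x, "(..)\\1")
repeated_pair
# TRUE TRUE FALSE   ("anan" in banana, "coco" in coconut)

# Each parenthesized group becomes its own column in the result
m <- str_match("2022-03-22", "(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d)")
m[1, 1]  # "2022-03-22" (the whole match)
m[1, 2]  # "2022"       (first group)
m[1, 4]  # "22"         (third group)
```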

### Operators

* `|` is the OR operator and allows matches of either side
* `{}` describes how many times the preceding character or group must occur:
  * `{m}` must occur exactly `m` times
  * `{m,n}` must occur between `m` and `n` times, inclusive
  * `{m,}` must occur at least `m` times
* `*` means the preceding character or group can appear zero or more times, equivalent to `{0,}`
* `+` means the preceding character or group must appear one or more times, equivalent to `{1,}`
* `?` means the preceding character or group can appear zero or one time, equivalent to `{0,1}`
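
The operators above can be compared side by side on a small, made-up vector. Each pattern only changes how many `a`s are allowed between `c` and `t`.

```r
library(stringr)

x <- c("ct", "cat", "caat", "caaat", "dog")

str_detect(x, "ca*t")      # zero or more a's
# TRUE TRUE TRUE TRUE FALSE

str_detect(x, "ca+t")      # one or more a's
# FALSE TRUE TRUE TRUE FALSE

str_detect(x, "ca?t")      # zero or one a
# TRUE TRUE FALSE FALSE FALSE

str_detect(x, "ca{2}t")    # exactly two a's
# FALSE FALSE TRUE FALSE FALSE

str_detect(x, "ca{2,}t")   # at least two a's
# FALSE FALSE TRUE TRUE FALSE

str_detect(x, "cat|dog")   # either alternative
# FALSE TRUE FALSE FALSE TRUE
```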

### Anchors

* `^` matches the start of a string (or line)
* `$` matches the end of a string (or line)
* `\b` matches a word "boundary"
* `\B` matches any position that is not a word boundary
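
A short sketch of how the anchors change what counts as a match, again on a made-up vector:

```r
library(stringr)

x <- c("cat", "concat", "catalog", "a cat nap")

str_detect(x, "^cat")        # starts with "cat"
# TRUE FALSE TRUE FALSE

str_detect(x, "cat$")        # ends with "cat"
# TRUE TRUE FALSE FALSE

str_detect(x, "^cat$")       # is exactly "cat"
# TRUE FALSE FALSE FALSE

str_detect(x, "\\bcat\\b")   # "cat" as a whole word
# TRUE FALSE FALSE TRUE

str_detect(x, "\\Bcat")      # "cat" that does not start a word
# FALSE TRUE FALSE FALSE
```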

# 2. Handling Strings in R

You're already familiar with the basics of strings. Note that there are some special characters. The most commonly used ones are `\n` and `\t` for newlines and tabs, respectively.

Also note that some reserved characters do special things in strings. If you want to include them literally, you must escape them with a backslash `\`.

In [4]:
double_quote = "hi\"bye"
backslash_ex = "a\\tb"
backslash_ex2 = "a\tb"

Running `print(double_quote)` shows the escaped representation of the string. To see the string's actual contents, use the `cat` function instead (`cat` stands for "concatenate and print").

In [None]:
cat(double_quote)

hi"bye

In [None]:
cat(backslash_ex)

a\tb

In [None]:
cat(backslash_ex2)

a	b

You’ll also sometimes see strings like `"\u00b5"` ($\mu$). This is called Unicode escaping, and it is a way of writing non-ASCII characters that works on all platforms.

In [None]:
cat("\u00b5")

µ

In [None]:
cat("\u00e7 (c-cedilla) is a Latin script letter, used in the Albanian, Azerbaijani, Manx, Tatar, Turkish, Turkmen, Kurdish, Zazaki, and Romance alphabets." )


ç (c-cedilla) is a Latin script letter, used in the Albanian, Azerbaijani, Manx, Tatar, Turkish, Turkmen, Kurdish, Zazaki, and Romance alphabets.

In [None]:
cat("You can even use emojis like: \U0001f637")

You can even use emojis like: 😷

### String Functions

In [10]:
ne_states <- c("Connecticut", "Maine", "Massachusetts", "Vermont", "New Hampshire", "Rhode Island")

In [17]:
# measure the lengths of strings within a chr vector
str_length(ne_states)

In [15]:
# string analog of `c`
str_c("lions", "tigers", "bears", "oh my!")

In [18]:
cat(str_c('Istanbul', 'Turkey\n', sep=', '))
cat(str_c('Ann Arbor', 'MI', "USA", sep=', '))

Istanbul, Turkey
Ann Arbor, MI, USA

In [21]:
# what happens when we combine str_c with c?
vec <- c("a", "b", "c")
str_c("d", vec)

In [22]:
x = c('abc', '123', NA)
str_c('|-', x, '-|')

In [24]:
str_c('|-', str_replace_na(x, "UNK"), '-|') # finds NA and replaces with 'UNK'

To collapse a vector of strings, use the `collapse` argument to `str_c`:

In [25]:
str_c(ne_states, collapse=", ")

### Subsetting Strings

In [None]:
ne_states = c("Connecticut", "Maine", "Massachusetts", "Vermont", "New Hampshire", "Rhode Island")
ne_states

In [29]:
# selects first 3 characters in each
str_sub(ne_states, 1, 3)

In [None]:
# selects last 3 characters
str_sub(ne_states, -3, -1)

In [30]:
str_sub(ne_states, 1, 7)  # Maine is 5 letters, but this worked still

In [31]:
# select and mutate substrings
str_sub(ne_states, 1, 1) <- str_to_lower(str_sub(ne_states, 1, 1))
# this alters the original string!
ne_states

In [None]:
str_sub(ne_states, -3, -1) <- str_to_upper(str_sub(ne_states, -3, -1))
ne_states

### String Replacement
We have seen this in previous labs, but a quick review...

In [34]:
str_replace("dragonfly", "fly", "")

In [35]:
str_replace_all("banana", "a", "o")

# 3. RegEx in R

In `R`, we will use `str_detect` and `str_extract` (or `str_extract_all`) to play with regular expressions.

In [38]:
x = c("apple", "banana", "pear", "orange")

In [39]:
str_detect(x, "an")

In [40]:
str_extract(x, "an")

In [37]:
baseball = "According to Baseball Reference’s wins above average, The Red Sox had the best 
outfield in baseball— one-tenth of a win ahead of the Milwaukee Brewers, 11.5 to 11.4. And 
that’s despite, I’d argue, the two best position players in the NL this year (Christian 
Yelich and Lorenzo Cain) being Brewers outfielders. More importantly, the distance from 
Boston and Milwaukee to the third-place Yankees is about five wins. Two-thirds of the Los 
Angeles Angels’ outfield is Mike Trout (the best player in baseball) and Justin Upton (a 
four-time All-Star who hit 30 home runs and posted a 122 OPS+ and .348 wOba this year), 
and in order to get to 11.5 WAA, the Angels’ outfield would have had to replace right 
fielder Kole Calhoun with one of the three best outfielders in baseball this year by WAA."

#### 1 Write a regex that captures all capitalized words.

In [43]:
str_extract_all(baseball, "\\b[A-Z][a-z]+")
# think for a second: why do we need two backslashes to begin with?

Breaking down the above Regex:
- `\b` looks for a word boundary (not just the beginning of the text snippet!)
- `[A-Z]` matches a single capitalized letter
- `[a-z]` matches a single lowercase letter
- `+` applies to the preceding `[a-z]`, so we match one or more lowercase letters
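
To see what `\b` contributes, compare the pattern with and without it. The short string below is a made-up sample that borrows the word "wOba" from the passage; without the boundary, the pattern also picks up a capital letter in the middle of a word.

```r
library(stringr)

s <- "Mike Trout posted a .348 wOba"

with_boundary <- str_extract_all(s, "\\b[A-Z][a-z]+")[[1]]
with_boundary
# "Mike" "Trout"

without_boundary <- str_extract_all(s, "[A-Z][a-z]+")[[1]]
without_boundary
# "Mike" "Trout" "Oba"
```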

#### 2 Write a regex that captures all the numbers

In [45]:
str_extract_all(baseball, "\\.?\\d+\\.?\\d*") 
# exercise: break down what the component parts of this Regex are doing

#### 3 Write a regex that captures all hyphenated words

In [46]:
str_extract_all(baseball, "\\w+-\\w+") # \w stands for any word character (letter, digit, or underscore)

#### 4 Write a regex that captures all words with two consecutive vowels

In [None]:
str_extract_all(baseball, "\\w*[aeiou]{2}\\w*")

#### 5 Write a regex that captures all words with a repeated letter

In [56]:
str_extract_all(baseball, "\\w*([a-zA-Z])\\1\\w*")
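# a close variant: str_extract_all(baseball, "\\w*(\\w)\\1\\w*")
# (not identical: \w also matches digits and underscores, so it picks up repeated digits such as "11")
# the \1 is a backreference that matches whatever the (first and only) () group matched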

#### 6 Write a regex that matches "this" and "the" but not "third"

In [None]:
str_extract_all(baseball, "th(e|is)")
str_extract_all(baseball, "(t|T)h(e|is)") # including capitalized T

Note that any time you want to use a backslash `\` in a regex pattern in `R`, you'll need to use a double backslash `\\` instead. This is because `R` has its own layer of string processing that also uses backslashes to escape reserved characters. So you need to tell `R` to use a literal backslash so that it passes a backslash to the regex function.

In [None]:
naive = "a.c"
dot = "a\\.c"

cat(naive)
str_detect(c("abc", "a.c", "bef"), naive) # matches anything a-blank-c because . is a wildcard

cat(dot)
str_detect(c("abc", "a.c", "bef"), dot)

a.c

a\.c

Question: How many backslashes do you need to create a regex pattern that matches a literal backslash when using `R`?

In [57]:
x = "a\\b"
cat(x)

a\b

In [68]:
str_extract(x, "\\\\")
# remember, the R parser interprets each "\\" as a single '\'.
# hence the string "\\\\" becomes the regex \\, which matches one literal backslash.
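
The two layers of escaping can be checked directly. This sketch also uses raw string syntax, `r"(...)"`, which assumes R 4.0.0 or later; inside a raw string, backslashes are not processed by the R parser, so only the regex layer of escaping remains.

```r
library(stringr)

x <- "a\\b"              # the three characters a, \, b

pattern <- "\\\\"        # four backslashes typed, two stored
nchar(pattern)           # 2
writeLines(pattern)      # prints \\ which is the regex for one literal backslash

hit <- str_extract(x, pattern)
nchar(hit)               # 1: a single backslash was matched

# Raw string (assumes R >= 4.0.0): type the regex exactly as the regex engine sees it
raw_pattern <- r"(\\)"
identical(raw_pattern, pattern)   # TRUE
```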

## Exercises

Use `stringr::words` to do the exercises.

In [None]:
words

### 1. Which words start with `y`? (Freebie)

In [None]:
str_extract(words, "^y\\w*")

In [None]:
na.omit(str_extract(words, "^y\\w*"))

### 2. Which words end with `x`?

### 3. ...are exactly two letters long (don’t use `str_length` here)?

### 4. ...have ten or more letters?

### 5. ...end with `ed`, but not with `eed`?

### 6. ...end with `ing` or `ise`?

### 7. ...end with the same two-letter sequence they start with (e.g. `church`)?

### 8. Try to match the valid dates in `dates` below (the first row) without matching the invalid ones (the remaining six rows).
Hint: Start by writing a pattern that matches all the entries. Then try to refine your pattern to omit the invalid dates.

In [None]:
dates = c('2012-05-13', '2014-12-31', '1991-06-14', '1991/06/14',
          '200a-05-13',  # invalid year
          '2014-15-20',  # invalid month
          '2014-00-20',  # invalid month
          '2016-04-35',  # invalid day
          '2014-12-00',  # invalid day
          '2013/03-25')  # non-matching separators