## Regex Review (credits to Roger Fan)

In [155]:
options(repr.plot.width=6, repr.plot.height=4)

library(tidyverse)
library(stringr)
library(lubridate)

## Regular Expressions Review

Regular expressions (regex) are a way to describe patterns in text and are used to search for and match certain patterns in strings.

`Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.` - Jamie Zawinski

For instance, suppose we want to automatically find and extract every email address in a document. How might we do that?
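As a rough preview, here is a minimal sketch of one way to do it with `str_extract_all`. The sample text and addresses are made up, and the pattern is deliberately simplistic (real email addresses allow many more forms). The individual pieces of the pattern are explained in the sections that follow.

```r
library(stringr)

# Hypothetical sample text; the addresses are invented for illustration.
doc = "Contact alice@example.com or bob.smith@stats.example.org for details."

# A simplified pattern (an assumption, not a full email validator):
#   one or more word characters, dots, or hyphens,
#   then "@", then a domain made of labels separated by dots.
email_pattern = "[\\w.\\-]+@[\\w\\-]+(\\.[\\w\\-]+)+"

# str_extract_all() returns a list with one element per input string.
emails = str_extract_all(doc, email_pattern)[[1]]
emails
```

Note that the trailing period of the sentence is not picked up, because every dot in the domain part must be followed by at least one more character.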

### Special characters

Regex takes advantage of several reserved characters that are used for special functions. 

`. \ | ( ) [ ] ^ $ { } * + ?`

### Character classes

* `.` matches any single character except a newline (wildcard)
* `[aeiou]` matches a single character in the set provided
* `[^aeiou]` matches a single character NOT in the set
* `[a-e]` matches a range, equivalent to `[abcde]`
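The character classes above can be tried out with `str_detect` and `str_extract`. This is a small sketch on a made-up vector of words; the variable names are only for illustration.

```r
library(stringr)

words = c("cat", "cot", "cut", "ct", "dog")

# "." stands for exactly one character, so "ct" does not match.
any_middle = str_detect(words, "c.t")

# Only a vowel is allowed between "c" and "t".
vowel_middle = str_detect(words, "c[aeiou]t")

# Any single character except "a" between "c" and "t".
not_a_middle = str_detect(words, "c[^a]t")

# First character of each word that falls in the range a through d.
first_a_to_d = str_extract(words, "[a-d]")

any_middle
vowel_middle
not_a_middle
first_a_to_d
```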

#### Shorthand

* `\w` matches a "word" character, equivalent to `[a-zA-Z0-9_]` for ASCII text
* `\s` matches any whitespace, including tabs and newlines
* `\d` matches digits, equivalent to `[0-9]` for ASCII text
* `\W`, `\S`, and `\D` match the opposite of the lower-case versions
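A short sketch of the shorthand classes on a made-up string. Because the patterns are written as R strings, each backslash is doubled (`"\\d"` is the regex `\d`).

```r
library(stringr)

# Hypothetical example string containing spaces and one tab.
s = "Room 42, Floor 3\tBuilding_B"

# Runs of digits.
digits = str_extract_all(s, "\\d+")[[1]]

# Runs of word characters (letters, digits, underscore).
tokens = str_extract_all(s, "\\w+")[[1]]

# Number of whitespace characters (three spaces and one tab).
n_space = str_count(s, "\\s")

# \D is the opposite of \d: remove everything that is not a digit.
only_digits = str_replace_all("555-1234", "\\D", "")

digits
tokens
n_space
only_digits
```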

#### Special characters

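* Note that `\t` and `\n` match the tab and newline characters.
* If you want the "literal" versions of any of the reserved characters, you will need to escape them with a backslash `\`, e.g. `[\.\\\|]` matches a period, a backslash, or a pipe. Inside an R string the backslash itself must also be escaped, so the regex `\.` is typed as `"\\."`.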
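The effect of escaping is easiest to see by comparing an escaped and an unescaped pattern. The sketch below uses a made-up vector; `fixed()` is stringr's helper for matching a pattern literally rather than as a regex.

```r
library(stringr)

x = c("3.50", "3x50", "a|b")

# Unescaped "." is a wildcard, so "3x50" matches too.
wildcard_dot = str_detect(x, "3.50")

# Escaped "\\." matches only a literal period.
literal_dot = str_detect(x, "3\\.50")

# A literal pipe must be escaped as well.
literal_pipe = str_detect(x, "\\|")

# fixed() switches off regex interpretation entirely.
fixed_dot = str_detect(x, fixed("."))

# Matching one literal backslash takes four backslashes in R source:
# the R string "\\\\" is the regex \\, which matches a single \.
has_backslash = str_detect("a\\b", "\\\\")

wildcard_dot
literal_dot
literal_pipe
fixed_dot
has_backslash
```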


### Grouping

* `()` groups patterns together. A group can be combined with any of the operators below, and can also be used to extract parts of a match individually, as we will see later.
* `\1`, `\2`, etc. refer to the text matched by the first, second, etc. group.
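A brief sketch of both uses of groups, on made-up inputs: pulling out the grouped pieces with `str_match`, and referring back to a group with `\1` and `\2`, either inside the pattern or in a replacement string.

```r
library(stringr)

# str_match() returns a character matrix: column 1 is the whole match,
# the remaining columns hold the text captured by each group.
parts = str_match("2021-03-15", "(\\d{4})-(\\d{2})-(\\d{2})")
year  = parts[1, 2]
month = parts[1, 3]
day   = parts[1, 4]

# Inside a pattern, \1 must match the same text the first group matched,
# so "(.)\\1" finds any doubled character.
has_double = str_detect(c("letter", "banana", "cat"), "(.)\\1")

# In a replacement, groups can be reordered.
swapped = str_replace("Fan, Roger", "(\\w+), (\\w+)", "\\2 \\1")

parts
has_double
swapped
```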

### Operators

* `|` is the OR operator and allows matches of either side
* `{}` describes how many times the preceding character or group must occur:
  * `{m}` must occur exactly `m` times
  * `{m,n}` must occur between `m` and `n` times, inclusive
  * `{m,}` must occur at least `m` times
* `*` means the preceding character or group can appear zero or more times, equivalent to `{0,}`
* `+` means the preceding character or group must appear one or more times, equivalent to `{1,}`
* `?` means the preceding character or group can appear zero or one time, equivalent to `{0,1}`
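The operators above can be compared side by side on a small made-up vector in which only the number of `b`s varies.

```r
library(stringr)

x = c("ac", "abc", "abbc", "abbbc")

star       = str_detect(x, "ab*c")      # zero or more b
plus       = str_detect(x, "ab+c")      # one or more b
optional   = str_detect(x, "ab?c")      # zero or one b
exactly2   = str_detect(x, "ab{2}c")    # exactly two b
at_least2  = str_detect(x, "ab{2,}c")   # two or more b
one_to_two = str_detect(x, "ab{1,2}c")  # one or two b

# "|" matches either alternative.
either = str_detect(c("cat", "dog", "cow"), "cat|dog")

# A quantifier after a group applies to the whole group.
repeated_group = str_detect(c("abab", "ab"), "(ab){2}")

rbind(star, plus, optional, exactly2, at_least2, one_to_two)
either
repeated_group
```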

### Anchors

* `^` matches the start of a string (or line)
* `$` matches the end of a string (or line)
* `\b` matches a word "boundary"
* `\B` matches any position that is not a word boundary
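Anchors match positions rather than characters. The sketch below, on a made-up vector, shows how the same word is accepted or rejected depending on where it sits in the string.

```r
library(stringr)

x = c("apple pie", "pineapple", "apple", "snapple juice")

starts_with = str_detect(x, "^apple")       # "apple" at the start
ends_with   = str_detect(x, "apple$")       # "apple" at the end
whole       = str_detect(x, "^apple$")      # the entire string is "apple"
as_word     = str_detect(x, "\\bapple\\b")  # "apple" as a standalone word
inside_word = str_detect(x, "\\Bapple")     # "apple" preceded by a word character

rbind(starts_with, ends_with, whole, as_word, inside_word)
```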

### String Functions

See `https://stringr.tidyverse.org/reference/index.html` for a more complete list of string functions and their documentation.

Recall that any function whose documentation lists a `pattern` argument treats that pattern as a regular expression by default. These include `str_detect`, `str_replace`, `str_count`, etc.
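As a quick sketch of these pattern-based functions, here they are applied to the New England state names. The particular patterns are chosen only for illustration.

```r
library(stringr)

ne_states = c("Connecticut", "Maine", "Massachusetts", "Vermont", "New Hampshire", "Rhode Island")

# Which names start with a capital M?
starts_m = str_detect(ne_states, "^M")

# How many lower-case vowels does each name contain?
n_vowels = str_count(ne_states, "[aeiou]")

# str_replace() changes only the first match in each string;
# str_replace_all() changes every match.
first_vowel = str_replace(ne_states, "[aeiou]", "_")
all_vowels  = str_replace_all(ne_states, "[aeiou]", "_")

starts_m
n_vowels
first_vowel
all_vowels
```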


In [129]:
ne_states = c('Connecticut', 'Maine', 'Massachusetts', 'Vermont', 'New Hampshire', 'Rhode Island')

str_length(ne_states)

In [130]:
str_c('Seoul', 'Korea', sep=', ')
# base R equivalent: paste('Seoul', 'Korea', sep=', ')

In [132]:
x = c('abc', '123', NA)

# missing values propagate: the result for the NA element is NA
str_c('|-', x, '-|')

In [133]:
# str_replace_na() converts NA to the string "NA" before concatenating
str_c('|-', str_replace_na(x), '-|')

To collapse a vector of strings, use the `collapse` argument to `str_c`:

In [134]:
str_c(ne_states, collapse=", ")

### Subsetting Strings

In [136]:
ne_states = c("Connecticut", "Maine", "Massachusetts", "Vermont", "New Hampshire", "Rhode Island")

str_sub(ne_states, 1, 3)

In [71]:
str_sub(ne_states, -3, -1)

In [137]:
str_sub(ne_states, 1, 7)  # notice that this still works for Maine

In [73]:
# str_sub() can also appear on the left of an assignment to modify the strings
str_sub(ne_states, 1, 1) = str_to_lower(str_sub(ne_states, 1, 1))
ne_states

In [74]:
str_sub(ne_states, -3, -1) = str_to_upper(str_sub(ne_states, -3, -1))
ne_states

In [156]:
# triple every vowel: \\1 in the replacement refers back to the captured group
str_replace_all('This is a sentence.', '([aeiouAEIOU])', '\\1\\1\\1')

In [157]:
# reverse each run of three consecutive vowels by reordering the three groups
str_replace_all('beauty obvious previous quiet serious various', '([aeiou])([aeiou])([aeiou])', '\\3\\2\\1')