In [2]:
library(tidyverse)
library(stringr)
# remotes::install_github("bradleyboehmke/harrypotter")
library(harrypotter)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


## 🤔 Quiz

This command loads a list of 25,000 of the most popular words in the English language:

In [None]:
pop_words <- scan(url('https://raw.githubusercontent.com/dolph/dictionary/master/popular.txt'), character())

How many words *contain* `qu` but don't *start* with `qu`? (Hint: `str_sub` and `str_detect`):

<ol style="list-style-type: upper-alpha;">
    <li>9.3</li>
    <li>93</li>
    <li>193</li>
    <li>293</li>
    <li>393</li>
</ol>


In [37]:
# contain qu not start with qu

# Lecture 13: Regular expressions

<div style="border: 1px double black; padding: 10px; margin: 10px">

**After today's lecture you will:**
* Understand basic regular expressions.
* Use regular expressions to extract data from text.
</div>

These notes correspond to Chapter 16 of your book.


## Regular expressions
Regular expressions (regex, regexps) are a compact language for describing patterns in strings. They have a steep learning curve but are very powerful for working with text data. In this class we will focus only on the basics of regexps. A good tool for learning regexps is [regex101](https://regex101.com/), which lets you interactively edit and debug your regular expressions.

> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. 
>
> — Jamie Zawinski (famous nerd)

In these slides we will use the command `str_view` to understand how regular expressions work. To get `str_view` working on Colab you need to install the latest versions of `tidyverse` and `htmlwidgets`:

In [45]:
#  install.packages(c("tidyverse", "htmlwidgets")) if running on colab

The most basic regular expression is a plain string. It will match if the other string contains it as a substring.

In [13]:
x = c("apple", "banana", "pear") %>% print
str_view(x, pattern = "an")

[1] "apple"  "banana" "pear"  


[2] │ b<an><an>a

Here `str_view` has matched our regexp (`"an"`) inside of the second string `banana` of the vector `x`.
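
The same check can be done programmatically rather than visually. The following is a minimal sketch using `str_detect()` and `str_count()` from stringr on the same toy vector; the variable names `has_an` and `n_an` are made up for illustration.

```r
library(stringr)

x <- c("apple", "banana", "pear")

# A plain-string pattern matches wherever it occurs as a substring.
has_an <- str_detect(x, "an")
has_an

# Count how many times the pattern occurs in each string.
n_an <- str_count(x, "an")
n_an
```

`str_detect()` answers "does it match at all?" for each element, while `str_count()` reports how many separate matches `str_view` would highlight.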

In [11]:
fruit

In [12]:
str_view(fruit, 'berry')

 [6] │ bil<berry>
 [7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
[32] │ goji <berry>
[33] │ goose<berry>
[38] │ huckle<berry>
[50] │ mul<berry>
[70] │ rasp<berry>
[73] │ salal <berry>
[76] │ straw<berry>

### Wildcards
Our first non-trivial regular expression will use a wildcard: `.`. Used inside of a regular expression, the period matches any single character:

In [3]:
str_view("else every eele etcetera", "e..e")

[1] │ <else> every <eele> <etce>tera

If we want to "extract" the first match we can use `str_extract()` instead:

In [19]:
str_extract("else every eele etcetera", "e..e")
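
As a small sketch of what `str_extract()` returns, the example below runs the wildcard pattern on two strings. The second string is a made-up example that contains no match, to show that the result is `NA` in that case.

```r
library(stringr)

s <- c("else every eele etcetera", "no match in this string")

# str_extract() returns the first match per string, or NA if there is none.
first_match <- str_extract(s, "e..e")
first_match
```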

Now let's return to another example from last lecture: finding all capitalized words.

### Exercise
What is the first string that matches the pattern `H<any three characters>y` in Philosopher's Stone?

In [11]:
# H<3>y

### Character classes
Instead of matching anything using `.`, we often want to match a class of things: words, numbers, spaces, etc.
A "character class" is a special pattern that matches any one character from a set of characters. There are three built-in character classes you should know, plus one related pattern, `\b`, which matches a position rather than a character:
- `\w`: match any word character (a letter, digit, or underscore).
- `\s`: match any whitespace character.
- `\d`: match any digit.
- `\b`: match a "word boundary" (more on this in a moment).



`\w` matches any word character:

In [31]:
str_view("this is a word", "\\w")

[1] │ <t><h><i><s> <i><s> <a> <w><o><r><d>

Note the additional level of escaping needed here: "\\w" gets parsed by R into the string `\w`:

In [46]:
writeLines("\\w")

\w


The `\w` is then parsed again by the regular expression.

The string "\w" is not valid in R, because there is no escape code "\w":

```
> "\w"
Error: '\w' is an unrecognized escape in character string starting ""\w"
```
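
If the double backslashes become hard to read, one option (assuming R 4.0.0 or later, which is not stated in these notes) is a raw string, written `r"(...)"`, in which backslashes are taken literally. A minimal sketch comparing the two spellings:

```r
# In a regular string literal, each backslash must itself be escaped.
escaped <- "\\w"

# In a raw string (R >= 4.0.0), the backslash is literal.
raw_str <- r"(\w)"

# Both spellings produce the same two-character string: a backslash and a w.
identical(escaped, raw_str)
nchar(escaped)
```

The rest of these notes keep the `"\\w"` style, which works in every version of R.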

`\d` will match any digit:

In [32]:
str_view(c("number1", "two", "3hree"), "\\d")

[1] │ number<1>
[3] │ <3>hree

Similarly, `\s` will match whitespace (spaces, tabs and newlines):

In [41]:
y = c("spa ce", "hello\tworld", "multi\nline")
writeLines(y)
str_view(y, "\\s")

spa ce
hello	world
multi
line


[1] │ spa< >ce
[2] │ hello<{\t}>world
[3] │ multi<
    │ >line

You can also create your own character class using square brackets: `[abc]` will match *one of* `a`, `b`, or `c`. In other words, a character class matches exactly one character unless you add a quantifier.

In [33]:
str_view(fruit, '[be]a')  # Match either 'b' or 'e' followed by a

 [4] │ <ba>nana
[12] │ br<ea>dfruit
[58] │ p<ea>ch
[59] │ p<ea>r
[62] │ pin<ea>pple

We can use character classes to match the first capital letter of a capitalized word:

In [42]:
str_view(c("These", "are", "some Capitalized words"),
         "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]")

[1] │ <T>hese
[3] │ some <C>apitalized words

We do not need to go to all the trouble of typing each capital letter. We can use the shortcut `[A-Z]` instead.

In [43]:
str_view(c("These", "are", "some Capitalized words"), "[A-Z]")

[1] │ <T>hese
[3] │ some <C>apitalized words

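Ranges work for lowercase letters and digits too, and two classes can be placed side by side. The sketch below uses a made-up vector `ids` to show a few combinations; it is an illustration, not part of the original exercises.

```r
library(stringr)

ids <- c("abc", "ABC", "a1", "42")

# [a-z] matches one lowercase letter; [0-9] matches one digit.
has_lower <- str_detect(ids, "[a-z]")
has_digit <- str_detect(ids, "[0-9]")

# Two classes in a row: any letter immediately followed by a digit.
letter_then_digit <- str_detect(ids, "[A-Za-z][0-9]")

data.frame(ids, has_lower, has_digit, letter_then_digit)
```
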
### Word boundaries
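A final pattern we'll use frequently is `\b`, which stands for "word boundary". Unlike the character classes above, `\b` does not consume a character: it matches the zero-width position at the "edge" of a word, which is why each match below shows up as an empty `<>`: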

In [64]:
str_view(c("Rafael Nadal", "Roger Federer", "Novak Djokovic"), "\\b")

[1] │ <>Rafael<> <>Nadal<>
[2] │ <>Roger<> <>Federer<>
[3] │ <>Novak<> <>Djokovic<>

Every word has a word boundary on either side, so we can use this in combination with other character classes to match certain kinds of words in text.
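
To see what the boundaries buy us, here is a minimal sketch on a made-up vector: the plain pattern `cat` matches inside longer words, while `\bcat\b` only matches `cat` as a whole word.

```r
library(stringr)

s <- c("cat", "concatenate", "the cat sat")

# Plain substring match: "concatenate" contains "cat".
anywhere <- str_detect(s, "cat")

# With word boundaries on both sides, only the standalone word matches.
whole_word <- str_detect(s, "\\bcat\\b")

data.frame(s, anywhere, whole_word)
```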

## 🤔 Quiz

About how many 5-letter words are there in `ch1`?

<ol style="list-style-type: upper-alpha;">
    <li>Less than 100</li>
    <li>100-300</li>
    <li>300-600</li>
    <li>600 or more</li>
</ol>


In [62]:
# 5-letter words

In this exercise, we matched the pattern 

    <word boundary><five word characters><word boundary>.

## Quantifiers
Now we can return to a question that we asked in the previous lecture: how many words are there in `ch1`? We did a crude approximation by counting the number of spaces, but we saw that this double-counted certain words. A better way is to count how many times the following pattern matches:

    <word boundary><any number of word characters><word boundary>

The four quantifiers you should know are listed below. Each one applies to the item immediately before it, which can be a single character, a character class, or (as we will see) a group:
- `?`: match zero or one of the preceding item.
- `+`: match one or more of the preceding item.
- `*`: match zero or more of the preceding item.
- `{x}`: match exactly `x` of the preceding item.
    - `{x,y}`: match between `x` and `y` of the preceding item.
    - `{x,}`: match at least `x` of the preceding item.
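
A quick way to get a feel for the quantifiers is to apply each one to the same letter. The sketch below uses a made-up vector of spellings and attaches each quantifier to the `u`; it is only an illustration and does not touch `ch1`.

```r
library(stringr)

spellings <- c("color", "colour", "colouur")

zero_or_one <- str_detect(spellings, "colou?r")    # 0 or 1 "u"
one_or_more <- str_detect(spellings, "colou+r")    # 1 or more "u"
zero_or_more <- str_detect(spellings, "colou*r")   # any number of "u"
exactly_two <- str_detect(spellings, "colou{2}r")  # exactly 2 "u"

data.frame(spellings, zero_or_one, one_or_more, zero_or_more, exactly_two)
```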

So, to count the number of words using the pattern shown above:

In [46]:
# match any number of characters inside of word boundaries.

## 🤔 Quiz

How many words in `ch1` match the pattern:

    <word boundary><capital letter><at least one lowercase letter><word boundary>

(Example: `Harry.` matches but ` I ` does not.)

<ol style="list-style-type: upper-alpha;">
    <li>Less than 100</li>
    <li>100-300</li>
    <li>300-600</li>
    <li>600 or more</li>
</ol>


In [43]:
# cap words

Let's return to an example from last lecture: find all the character names in Harry Potter. By matching all words that start with capital letters, we're off to a good start. But we pick up the beginning word of any sentence, resulting in a lot of false matches. Let's use quantifiers to restrict to only longer words, say capitalized words that are at least six characters long.

Our pattern becomes:

    <word boundary><Capital letter><at least five lowercase letters><word boundary>

In [25]:
# all cap words that have at least six characters

## Grouping
In the previous exercise we found that "Professor" is one of the most common capitalized words. Is there a character named Professor, or is it just a title? Now let us try to match one or more capitalized words in a row. We can accomplish this by creating a *group*, and then applying a quantifier to it. 

To create a group, I surround a part of my regexp with parentheses:

In [None]:
str_view("this will be grouped", "[a-z]+ ?")
str_view("this will be grouped", "([a-z]+ ?)")

The parentheses do not change what the regular expression matches (but they are doing something else, which we will discuss shortly). But now I can apply a quantifier to the whole group:

In [None]:
str_view("this will be grouped", "([a-z]+ ?)+")
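
The difference between quantifying a single character and quantifying a group is easiest to see side by side. In this sketch (the string is made up), `ab+` repeats only the `b`, while `(ab)+` repeats the whole pair:

```r
library(stringr)

s <- "ababab abbb"

# Without a group, + applies only to the "b".
b_repeated <- str_extract_all(s, "ab+")[[1]]
b_repeated

# With a group, + applies to the whole "ab" unit.
ab_repeated <- str_extract_all(s, "(ab)+")[[1]]
ab_repeated
```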

So now we take the previous pattern and group it:

    (<word boundary><Capital letter><at least five lowercase letters><word boundary>){match 1+ times}

In [57]:
# match one or more cap words in a row

## 🤔 Quiz

Besides Professor McGonagall, who is the other Professor mentioned in ch1?

<ol style="list-style-type: upper-alpha;">
    <li>Prof. Muggles</li>
    <li>Prof. Terhorst</li>
    <li>Prof. Dumbledore</li>
    <li>Prof. Dursley</li>
</ol>


In [65]:
# other professor

## Negations
Earlier we looked at quotations. The first quotation in chapter 1 is:

In [58]:
str_sub(ch1, 2150, 2163)

How can we find other quotes? The pattern for a quote is a quotation mark, followed by any number of things that are not a quotation mark, followed by another quotation mark:

    <quotation mark><anything that is not a quotation mark><quotation mark>


To match this, we will use a *negation*. A negation is a character class that begins with the character "^". It matches anything that is *not* inside the character class:

In [59]:
str_view("match doesn't match", "[^aeiou]+")  # str_view_all() is deprecated as of stringr 1.5.0

[1] │ <m>a<tch d>oe<sn't m>a<tch>

To match a quotation, we'll input the pattern that we specified above:

In [60]:
str_view('"Here is a quotation", said the professor. "And here is another."',
         '"[^"]+"')

[1] │ <"Here is a quotation">, said the professor. <"And here is another.">

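Why not simply write `".+"`? Quantifiers are greedy: they match as much as they can, so `.+` runs past the first closing quotation mark all the way to the last one. The notes do not discuss greediness, so treat this as a side sketch showing why the negated class is the safer choice here:

```r
library(stringr)

s <- '"Here is a quotation", said the professor. "And here is another."'

# Greedy wildcard: one long match from the first quote to the last quote.
greedy <- str_extract_all(s, '".+"')[[1]]
length(greedy)

# Negated class: the match must stop at the next quotation mark.
negated <- str_extract_all(s, '"[^"]+"')[[1]]
negated
```
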
## 🤔 Quiz

How many quotations are there in Ch. 1? (Use the pattern shown above.)

<ol style="list-style-type: upper-alpha;">
    <li>Less than 50</li>
    <li>50-100</li>
    <li>100-150</li>
    <li>150-200</li>
</ol>


In [70]:
# number of quotes

## Backreferences
Parentheses define groups that can be referred to later in the match as `\1`, `\2` etc. This is called a backreference. For example:

    (.)\1

will match the same character repeated twice in a row:

In [26]:
"eel" %>% str_view("(.)\\1", match = TRUE)

[1] │ <ee>l

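Here is a small sketch of the same pattern on a made-up vector of words, using `str_subset()` to keep only the strings that match and `str_extract()` to pull out the doubled letter:

```r
library(stringr)

w <- c("eel", "book", "cat", "letter")

# Keep the words that contain some character twice in a row.
doubled_words <- str_subset(w, "(.)\\1")
doubled_words

# Extract the doubled pair itself (NA where there is none).
doubled_pair <- str_extract(w, "(.)\\1")
doubled_pair
```
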
## 🤔 Quiz

What does this regular expression match?:

```
(..).*\\1
```


<ol style="list-style-type: upper-alpha;">
    <li>Any word that starts and ends with the same character, e.g. `alpha`</li>
    <li>Any word that starts and ends with the same two characters, e.g. `church`</li>
    <li>Any word that ends with two characters that are found earlier in the string, e.g. `therefore`</li>
    <li>I hate regular expressions.</li>
</ol>

In [74]:
# what's it match?

## Anchors
Sometimes we want a match to occur at a particular position in the string. For example, "all words which start with b". For this we have the special anchor characters: `^` and `$`. The caret `^` matches the beginning of a string. The `$` matches the end.

In [76]:
x <- c('apple', 'banana', 'pear')
str_view(x, '^b')
str_view(x, 'r$')
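
For a non-visual version of the same idea, the sketch below uses `str_subset()` with the same toy vector, and combines both anchors in one pattern so that it has to describe the whole string:

```r
library(stringr)

x <- c("apple", "banana", "pear")

starts_with_b <- str_subset(x, "^b")  # anchored at the start
ends_with_r <- str_subset(x, "r$")    # anchored at the end

# Both anchors together: starts with "a" AND ends with "e".
a_to_e <- str_subset(x, "^a.*e$")

starts_with_b
ends_with_r
a_to_e
```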

## 🤔 Quiz

What does this regular expression match?:

```
^(.).*\\1$
```


<ol style="list-style-type: upper-alpha;">
    <li>Any word that starts and ends with the same character, e.g. `alpha`</li>
    <li>Any word that starts and ends with the same two characters, e.g. `church`</li>
    <li>Any word that ends with a character that is found earlier in the string, e.g. `therefore`</li>
    <li>I hate regular expressions.</li>
</ol>

In [76]:
# solution

### Challenge Question
This command loads the text of Hamlet:

In [59]:
hamlet <- readLines(url("http://erdani.com/tdpl/hamlet.txt"))
writeLines(hamlet[1:15])


1604


THE TRAGEDY OF HAMLET, PRINCE OF DENMARK


by William Shakespeare



Dramatis Personae

  Claudius, King of Denmark.
  Marcellus, Officer.


### Exercise

Shakespeare used a lot of [poetic contractions](https://en.wikipedia.org/wiki/Poetic_contraction): `'tis`, `'twas`, `o'er`, etc.

**Beginner** Write a regexp that matches all such contractions

**Advanced** Print a frequency table of the top 10 most common.

In [67]:
str_extract_all(
        str_to_lower(hamlet), "[a-z]*'[a-z]+") %>% unlist %>% 
    print

   [1] "hamlet's"       "who's"          "'tis"           "'tis"          
   [5] "reliev'd"       "appear'd"       "'tis"           "'twill"        
   [9] "that's"         "that's"         "usurp'st"       "'tis"          
  [13] "on't"           "frown'd"        "'tis"           "is't"          
  [17] "appear'd"       "prick'd"        "dar'd"          "esteem'd"      
  [21] "seal'd"         "seiz'd"         "return'd"       "design'd"      
  [25] "shark'd"        "in't"           "e'en"           "mind's"        
  [29] "neptune's"      "i'll"           "country's"      "'tis"          
  [33] "'tis"           "'tis"           "'gainst"        "saviour's"     
  [37] "hallow'd"       "o'er"           "let's"          "do't"          
  [41] "brother's"      "'twere"         "barr'd"         "brother's"     
  [45] "fail'd"         "nephew's"       "what's"         "is't"          
  [49] "father's"       "seal'd"         "know'st"        "'tis"          
  [53] "'seems"         "

### `str_extract`
`str_extract(v, re)` extracts the first substring matched by `re` from each element of `v`. Another way to think of this is as returning the first portion of each string that `str_view` highlights:

In [91]:
q = 'Research is formalized curiosity. It is poking and prying with a purpose.'
# re to match capitalized words
# re = NA
# str_view(q, re)
# str_extract(q, re)

To get every match in each string rather than only the first, use `str_extract_all`, which returns a list with one character vector per input string:

In [92]:
str_view(q, re)  # highlights every match; str_view_all() is deprecated as of stringr 1.5.0
str_extract_all(q, re)

### `str_match`
`str_match(v, re)` will create a matrix out of the grouped matches in `re`. The first column has the whole match, and one additional column is added for each capture group (each pair of parentheses). If the pattern does not match, you will get `NA`s.

In [47]:
head(str_match(words, '^(.).*(.)$'))

     [,1]       [,2] [,3]
[1,] NA         NA   NA  
[2,] "able"     "a"  "e" 
[3,] "about"    "a"  "t" 
[4,] "absolute" "a"  "e" 
[5,] "accept"   "a"  "t" 
[6,] "account"  "a"  "t" 

### `str_replace`
`str_replace(v, re, rep)` will replace the first match of `re` in each element of `v` with `rep` (use `str_replace_all` to replace every match). The most basic usage is as a sort of find and replace:

In [50]:
str_replace('Give me liberty or give me death', '\\w+$', 'pizza')
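
One detail worth checking for yourself: `str_replace()` changes only the first match in each string, while `str_replace_all()` changes every match. A minimal sketch using the same sentence (the replacement word is arbitrary):

```r
library(stringr)

s <- "Give me liberty or give me death"

# Only the first "me" is replaced.
first_only <- str_replace(s, "me", "us")
first_only

# Every "me" is replaced.
every_match <- str_replace_all(s, "me", "us")
every_match
```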

A very useful feature of regexp replacements is the ability to use backreferences:

In [None]:
# Your code here
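
As one possible illustration (not necessarily the example used in lecture), backreferences in the replacement string let you rearrange the pieces you captured. The vector `full_names` below is made up; `\\1` and `\\2` refer to the first and second capture groups.

```r
library(stringr)

full_names <- c("Potter, Harry", "Granger, Hermione")

# Capture the surname and the given name, then emit them in swapped order.
first_last <- str_replace(full_names, "(\\w+), (\\w+)", "\\2 \\1")
first_last
```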