## Lab 8 (March 21):
### Strings

Course page: https://ambujtewari.github.io/stats306-winter2022/

Lab page: https://bosafoagyare.netlify.app/courses/stats306-w22/
 <br> <br>
Special thanks to **Ryan Duncan**, whose material has been very helpful in preparing this lab.

  Today, we are going to look at:   
 - [Strings]()
 - [Regular Expressions]()

 






<br><br>
> ## Let's start by loading our packages. It is good practice to load all packages at the top of your code.

In [None]:
library(tidyverse)       
options(repr.plot.width=10, repr.plot.height=8)    ## Set the dimension of all plots 

---

<br> <br>


## **(1) Regular Expressions**
Regular expressions (regex) are a way to describe patterns in text and are used to search for and match certain patterns in strings. For instance, say that you want to find and extract all the email addresses in a document automatically. How might we do that?

Let us look at some tools that we will employ in finding patterns in strings:
<br><br><br><br>
### **(1.1) Special characters**
- `. \ | ( ) [ ] ^ $ { } * + ?`
- Note that `\t` and `\n` match the tab and newline characters.
- If you want the "literal" versions of any of the reserved characters, you will need to escape them with a backslash `\`, e.g. `[\.\\\|]`
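
To make the difference between a special character and its escaped, literal version concrete, here is a minimal sketch. The vector `x` and the result names are made up for illustration. Note that inside an R string each regex backslash is written twice.

```r
library(stringr)  # part of the tidyverse

# Hypothetical example vector; "tab\there" contains a real tab character
x <- c("3.14", "314", "a|b", "tab\there")

# Unescaped `.` is a wildcard, so every non-empty string matches
has_any_char <- str_detect(x, ".")

# Escaped `\.` matches only a literal dot
has_dot <- str_detect(x, "\\.")

# Escaped `\|` matches only a literal pipe
has_pipe <- str_detect(x, "\\|")

# `\t` matches the tab character
has_tab <- str_detect(x, "\\t")
```

Only `"3.14"` contains a literal dot, only `"a|b"` a literal pipe, and only the last string a tab, while the bare `.` matches all four strings.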


<br> 
### **(1.2) Character classes**
- `.` matches anything (wildcard)
- `[aeiou]` matches a single character in the set provided
- `[^aeiou]` matches a single character **NOT** in the set
- `[a-e]` matches a range, equivalent to `[abcde]`
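
The character classes above can be tried out with a short sketch like the following. The vector `fruits` and the result names are hypothetical, chosen only for illustration.

```r
library(stringr)  # part of the tidyverse

fruits <- c("apple", "kiwi", "fig")  # hypothetical example vector

# `.` matches any single character, so "a.p" matches "app" in "apple"
wildcard <- str_detect(fruits, "a.p")

# First vowel in each string
first_vowel <- str_extract(fruits, "[aeiou]")

# First character that is NOT a vowel
first_non_vowel <- str_extract(fruits, "[^aeiou]")

# First character in the range a-e; NA where there is no such character
first_a_to_e <- str_extract(fruits, "[a-e]")
```

`str_extract()` returns `NA` for strings without a match, which is why `first_a_to_e` is `NA` for `"kiwi"` and `"fig"`.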

<br> 
### **(1.3) Shorthand**
- `\w` matches a "word" character, equivalent to `[a-zA-Z0-9_]`, i.e., alphanumeric characters plus the underscore.
- `\s` matches any whitespace, including tabs and newlines
- `\d` matches digits, equivalent to `[0-9]`
- `\W`, `\S`, and `\D` match the opposite of the lower-case versions
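
As a rough illustration of the shorthand classes, the sketch below applies them to a made-up string `s`. In R code each backslash is doubled, so `\d` is written `"\\d"`.

```r
library(stringr)  # part of the tidyverse

s <- "Lab 8 is on March_21"  # hypothetical example string

# Every single digit
digits <- str_extract_all(s, "\\d")[[1]]

# Runs of word characters; the underscore counts as a word character
words <- str_extract_all(s, "\\w+")[[1]]

# Number of whitespace characters
n_spaces <- str_count(s, "\\s")

# Runs of non-digit characters (the upper-case class is the opposite of \d)
non_digit_runs <- str_extract_all(s, "\\D+")[[1]]
```

Because `_` is a word character, `"March_21"` comes out as a single word.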


<br> 
### **(1.4) Grouping**
- `()` are used to group patterns together. Groups can be combined with any of the operators below, and they can also be used to extract individual portions of a match, as we will see later.
- `\1`, `\2`, etc. refers to the first, second, etc. group in the match.

**[NB:]** Grouping is very important! Good times to use grouping are:
- for repetition (of the same match)
- for providing alternatives

We will explore more in the examples below.
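
A small sketch of the uses of grouping mentioned above (repetition, alternatives, extracting a group, and back-references). The vector `x` and the result names are assumptions made for illustration, and `str_match()` is a stringr function not otherwise used in this lab.

```r
library(stringr)  # part of the tidyverse

x <- c("banana", "coconut", "apple", "gray", "grey")  # hypothetical example vector

# Repetition of a group: "an" must occur twice in a row
rep_an <- str_detect(x, "(an){2}")

# Alternatives inside a group: "gray" or "grey"
gray_grey <- str_detect(x, "gr(a|e)y")

# str_match() returns a matrix; column 2 holds the text captured by group 1
captured <- str_match(x, "gr(a|e)y")[, 2]

# Back-reference: `\1` repeats whatever group 1 matched (a doubled character)
doubled <- str_extract(x, "(.)\\1")
```

Here only `"banana"` contains `"anan"`, and only `"apple"` contains a doubled character (`"pp"`).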

<br> 
### **(1.5) Operators**
- `|` is the OR operator and allows matches of either side
- `{}` describes how many times the preceding character or group must occur:
    + `{m}` must occur exactly `m` times
    + `{m,n}` must occur between `m` and `n` times, inclusive
    + `{m,}` must occur at least `m` times
- `*` means the preceding character or group can appear zero or more times, equivalent to `{0,}`
- `+` means the preceding character or group must appear one or more times, equivalent to `{1,}`
- `?` means the preceding character or group can appear zero or one time, equivalent to `{0,1}`
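
To see how the quantifiers differ, the following sketch applies each of them to the letter `u` in a made-up vector `x`. Without parentheses, a quantifier applies only to the single character before it.

```r
library(stringr)  # part of the tidyverse

x <- c("color", "colour", "colouur", "clr")  # hypothetical example vector

zero_or_one  <- str_detect(x, "colou?r")      # "u" zero or one time
zero_or_more <- str_detect(x, "colou*r")      # "u" zero or more times
one_or_more  <- str_detect(x, "colou+r")      # "u" one or more times
exactly_two  <- str_detect(x, "colou{2}r")    # "u" exactly twice
one_to_two   <- str_detect(x, "colou{1,2}r")  # "u" once or twice
at_least_one <- str_detect(x, "colou{1,}r")   # same as "colou+r"
```

Comparing `one_or_more` with `at_least_one` confirms that `+` and `{1,}` behave identically.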

**- Challenge**: What do the following mean?  
1. `abc*` :
2. `abc+` :       
3. `abc?` :       
4. `abc{2}` : 
5. `abc{2,}` :
6. `abc{2,5}` :   
7. `a(bc)*` :
8. `a(bc){2,5}` : 
9. `a(b|c)` :

<br> 
### **(1.6) Anchors**
- `^` matches the start of a string (or line)
- `$` matches the end of a string (or line)
- `\b` matches a word "boundary"
- `\B` matches any position that is **not** a word boundary
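
The anchors can be compared side by side with a sketch like this one, which looks for `cat` in a hypothetical vector `x`.

```r
library(stringr)  # part of the tidyverse

x <- c("cat", "concat", "catalog", "a cat nap")  # hypothetical example vector

starts_with <- str_detect(x, "^cat")        # "cat" at the start of the string
ends_with   <- str_detect(x, "cat$")        # "cat" at the end of the string
whole       <- str_detect(x, "^cat$")       # the string is exactly "cat"
as_word     <- str_detect(x, "\\bcat\\b")   # "cat" as a stand-alone word
inside_word <- str_detect(x, "\\Bcat")      # "cat" preceded by a word character
```

Anchors match positions rather than characters, so `^cat$` matches only the string that is exactly `"cat"`, while `\bcat\b` also finds the word inside `"a cat nap"`.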

**- Challenge**: What do the following mean?  
1. `^The`  :
2.  `end$` :
3. `^The end$` : 
4. `pond`:


## **(2) Strings**

In [None]:
string1 = "Michigan: BIG 10 Champion!!"
string1

In [None]:
## Some states of the US 
our_state = "Michigan"
north_east_states = c("Connecticut", "Maine", "Massachusetts", "Vermont", "New Hampshire", "Rhode Island")
lakemich_states = c("Wisconsin", "Illinois", "Michigan", "Indiana")

In [None]:
## Find our state (Michigan)
our_state %in% north_east_states
our_state %in% lakemich_states

**String versus vector of strings**

In [1]:
## String vs vector of strings
string2 <- "umich"
str_vec <- c("berkeley", "harvard", "umich", "washington", "madison", "stanford")
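
One way to see the difference between a single string and a vector of strings is to compare `length()`, which counts elements, with `str_length()`, which counts characters in each element. The sketch below reuses the two objects defined above; the result names are made up for illustration.

```r
library(stringr)  # part of the tidyverse

string2 <- "umich"
str_vec <- c("berkeley", "harvard", "umich", "washington", "madison", "stanford")

# A single string is a character vector with one element
n_elements_string <- length(string2)
n_elements_vec    <- length(str_vec)

# str_length() works element by element and returns one number per string
n_chars_string <- str_length(string2)
n_chars_vec    <- str_length(str_vec)

# %in% checks for an exact element, str_detect() for a pattern inside each element
exact_member <- string2 %in% str_vec
contains_mich <- str_detect(str_vec, "mich")
```

Most `stringr` functions are vectorised in this way: given a vector of strings, they return one result per element.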

In [None]:
## Gathering the appropriate columns to obtain a longer table
relig_income %>% 
  pivot_longer(cols = -religion, names_to = "income", values_to = "counts") %>%
  print()


### **(2.1) Some important string functions**
All functions in `stringr` start with `str_` and take a vector of strings as the first argument:
- `str_detect(<STRING>, pattern)`: tells you whether each string contains a match to the pattern. Returns `TRUE`/`FALSE`.
- `str_extract(<STRING>, pattern)`: extracts the text of the match. Think of it as extracting what was detected. Returns the first match found in each string, as a vector.
- `str_extract_all(<STRING>, pattern)`: returns **EVERY** match in each string (as a list).
- `str_subset(<STRING>, pattern)`: returns the strings that contain a match (as a vector).
- `str_count(<STRING>, pattern)`: counts the number of matches in each string.
- `str_length(<STRING>)`: computes the number of characters in each string.
- `str_replace(<STRING>, pattern, replacement)`: replaces the first match in each string with new text.
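
Here is a brief sketch of the counting, length and replacement functions from the list above, using a hypothetical vector `x`. `str_replace_all()` is not in the list above; it is the `stringr` counterpart that replaces every match rather than only the first.

```r
library(stringr)  # part of the tidyverse

x <- c("banana", "kiwi", "fig")  # hypothetical example vector

# Number of vowels in each string
n_vowels <- str_count(x, "[aeiou]")

# Number of characters in each string
n_chars <- str_length(x)

# Replace only the first vowel in each string
first_replaced <- str_replace(x, "[aeiou]", "_")

# Replace every vowel in each string
all_replaced <- str_replace_all(x, "[aeiou]", "_")
```

`str_replace()` leaves later matches untouched, so `"banana"` becomes `"b_nana"`, whereas `str_replace_all()` gives `"b_n_n_"`.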

### **(2.2) Detect, Extract and Subset**
It is very important to know the differences between `str_extract`, `str_extract_all` and `str_subset`. Let's look at the examples below:

What do you think is happening?

In [6]:
x <- c("why", "video", "cross", "extra", "deal", "authority")

## Detect
str_detect(x, "[aeiou]")

In [None]:
## Extract
str_extract(x, "[aeiou]")

In [None]:
## Extract all
str_extract_all(x, "[aeiou]")


In [None]:
## Subset
str_subset(x, "[aeiou]")

## **(3) Applications**

In [7]:
## Our string ("NOT VECTOR") to use


baseball = "According to Baseball Reference’s wins above average, The Red Sox had the best 
outfield in baseball— one-tenth of a win ahead of the Milwaukee Brewers, 11.5 to 11.4. And 
that’s despite, I’d argue, the two best position players in the NL this year (Christian 
Yelich and Lorenzo Cain) being Brewers outfielders. More importantly, the distance from 
Boston and Milwaukee to the third-place Yankees is about five wins. Two-thirds of the Los 
Angeles Angels’ outfield is Mike Trout (the best player in baseball) and Justin Upton (a 
four-time All-Star who hit 30 home runs and posted a 122 OPS+ and .348 wOba this year), 
and in order to get to 11.5 WAA, the Angels’ outfield would have had to replace right 
fielder Kole Calhoun with one of the three best outfielders in baseball this year by WAA."



**(i) Write a regex that captures all capitalized words**

In [None]:
## Your code here

**(ii) Write a regex that captures all hyphenated words**

In [None]:
## Your code here

**(iii) Write a regex that captures all words with two consecutive vowels**

In [None]:
## Your code here

**(iv) Write a regex that captures all words with a repeated letter**


In [None]:
## Your code here

**(v) Write a regex that captures all the numbers**

In [None]:
## Your code here

**(vi) Write a regex that matches `this` and `the` but not `third`**

In [None]:
## Your code here

<br>  
**[NB]:** Any time you want to use a backslash `\` in a regex pattern in R, you need to write a double backslash `\\` instead. This is because R has its own layer of string processing that also uses backslashes to escape reserved characters, so you must give R a literal backslash for it to pass a single backslash on to the regex function.


In [8]:
## EXAMPLE
naive = "a.c"
dot = "a\\.c"

cat(naive)
str_detect(c("abc", "a.c", "bef"), naive) # matches anything a-blank-c because . is a wildcard

cat(dot)
str_detect(c("abc", "a.c", "bef"), dot)

a.c

a\.c

<br><br>
# **Exercise**
Complete the following tasks using the `stringr::words` dataset. Find all words that:




**(1) Start with `y`**

In [None]:
## Your code here

**(2) End with `x`** 

In [None]:
## Your code here

**(3) End with `ed`, but not with `eed`**

In [None]:
## Your code here

**(4)  End with `ing` or `ise`**

In [None]:
## Your code here

**(5) End with the same two-letter sequence they start with (e.g. `church`)**

In [None]:
## Your code here

**(6) Are a 5-letter palindrome (e.g. `refer`)**

In [None]:
## Your code here

**(7) Match the valid date strings in the vector below:**

In [None]:
dates = c('2012-05-13', '2014-12-31', '1991-06-14', '1991/06/14',
          '200a-05-13',  # invalid year
          '2014-15-20',  # invalid month
          '2014-00-20',  # invalid month
          '2016-04-35',  # invalid day
          '2014-12-00',  # invalid day
          '2013/03-25')  # non-matching separators

In [None]:
## Your code here