# Exercise: Strings

When you merge data from several data frames, the variable names sometimes do not match, or the entries of your key variable differ between data frames. In those cases you need to clean up the character vectors before merging.

When you finish this exercise, you will know how to:
1. handle characters and strings
2. use the tools in the **stringr** package
3. perform regular expression manipulation

Load the tidyverse packages.

In [1]:
# Load tidyverse
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.4     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## **stringr** package

In this exercise, you will use the functions in the **stringr** package to handle and manipulate character vectors. To familiarize yourself with these functions, check out the [cheatsheet for **stringr**](https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf).

In [2]:
# Read the documentation of stringr package
help(package = "stringr")

Documentation for package ‘stringr’


		Information on package ‘stringr’

Description:

Package:            stringr
Title:              Simple, Consistent Wrappers for Common String
                    Operations
Version:            1.4.0
Authors@R:          c(person(given = "Hadley", family = "Wickham", role
                    = c("aut", "cre", "cph"), email =
                    "hadley@rstudio.com"), person(given = "RStudio",
                    role = c("cph", "fnd")))
Description:        A consistent, simple and easy to use set of
                    wrappers around the fantastic 'stringi' package.
                    All function and argument names (and positions) are
                    consistent, all functions deal with "NA"'s and zero
                    length vectors in the same way, and the output from
                    one function is easy to feed into the input of
                    another.
License:            GPL-2 | file LICENSE
URL:                http://stringr.

## String basics

You can create strings using either single quotes or double quotes. 

In [3]:
sentence_1 <- "This is a sentence inside double quotes."
sentence_1

In [4]:
sentence_2 <- 'This is also a sentence inside single quotes.'
sentence_2

Check the type of vector.

In [5]:
# What type of vector is "sentence_1" ?
typeof(sentence_1)

## Accessing individual characters 

Use the **`str_sub()`** function to access individual characters or substrings. The **`str_sub()`** function takes three arguments: (1) a character vector, (2) a start position, and (3) an end position. Each position can be either a positive integer, which counts from the left, or a negative integer, which counts from the right.

In [7]:
# Access the third letter of the first word in sentence_1
str_sub(sentence_1, start = 3, end = 3)

In [8]:
# Access the word "sentence" in sentence_2
str_sub(sentence_2, start = 16, end = 23)
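
Negative positions are easiest to see on a short string. The following is a minimal sketch; the string `fruit_name` is a made-up example and is not one of the objects defined above.

```r
library(stringr)

# A hypothetical example string
fruit_name <- "pineapple"

# Last character: both positions count from the right
last_char <- str_sub(fruit_name, start = -1, end = -1)   # "e"

# Last five characters: from the fifth-to-last position to the end
last_five <- str_sub(fruit_name, start = -5, end = -1)   # "apple"

# Positive and negative positions can be mixed:
# from the fifth character to the second-to-last character
middle <- str_sub(fruit_name, start = 5, end = -2)       # "appl"

print(c(last_char, last_five, middle))
```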

You can also use the **`str_sub()`** function to modify a string by assigning new characters to the selected positions.

In [9]:
# Change the third and fourth letters of the first word in sentence_1 to "a" and "t", respectively.
str_sub(sentence_1, start = 3, end = 4) <- "at"
sentence_1

In the code below, you will play around with the **`words`** dataset in the **stringr** package. The **`words`** dataset contains commonly used words in the English language. It comes from the **`rcorpora`** package written by Gabor Csardi; the data was collected by Darius Kazemi and is available at https://github.com/dariusk/corpora.

In [10]:
# Assign the "words" dataset to an object: words
words <- stringr::words

In [11]:
# How many words are included in the dataset?
length(words)

In [12]:
# Check the first 5 words
print(words[1:5])

[1] "a"        "able"     "about"    "absolute" "accept"  


In [13]:
# Randomly sample 10 words from the vector using the sample() function
set.seed(21) # fix the seed of the random number generator so the sample is reproducible
some_words <- sample(words, 10)
some_words

Note that you are randomly sampling 10 words from the dataset, so without a seed your 10 words would change every time you run the cell above. Calling the **`set.seed( )`** function first fixes the starting state of the random number generator, so the same words are drawn on every run. The seed is an integer, so you can change it, e.g. to your favorite or lucky number, and see which word list you obtain.
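
The effect of the seed can be checked directly. The sketch below is only an illustration; the object names `draw_a`, `draw_b`, and `draw_c` are made up for this example and are not used elsewhere in the exercise.

```r
library(stringr)

# Two draws that each start from the same seed are identical
set.seed(21)
draw_a <- sample(stringr::words, 10)
set.seed(21)
draw_b <- sample(stringr::words, 10)
identical(draw_a, draw_b)   # TRUE

# A further draw without resetting the seed continues from the current
# generator state, so it will almost certainly differ from the first draw
draw_c <- sample(stringr::words, 10)
identical(draw_a, draw_c)
```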

## Changing cases

Use the **`str_to_upper( )`** function to change all letters to uppercase. Conversely, use the **`str_to_lower( )`** function to change all letters to lowercase. If you want to capitalize the first character in a word, then use the **`str_to_title( )`** function.

In [14]:
# Change all letters to uppercase
str_to_upper(some_words)

In [15]:
# Capitalize the first letter only
str_to_title(some_words)

## String ordering and sorting

Use the **`str_order( )`** function to return a vector of integers giving the positions of the elements in alphabetical order, i.e. the indices that would sort the character vector.

In [16]:
# Determine the alphabetical order 
str_order(some_words)

Use the **`str_sort( )`** function to rearrange the words in alphabetical order.

In [17]:
# Sort the words alphabetically
str_sort(some_words)
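
The two functions are closely related: indexing a vector by the result of `str_order()` gives the same result as `str_sort()`. The sketch below uses a small made-up vector, `small_vec`, so that the result is easy to check by eye.

```r
library(stringr)

# A small hypothetical vector
small_vec <- c("pear", "apple", "fig")

ord <- str_order(small_vec)      # 2 3 1: "apple" is element 2, "fig" is 3, "pear" is 1
sorted <- str_sort(small_vec)    # "apple" "fig" "pear"

# Indexing by the order reproduces the sorted vector
identical(small_vec[ord], sorted)   # TRUE

# Reverse alphabetical order
reversed <- str_sort(small_vec, decreasing = TRUE)   # "pear" "fig" "apple"
```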

## Pattern matching

Most functions in the **`stringr`** package work with patterns. A pattern could be a run of letters that starts with a capital letter, as in a person's name, or a series of digits separated by hyphens **`-`**, as in a telephone number. The pattern matching functions in the **`stringr`** package share the same first two arguments: (1) a character vector of strings to process and (2) a single pattern to match.

Use the **`str_detect( )`** function to determine whether each element of a character vector matches a pattern. It returns a logical vector with `TRUE` where a match is found and `FALSE` where it is not.

In [24]:
# Are there words in my list that contain a letter "e"?
some_words
str_detect(some_words, "e")

The **`str_count( )`** function returns the number of matches in each element.

In [25]:
# Count the number of times the letter "o" appears in each word
str_count(some_words, "o")

Use the **`str_subset( )`** function to return the elements of a character vector that match a regular expression.

In [27]:
# Subset words containing letter "e"
str_subset(some_words, "e")
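
Because the contents of `some_words` depend on the seed, it can help to compare the three functions on a fixed input. This is a minimal sketch; the vector `demo_words` is made up for illustration.

```r
library(stringr)

# A small hypothetical vector with known contents
demo_words <- c("tree", "sky", "eleven")

has_e   <- str_detect(demo_words, "e")   # TRUE FALSE TRUE
count_e <- str_count(demo_words, "e")    # 2 0 3
with_e  <- str_subset(demo_words, "e")   # "tree" "eleven"

# str_subset() returns the same elements as logical indexing with str_detect()
identical(with_e, demo_words[has_e])     # TRUE
```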

The **`str_locate( )`** function identifies the first position of a pattern and returns a numeric matrix with columns start and end. The **`str_locate_all( )`** function identifies all matches and returns a list of numeric matrices.

In [28]:
str_locate(some_words, "e")

| start | end |
|-------|-----|
| 4 | 4 |
| 5 | 5 |
| 7 | 7 |
| 5 | 5 |
| 6 | 6 |
| NA | NA |
| 4 | 4 |
| NA | NA |
| NA | NA |
| NA | NA |


In [29]:
str_locate_all(some_words, "e")

start,end
4,4

start,end
5,5

start,end
7,7

start,end
5,5

start,end
6,6

start,end

start,end
4,4

start,end

start,end

start,end


Use the **`str_extract( )`** function to extract the text of the first match in each string. It returns a character vector, with `NA` where no match is found. Use the **`str_extract_all( )`** function to extract all matches; it returns a list of character vectors.

In [32]:
some_words
str_extract(some_words, "in")

In [33]:
str_extract_all(some_words, "in")

Use the **`str_replace( )`** function to replace the first match in each string; it returns a character vector. Use the **`str_replace_all( )`** function to replace all matches.

In [36]:
# Change lowercase "o" to uppercase "O"
some_words
str_replace(some_words, "o", "O")

In [37]:
str_replace_all(some_words, "o", "O")

Note that the **`str_replace_all( )`** function changes all matched characters in every word in the list while the **`str_replace( )`** function changes only the first character that matched in each word.
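
The difference is easy to verify on a string that contains the pattern several times. A minimal sketch, using a made-up string `demo_text`:

```r
library(stringr)

# A hypothetical string containing four lowercase "o" characters
demo_text <- "foo boo"

first_only <- str_replace(demo_text, "o", "O")       # "fOo boo"
every_one  <- str_replace_all(demo_text, "o", "O")   # "fOO bOO"

print(c(first_only, every_one))
```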

Use the **`str_c( )`** function to combine two or more strings. The `collapse` argument joins the elements of a character vector into a single string, placing the given separator between them.

In [38]:
str_c(some_words, collapse = "-")
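
`str_c()` also has a `sep` argument, which is easy to confuse with `collapse`. The sketch below contrasts the two on made-up vectors (`first_part` and `second_part` are illustrative names only).

```r
library(stringr)

# Two hypothetical vectors of equal length
first_part  <- c("a", "b", "c")
second_part <- c("1", "2", "3")

# sep: combine the vectors element by element, returning three strings
pairs <- str_c(first_part, second_part, sep = "-")   # "a-1" "b-2" "c-3"

# collapse: join the elements of one vector into a single string
one_string <- str_c(first_part, collapse = "-")      # "a-b-c"

# Both together: combine element by element, then join the results
both <- str_c(first_part, second_part, sep = "-", collapse = "+")   # "a-1+b-2+c-3"
```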

Use the **`str_split( )`** function to split a string into a variable number of pieces, which returns a list of character vectors.

In [40]:
str_c(some_words, collapse = "-") %>%
    str_split("-")

## Regular expression (REGEX)

Regular expressions provide a flexible syntax for describing patterns in strings. They are the default pattern engine in **`stringr`**: each time you pass a plain string as the pattern to a pattern matching function, it is wrapped inside the **`regex( )`** function for you.

In [42]:
some_words
# The two lines of code below are equivalent
str_extract(some_words, "in")
str_extract(some_words, regex("in"))
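
One reason to call `regex()` explicitly is to change how the pattern is matched, for example with its `ignore_case` argument. A minimal sketch on a made-up vector, `demo_phrases`:

```r
library(stringr)

# A hypothetical vector with different capitalizations of "the"
demo_phrases <- c("The end", "the start", "THE middle")

# Default matching is case sensitive
case_sensitive <- str_detect(demo_phrases, "the")   # FALSE TRUE FALSE

# regex() with ignore_case = TRUE matches any capitalization
case_insensitive <- str_detect(demo_phrases, regex("the", ignore_case = TRUE))   # TRUE TRUE TRUE
```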

Regular expressions use special symbols to represent a digit, a letter, and so on. These symbols are described below. Here, you will be playing around with the **`sentences`** dataset in the **`stringr`** package, which is a collection of "Harvard sentences" used for standardized testing of voice.

In [46]:
# Assign the sentences dataset to an object named "sentences"
sentences <- stringr::sentences

In [49]:
# How many sentences are there in the sentences dataset?
length(sentences)

In [48]:
# Check the first 5 sentences
print(sentences[1:5])

[1] "The birch canoe slid on the smooth planks." 
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."     
[4] "These days a chicken leg is a rare dish."   
[5] "Rice is often served in round bowls."       


In [51]:
# Randomly sample 10 sentences using the sample() function
set.seed(43) # fix the seed of the random number generator so the sample is reproducible
some_sentences <- sample(sentences, 10)
print(some_sentences)

 [1] "The wide road shimmered in the hot sun."           
 [2] "The heap of fallen leaves was set on fire."        
 [3] "He sent the boy on a short errand."                
 [4] "Mud was spattered on the front of his white shirt."
 [5] "Cars and busses stalled in snow drifts."           
 [6] "We need an end of all such matter."                
 [7] "The marsh will freeze when cold enough."           
 [8] "The box was thrown beside the parked truck."       
 [9] "The office paint was a dull sad tan."              
[10] "Dimes showered down from all sides."               


### Match everything with a period **`.`**

Use the period **`.`** to match any character except a newline.

In [53]:
# Search for the pattern: "any character" + "in" + "any character"
str_extract(some_sentences, ".in.")

### Escaping characters

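Metacharacters, such as the period **`.`**, asterisk **`*`**, and plus sign **`+`**, can be matched as literal characters by escaping them with two backslashes **`\\`**. Two backslashes are needed because the pattern is written as an R string: the string `"\\."` holds the regular expression `\.`.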

In [54]:
# Check if sentences contain a literal period
str_detect(some_sentences, "\\.")
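
To see what escaping changes, compare an escaped and an unescaped period on strings with and without one. This is a sketch on a made-up vector, `demo_numbers`; it also shows `fixed()`, a stringr helper that treats the whole pattern as literal text.

```r
library(stringr)

# A hypothetical vector: only the first element contains a period
demo_numbers <- c("3.14", "314")

# Unescaped: "." matches any character, so both elements match
any_char <- str_detect(demo_numbers, ".")            # TRUE TRUE

# Escaped: only a literal period matches
literal_dot <- str_detect(demo_numbers, "\\.")       # TRUE FALSE

# fixed() matches the pattern literally, so no escaping is needed
fixed_dot <- str_detect(demo_numbers, fixed("."))    # TRUE FALSE

# The R string "\\." contains the two characters \. seen by the regex engine
writeLines("\\.")
```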

### Matching multiple characters

There are a number of patterns that match more than one character. There are five escaped operators that match specific classes of characters; three of them are described below.

The **`\d`** operator matches any digit. The complement, **`\D`**, matches any character that is not a decimal digit.

In [61]:
# Extract the digits for the years and months of age
str_extract_all("4y5m", "\\d")[[1]]

The **`\s`** operator matches any whitespace. This includes tabs, newlines, and form feeds.

In [64]:
# Replace all whitespaces with "_"
str_replace_all(some_sentences, "\\s", "_")

The **`\w`** operator matches any “word” character, which includes alphabetic characters, marks, and decimal numbers.

In [73]:
# Extract every word character in each sentence
str_extract_all(some_sentences, "\\w")
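
Each of these operators has an uppercase complement that matches everything the lowercase version does not. A minimal sketch on made-up input strings:

```r
library(stringr)

# \D: everything in "4y5m" that is not a digit
non_digits <- str_extract_all("4y5m", "\\D")[[1]]        # "y" "m"

# \S: everything that is not whitespace (the input contains a space and a tab)
non_space <- str_extract_all("a b\tc", "\\S")[[1]]       # "a" "b" "c"

# \W: everything that is not a word character
non_word <- str_extract_all("R2-D2!", "\\W")[[1]]        # "-" "!"
```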

### Character classes

Use **`[ ]`** to create your own character classes.

**`[abc]`** matches a, b, or c.\
**`[a-z]`** matches every character between a and z.\
**`[^abc]`** matches anything except a, b, or c.\
**`[\$\+]`** matches $ or +

You can also use the pre-built character classes for pattern matching.

**`[:alpha:]`** matches letters.\
**`[:lower:]`** matches lowercase letters.\
**`[:upper:]`** matches uppercase letters.\
**`[:digit:]`** matches digits.\
**`[:alnum:]`** matches letters and numbers.\
**`[:punct:]`** matches punctuation.\
**`[:graph:]`** matches letters, numbers, and punctuation.\
**`[:print:]`** matches letters, numbers, punctuation, and whitespace.\
**`[:space:]`** matches space characters (equivalent to \s).\
**`[:blank:]`** matches space and tab.
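
The classes listed above can be tried out on a single string. The sketch below uses a made-up string, `demo_line`; the pre-built classes are written inside an extra pair of brackets here, e.g. `[[:upper:]]`.

```r
library(stringr)

# A hypothetical string mixing letters, digits, and punctuation
demo_line <- "Call 555-0199 Now!"

vowels     <- str_extract_all(demo_line, "[aeiou]")[[1]]         # "a" "o"
digits     <- str_extract_all(demo_line, "[0-9]")[[1]]           # "5" "5" "5" "0" "1" "9" "9"
others     <- str_extract_all(demo_line, "[^a-zA-Z0-9 ]")[[1]]   # "-" "!"
capitals   <- str_extract_all(demo_line, "[[:upper:]]")[[1]]     # "C" "N"
punctuation <- str_extract_all(demo_line, "[[:punct:]]")[[1]]    # "-" "!"
```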

Use the alternation operator **`|`** to choose between two or more possible matches. For example, `abc|def` will match `abc` or `def`.

In [79]:
# Check which sentences contain indefinite articles: "a" or "an".
print(some_sentences)
str_detect(some_sentences, "\\sa\\s|\\san\\s")

 [1] "The wide road shimmered in the hot sun."           
 [2] "The heap of fallen leaves was set on fire."        
 [3] "He sent the boy on a short errand."                
 [4] "Mud was spattered on the front of his white shirt."
 [5] "Cars and busses stalled in snow drifts."           
 [6] "We need an end of all such matter."                
 [7] "The marsh will freeze when cold enough."           
 [8] "The box was thrown beside the parked truck."       
 [9] "The office paint was a dull sad tan."              
[10] "Dimes showered down from all sides."               


### Grouping

Use parentheses **`( )`** to define “groups” in your regular expression. A group is treated as a single unit, which limits the scope of operators such as **`|`**, and the text matched by each group can be extracted separately.

In [86]:
# Extract every definite article in each sentence: "the", or "The" at the beginning of a sentence.
str_extract_all(some_sentences, "(T|t)he")
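
Parentheses change what the alternation applies to, and `str_match()` returns the text matched by each group. A minimal sketch on a made-up vector, `demo_spellings`:

```r
library(stringr)

# A hypothetical vector of spellings
demo_spellings <- c("gray", "grey", "grid")

# Without a group, | splits the whole pattern: "gra" or "ey"
ungrouped <- str_extract(demo_spellings, "gra|ey")      # "gra" "ey" NA

# With a group, | only chooses between "a" and "e"
grouped <- str_extract(demo_spellings, "gr(a|e)y")      # "gray" "grey" NA

# str_match() returns a matrix: column 1 is the full match,
# column 2 is the text matched by the first group
groups <- str_match(demo_spellings, "gr(a|e)y")
groups[, 2]                                             # "a" "e" NA
```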

### Anchors

Use anchors to match at the start or end of the string.

**`^`** matches the start of string.\
**`$`** matches the end of the string.

In [103]:
# Extract "The" from sentences that begin with it (other sentences return an empty result)
str_extract_all(some_sentences, "^The")

In [104]:
# Extract the final "s." from sentences that end with the letter "s".
# Do not forget to escape the period, "."
str_extract_all(some_sentences, "s\\.$")
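
To get back the whole sentences rather than the matched text, the same anchored patterns can be passed to `str_subset()`. A minimal sketch on a made-up vector, `demo_sentences`:

```r
library(stringr)

# Three hypothetical sentences
demo_sentences <- c("The cat sat.", "See the cats.", "Then it rains.")

# "^The" also matches "Then", because only the first three letters are checked
starts_the <- str_subset(demo_sentences, "^The")        # "The cat sat." "Then it rains."

# Requiring whitespace after "The" keeps only the article
starts_article <- str_subset(demo_sentences, "^The\\s") # "The cat sat."

# Sentences whose last letter before the final period is "s"
ends_s <- str_subset(demo_sentences, "s\\.$")           # "See the cats." "Then it rains."
```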

### Repetition

You can control how many times a pattern matches with the repetition operators.\
**`?`** matches 0 or 1.\
**`+`** matches  1 or more.\
**`*`** matches 0 or more.

You can also specify the number of matches precisely using the curly braces **`{ }`**:\
**`{n}`** matches exactly n.\
**`{n,}`** matches n or more.\
**`{n,m}`** matches between n and m.
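
The repetition operators apply to the character or group immediately before them. The sketch below tries each one on made-up inputs (`demo_colors` and the two literal strings are illustrative only).

```r
library(stringr)

# Hypothetical spellings with zero, one, and two "u" characters
demo_colors <- c("color", "colour", "colouur")

zero_or_one  <- str_detect(demo_colors, "colou?r")      # TRUE TRUE FALSE
one_or_more  <- str_detect(demo_colors, "colou+r")      # FALSE TRUE TRUE
zero_or_more <- str_detect(demo_colors, "colou*r")      # TRUE TRUE TRUE
exactly_two  <- str_detect(demo_colors, "colou{2}r")    # FALSE FALSE TRUE

# Three digits, a hyphen, then four digits
phone <- str_extract("tel: 555-0199", "\\d{3}-\\d{4}")  # "555-0199"

# Runs of two or more digits
long_runs <- str_extract_all("a1 b22 c333", "\\d{2,}")[[1]]   # "22" "333"
```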

## Revisiting the data merging exercise

You will again merge the population data with the cumulative number of confirmed COVID-19 cases. Using the **`stringr`** functions, you will:
1. Modify the names of the variables so that the column names match in both datasets.
2. Modify the names of the countries so that they will match during merging.
3. Merge the two datasets.
4. Generate a graph of population versus the cumulative number of confirmed COVID-19 cases.
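
As a warm-up, the first three steps might look like the sketch below on toy data. Everything here is an assumption for illustration: the data frames `toy_population` and `toy_cases`, their column names, the country spellings, and the numbers are all made up and will not match the real datasets, so adapt the patterns to what you actually find in your data.

```r
library(stringr)
library(dplyr)

# Hypothetical toy data; the values are placeholders, not real figures
toy_population <- data.frame(
  Country.Name = c("United States", "Korea, Rep.", "Japan"),
  Population   = c(3, 2, 1)
)
toy_cases <- data.frame(
  country   = c("US", "Korea, South", "Japan"),
  confirmed = c(30, 20, 10)
)

# 1. Make the column names match: lowercase, dots to underscores, then rename
names(toy_population) <- names(toy_population) %>%
  str_to_lower() %>%
  str_replace_all("\\.", "_") %>%
  str_replace("^country_name$", "country")

# 2. Make the country names match; anchors avoid partial replacements
toy_population$country <- toy_population$country %>%
  str_replace("^United States$", "US") %>%
  str_replace("^Korea, Rep\\.$", "Korea, South")

# 3. Merge on the shared key
toy_merged <- inner_join(toy_population, toy_cases, by = "country")
toy_merged
```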

In [None]:
# Write your code below
