# <div style="text-align: right"> Chapter __11__</div>

# __Strings with stringr__

In [2]:
# libraries
library(tidyverse)

# config
repr_html.tbl_df <- function(obj, ..., rows = 6) repr:::repr_html.data.frame(obj, ..., rows = rows)
options(dplyr.summarise.inform = FALSE)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.1
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## __String Length__

In [3]:
str_length(c('a', 'R for data science', NA))

### __Combining Strings__

In [4]:
str_c('x', 'y')

In [6]:
str_c('x', 'y', sep = ', ')

As with most other functions in R, missing values are contagious. If you
want them to print as `"NA"`, use `str_replace_na()`:

In [7]:
x <- c('abc', NA)
str_c('|-', x, '-|')

In [8]:
str_c('|-', str_replace_na(x), '-|')

As shown in the preceding code, `str_c()` is vectorized, and it automatically
recycles shorter vectors to the same length as the longest:

In [9]:
str_c('prefix-', c('a', 'b', 'c'), '-suffix')
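
A small sketch of how recycling differs from collapsing, using a made-up vector `parts` that is not part of the notes above. The `sep` argument controls what goes between the arguments within each element, while `collapse` joins the resulting vector into a single string:

```r
library(stringr)

# hypothetical input vector, chosen only for illustration
parts <- c('a', 'b', 'c')

# element-wise: the length-1 strings are recycled, so the result has length 3
wrapped <- str_c('prefix-', parts, '-suffix')
wrapped

# collapse joins the elements of the result into one string
joined <- str_c(parts, collapse = ', ')
joined  # "a, b, c"
```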

### __Subsetting Strings__

You can extract parts of a string using `str_sub()`. As well as the
string, `str_sub()` takes `start` and `end` arguments that give the
(inclusive) position of the substring:

In [11]:
x <- c('Apple', 'Banana', 'Pear')
str_sub(x, 1, 3)

In [12]:
str_sub(x, -3, -1)

Note that `str_sub()` won’t fail if the string is too short; it will just
return as much as possible:

In [13]:
str_sub('ab', 1, 5)

You can also use the assignment form of `str_sub()` to modify
strings:

In [15]:
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
(x)



What does `str_trim()` do? What’s the opposite of `str_trim()`?


In [28]:
?str_trim

str_trim {stringr}, R Documentation

Arguments:

* `string`: A character vector.
* `side`: Side on which to remove whitespace (left, right or both).


In [29]:
str_trim('  12345  ')

In [30]:
str_trim('  12345  ', side = 'left')

In [34]:
str_pad('12345', 9)

In [35]:
str_pad('123', 5, side = 'both')
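
Judging from the calls above, `str_pad()` looks like the counterpart of `str_trim()`: one adds whitespace up to a given width, the other removes it. A minimal round-trip sketch, where the input `'abc'` and the width 7 are arbitrary choices:

```r
library(stringr)

# pad 'abc' to width 7, adding spaces on both sides
padded <- str_pad('abc', 7, side = 'both')
str_length(padded)  # 7

# trimming removes the padding again
trimmed <- str_trim(padded)
trimmed  # "abc"

# side = 'right' removes only the trailing whitespace
right_trimmed <- str_trim(padded, side = 'right')
right_trimmed  # "  abc"
```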

Write a function that turns (e.g.) a vector `c("a", "b", "c")`
into the string `a, b, and c`. Think carefully about what it
should do if given a vector of length 0, 1, or 2.

In [85]:
str_sep_comma <- function(vector) {
    if (length(vector) == 0) {
        return(vector)
    }
    else if (length(vector) == 1) {
        return(vector)
    }
    else if (length(vector) == 2) {
        return(str_c(vector[1], ' and ', vector[2]))
    }
    else {
        head <- str_c(vector[seq_len(length(vector) - 1)], ',')
        tail <- str_c("and", vector[[length(vector)]], sep = " ")
        return(str_c(c(head, tail), collapse = " "))
    }
}

In [86]:
str_sep_comma(c('a', 'c', 'e', 'f', 't', 'd'))

## __Matching Patterns with Regular Expressions__

Regexps are a very terse language that allows you to describe patterns
in strings. They take a little while to get your head around, but once
you understand them, you’ll find them extremely useful.

To learn regular expressions, we’ll use `str_view()` and
`str_view_all()`. These functions take a character vector and a regular
expression, and show you how they match. We’ll start with very
simple regular expressions and then gradually get more and more
complicated. Once you’ve mastered pattern matching, you’ll learn
how to apply those ideas with various stringr functions.

In [18]:
x <- c('apple', 'banana', 'pear')
str_view(x, 'an')

The next step up in complexity is `.`, which matches any character
(except a newline):

In [19]:
str_view(x, '.a.')

But if `"."` matches any character, how do you match the character
`"."`? You need to use an “escape” to tell the regular expression you
want to match it exactly, not use its special behavior. Like strings,
regexps use the backslash, `\`, to escape special behavior. So to match
a `.`, you need the regexp `\.`. Unfortunately this creates a problem.
We use strings to represent regular expressions, and `\` is also used as
an escape symbol in strings. So to create the regular expression `\.`
we need the string `"\\."`:

In [21]:
str_view(c('abc', 'a.c', 'bef'), 'a\\.c')

If `\` is used as an escape character in regular expressions, how do
you match a literal `\`? Well, you need to escape it, creating the regular
expression `\\`. To create that regular expression, you need to use a
string, which also needs to escape `\`. That means to match a literal `\`
you need to write `"\\\\"`: four backslashes to match one!

In [22]:
x <- 'a\\b'
writeLines(x)

a\b


In [23]:
str_view(x, '\\\\')
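
The two layers of escaping are easier to see when the string and the regexp are inspected separately. A minimal sketch with made-up inputs; `writeLines()` shows the characters a string actually holds, and `str_length()` counts them:

```r
library(stringr)

# the string "\\." holds two characters: a backslash and a dot
dot <- "\\."
writeLines(dot)
str_length(dot)  # 2

# as a regexp, \. matches only a literal dot
dot_hits <- str_detect(c('abc', 'a.c'), 'a\\.c')
dot_hits  # FALSE TRUE

# four backslashes in the source give two in the string,
# which the regexp engine reads as one literal backslash
backslash <- "\\\\"
str_length(backslash)  # 2
backslash_hits <- str_detect(c('a\\b', 'ab'), backslash)
backslash_hits  # TRUE FALSE
```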

### __Anchors__

By default, regular expressions will match any part of a string. It’s
often useful to anchor the regular expression so that it matches from
the start or end of the string. You can use:

* `^` to match the start of the string.
* `$` to match the end of the string.

In [24]:
x <- c('apple', 'banana', 'pear')
str_view(x, '^a')

In [25]:
str_view(x, 'a$')

To force a regular expression to only match a complete string,
anchor it with both `^` and `$`:

In [26]:
x <- c('apple pie', 'apple', 'apple cake')
str_view(x, 'apple')

In [27]:
str_view(x, '^apple$')

## __Character Classes and Alternatives__

There are a number of special patterns that match more than one
character. You’ve already seen `.`, which matches any character apart
from a newline. There are four other useful tools:

* `\d` matches any digit.
* `\s` matches any whitespace (e.g., space, tab, newline).
* `[abc]` matches a, b, or c.
* `[^abc]` matches anything except a, b, or c.

You can use alternation to pick between one or more alternative patterns.
For example, `abc|d..f` will match either `"abc"` or `"deaf"`.
Note that the precedence for `|` is low, so that `abc|xyz` matches `abc`
or `xyz`, not `abcyz` or `abxyz`. As with mathematical expressions, if
precedence ever gets confusing, use parentheses to make it clear
what you want:

In [87]:
str_view(c('grey', 'gray'), 'gr(e|a)y')
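
The four character-class tools listed above have no cell of their own, so here is a rough sketch on made-up strings. Note that `\d` and `\s` need a doubled backslash inside an R string, for the reason given in the escaping discussion earlier:

```r
library(stringr)

# hypothetical test strings
x <- c('a1', 'b c', 'xyz')

has_digit <- str_detect(x, '\\d')      # any digit
has_space <- str_detect(x, '\\s')      # any whitespace
has_abc   <- str_detect(x, '[abc]')    # a, b, or c
non_abc   <- str_extract(x, '[^abc]')  # first character that is not a, b, or c

# alternation picks the whole of abc or the whole of d..f
alt <- str_extract(c('abc', 'deaf', 'abf'), 'abc|d..f')
alt  # "abc" "deaf" NA
```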

Create regular expressions to find all words that:

* Start with a vowel.
* Only contain consonants. (Hint: think about matching “not”-vowels.)
* End with `ed`, but not with `eed`.
* End with `ing` or `ize`.

In [95]:
x <- c('hola', 'adios', 'queso', 'amor amor', 'ether', 'thneed', 'studied')
z <- c('engineering', 'eating', 'cheese', 'categorize', 'hypnotize', 'hello')

In [96]:
str_view(x, '^[aeiou]')

In [97]:
str_view(x, '^[^aeiou]+$')

In [105]:
str_view(x, '[^e]ed$')

In [106]:
str_view(z, '(ing|ize)$')

### __Repetition__

The next step up in power involves controlling how many times a
pattern matches:

* `?`: 0 or 1
* `+`: 1 or more
* `*`: 0 or more

In [107]:
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, 'CC?')

In [108]:
str_view(x, 'CC+')

In [109]:
str_view(x, 'C[LX]+')

In [110]:
str_view(x, '[CLX]+')

You can also specify the number of matches precisely:

* `{n}`: exactly n
* `{n,}`: n or more
* `{0,m}`: at most m
* `{n,m}`: between n and m

In [111]:
str_view(x, 'C{2}')

In [112]:
str_view(x, 'C{2,}')

In [113]:
str_view(x, 'C{2,3}')

By default these matches are “greedy”: they will match the longest
string possible. You can make them “lazy,” matching the shortest
string possible, by putting a `?` after them. This is an advanced feature
of regular expressions, but it’s useful to know that it exists:

In [114]:
str_view(x, 'C{2,3}?')

In [115]:
str_view(x, 'C[LX]+?')
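
Since `str_view()` output is not visible in these notes, a sketch with `str_extract()` makes the greedy/lazy difference concrete. It uses only the numeral from the sentence above; the variable names are illustrative:

```r
library(stringr)

numeral <- 'MDCCCLXXXVIII'

# greedy: take as many repetitions as the quantifier allows
greedy_c  <- str_extract(numeral, 'C{2,3}')   # "CCC"
greedy_lx <- str_extract(numeral, 'C[LX]+')   # "CLXXX"

# lazy: stop as soon as the minimum is satisfied
lazy_c  <- str_extract(numeral, 'C{2,3}?')    # "CC"
lazy_lx <- str_extract(numeral, 'C[LX]+?')    # "CL"
```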

Write a regular expression that matches a word if it’s probably written in British English, not American English.

In the general case, this is hard, and could require a dictionary. But, there are a few heuristics to consider that would account for some common cases: British English tends to use the following:

* “ou” instead of “o”
* use of “ae” and “oe” instead of “a” and “o”
* ends in ise instead of ize
* ends in yse
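
One possible way to turn these heuristics into a single pattern is sketched below. This is only a rough approximation built from the bullet points; the word list is made up, and the last entry shows that ordinary words such as "house" are flagged too:

```r
library(stringr)

# heuristic only: each alternative follows one of the bullets above
british <- 'ou|ae|oe|ise$|yse$'

# hypothetical test words
candidates <- c('colour', 'color', 'analyse', 'analyze', 'organise', 'house')
flags <- str_detect(candidates, british)
flags  # TRUE FALSE TRUE FALSE TRUE TRUE
```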


## __Grouping and Backreferences__

Earlier, you learned about parentheses as a way to disambiguate
complex expressions. They also define “groups” that you can refer to
with backreferences, like `\1`, `\2`, etc. For example, the following
regular expression finds all fruits that have a repeated pair of letters:

In [116]:
str_view(fruit, '(..)\\1', match = TRUE)

Construct regular expressions to match words that:

* Start and end with the same character.
* Contain a repeated pair of letters (e.g., “church” contains “ch”
repeated twice).
* Contain one letter repeated in at least three places (e.g.,
“eleven” contains three “e”s).

In [128]:
x <- c('aura', 'ethernet', 'meem', 'strings', 'america', 'example', 'ejempli')
str_view(x, '^(.)(.*)(\\1)$')

In [134]:
str_view(fruit, '([A-Za-z][A-Za-z]).*\\1', match = TRUE)
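
The two cells above cover the first two parts of the exercise. For the third part, one possible sketch is to capture a single letter and require it twice more anywhere later in the word. It assumes lower-case input, and the test words are made up:

```r
library(stringr)

# hypothetical test words
x <- c('eleven', 'church', 'apple', 'banana')

# one captured letter, then the same letter again, then once more
three_times <- '([a-z]).*\\1.*\\1'
repeated <- str_detect(x, three_times)
repeated  # TRUE FALSE FALSE TRUE
```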

## __Detect Matches__

To determine if a character vector matches a pattern, use
`str_detect()`. It returns a logical vector the same length as the
input:

In [135]:
x <- c('apple', 'banana', 'pear')

In [136]:
str_detect(x, 'e')

Remember that when you use a logical vector in a numeric context,
`FALSE` becomes `0` and `TRUE` becomes `1`. That makes `sum()` and
`mean()` useful if you want to answer questions about matches across
a larger vector:

In [137]:
# how many common words start with t?
sum(str_detect(words, '^t'))

In [138]:
# what proportion of common words end with a vowel?
mean(str_detect(words, '[aeiou]$'))

In [139]:
mean(str_detect(words, '[^aeiou]$'))

When you have complex logical conditions (e.g., match a or b but
not c unless d), it’s often easier to combine multiple `str_detect()`
calls with logical operators, rather than trying to create a single
regular expression. For example, here are two ways to find all words
that don’t contain any vowels:

In [140]:
# find all words containing at least one vowel,
# and negate
no_vowels_1 <- !str_detect(words, '[aeiou]')
# find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, '^[^aeiou]+$')

identical(no_vowels_1, no_vowels_2)

The results are identical, but I think the first approach is significantly
easier to understand. If your regular expression gets overly
complicated, try breaking it up into smaller pieces, giving each piece
a name, and then combining the pieces with logical operations.
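
A small sketch of that advice, with a made-up vector and condition (words that start with a vowel but do not end in "ed"). Each piece gets its own name before the pieces are combined:

```r
library(stringr)

# hypothetical input
x <- c('apple', 'added', 'banana', 'ended', 'ice')

# one simple, named test per condition
starts_vowel <- str_detect(x, '^[aeiou]')
ends_ed      <- str_detect(x, 'ed$')

# combine with ordinary logical operators
picked <- x[starts_vowel & !ends_ed]
picked  # "apple" "ice"
```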

A common use of `str_detect()` is to select the elements that match
a pattern. You can do this with logical subsetting, or the convenient
`str_subset()` wrapper:

In [143]:
words[str_detect(words, 'x$')]

In [144]:
str_subset(words, 'x$')

Typically, however, your strings will be one column of a data frame,
and you’ll want to use `filter()` instead:

In [145]:
df <- tibble(
    word = words,
    i = seq_along(word)
)

In [146]:
df %>%
    filter(str_detect(word, 'x$'))

| word | i |
|---|---|
| `<chr>` | `<int>` |
| box | 108 |
| sex | 747 |
| six | 772 |
| tax | 841 |


A variation on `str_detect()` is `str_count()` : rather than a simple
yes or no, it tells you how many matches there are in a string:

In [147]:
x <- c('apple', 'banana', 'pear')
str_count(x, 'a')

In [149]:
# on average, how many vowels per word?
mean(str_count(words, '[aeiou]'))

In [150]:
# it's natural to use str_count() with mutate()
df %>%
    mutate(vowels = str_count(word, '[aeiou]'),
           consonants = str_count(word, '[^aeiou]'))

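| word | i | vowels | consonants |
|---|---|---|---|
| `<chr>` | `<int>` | `<int>` | `<int>` |
| a | 1 | 1 | 0 |
| able | 2 | 2 | 2 |
| about | 3 | 3 | 2 |
| ⋮ | ⋮ | ⋮ | ⋮ |
| yet | 978 | 1 | 2 |
| you | 979 | 2 | 1 |
| young | 980 | 2 | 3 |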


Note that matches never overlap. For example, in `"abababa"`, how
many times will the pattern `"aba"` match? Regular expressions say
two, not three:

In [151]:
str_count('abababa', 'aba')

In [152]:
str_view_all('abababa', 'aba')

## __Extract Matches__

In [153]:
length(sentences)

In [154]:
head(sentences)

Imagine we want to find all sentences that contain a color. We first
create a vector of color names, and then turn it into a single regular
expression:

In [155]:
colors <- c('red', 'orange', 'yellow', 'green', 'blue', 'purple')
color_match <- str_c(colors, collapse = '|')
color_match

In [156]:
# now we can select sentences that contain a color,
# and then extract the color to figure out which one it is
has_color <- str_subset(sentences, color_match)
matches <- str_extract(has_color, color_match)

head(matches)

Note that `str_extract()` only extracts the first match. We can see
that most easily by first selecting all the sentences that have more
than one match:

In [157]:
more <- sentences[str_count(sentences, color_match) > 1]
str_view_all(more, color_match)

In [158]:
str_extract(more, color_match)

This is a common pattern for stringr functions, because working
with a single match allows you to use much simpler data structures.
To get all matches, use `str_extract_all()`. It returns a list:

In [159]:
str_extract_all(more, color_match)

If you use `simplify = TRUE`, `str_extract_all()` will return a
matrix with short matches expanded to the same length as the
longest:

In [160]:
str_extract_all(more, color_match, simplify = TRUE)

| 0 | 1 |
|---|---|
| blue | red |
| green | red |
| orange | red |


In [161]:
x <- c('a', 'a b', 'a b c')
str_extract_all(x, '[a-z]', simplify = TRUE)

| 0 | 1 | 2 |
|---|---|---|
| a |  |  |
| a | b |  |
| a | b | c |


From the Harvard sentences data, extract:

* The first word from each sentence.
* All words ending in `ing`.
* All plurals.

In [162]:
str_extract(sentences, "[A-Za-z]+")

In [163]:
pattern <- "\\b[A-Za-z]+ing\\b"
sentences_with_ing <- str_detect(sentences, pattern)
unique(unlist(str_extract_all(sentences[sentences_with_ing], pattern))) %>%
  head()

In [164]:
unique(unlist(str_extract_all(sentences, "\\b[A-Za-z]{3,}s\\b"))) %>%
  head()

## __Grouped Matches__

Earlier in this chapter we talked about the use of parentheses for
clarifying precedence and for backreferences when matching. You
can also use parentheses to extract parts of a complex match. For
example, imagine we want to extract nouns from the sentences. As a
heuristic, we’ll look for any word that comes after “a” or “the”.
Defining a “word” in a regular expression is a little tricky, so here
I use a simple approximation, a sequence of at least one character
that isn’t a space:

In [165]:
noun <- '(a|the) ([^ ]+)'

In [166]:
has_noun <- sentences %>%
    str_subset(noun) %>%
    head(10)

In [167]:
has_noun %>%
    str_extract(noun)

`str_extract()` gives us the complete match; `str_match()` gives
each individual component. Instead of a character vector, it returns
a matrix, with one column for the complete match followed by one
column for each group:

In [168]:
has_noun %>%
    str_match(noun)

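| 0 | 1 | 2 |
|---|---|---|
| the smooth | the | smooth |
| the sheet | the | sheet |
| the depth | the | depth |
| a chicken | a | chicken |
| the parked | the | parked |
| the sun | the | sun |
| the huge | the | huge |
| the ball | the | ball |
| the woman | the | woman |
| a helps | a | helps |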


If your data is in a tibble, it’s often easier to use `tidyr::extract()`.
It works like `str_match()` but requires you to name the matches,
which are then placed in new columns:

In [169]:
tibble(sentence = sentences) %>%
    extract(sentence, c('article', 'noun'), '(a|the) ([^ ]+)',
            remove = FALSE)

| sentence | article | noun |
|---|---|---|
| `<chr>` | `<chr>` | `<chr>` |
| The birch canoe slid on the smooth planks. | the | smooth |
| Glue the sheet to the dark blue background. | the | sheet |
| It's easy to tell the depth of a well. | the | depth |
| ⋮ | ⋮ | ⋮ |
| A severe storm tore down the barn. | the | barn. |
| She called his name many times. |  |  |
| When you hear the bell, come quickly. | the | bell, |


## __Replacing Matches__

`str_replace()` and `str_replace_all()` allow you to replace
matches with new strings. The simplest use is to replace a pattern
with a fixed string:

In [170]:
x <- c('apple', 'pear', 'banana')
str_replace(x, '[aeiou]', '-')

In [171]:
str_replace_all(x, '[aeiou]', '-')

With `str_replace_all()` you can perform multiple replacements
by supplying a named vector:

In [172]:
x <- c('1 house', '2 cars', '3 people')
str_replace_all(x, c('1' = 'one', '2' = 'two', '3' = 'three'))

Instead of replacing with a fixed string you can use backreferences
to insert components of the match. In the following code, I flip the
order of the second and third words:

In [173]:
sentences %>%
    str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
    head(5)

## __Splitting__

Use `str_split()` to split a string up into pieces. For example, we
could split sentences into words:

In [174]:
sentences %>%
    head(5) %>%
    str_split(' ')

Because each component might contain a different number of
pieces, this returns a list. If you’re working with a length-1 vector,
the easiest thing is to just extract the first element of the list:

In [175]:
'a|b|c|d' %>%
    str_split('\\|') %>%
    .[[1]]

In [176]:
# or return a matrix
sentences %>%
    head(5) %>%
    str_split(' ', simplify = TRUE)

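| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| The | birch | canoe | slid | on | the | smooth | planks. |  |
| Glue | the | sheet | to | the | dark | blue | background. |  |
| It's | easy | to | tell | the | depth | of | a | well. |
| These | days | a | chicken | leg | is | a | rare | dish. |
| Rice | is | often | served | in | round | bowls. |  |  |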


In [177]:
# request a maximum number of pieces
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>%
    str_split(': ', n = 2, simplify = TRUE)

| 0 | 1 |
|---|---|
| Name | Hadley |
| Country | NZ |
| Age | 35 |


Instead of splitting up strings by patterns, you can also split up by
character, line, sentence, and word `boundary()`s:

In [178]:
x <- "This is a sentence. This is another sentence."
str_view_all(x, boundary('word'))

In [179]:
str_split(x, ' ')[[1]]

In [180]:
str_split(x, boundary('word'))[[1]]

## __Find Matches__

`str_locate()` and `str_locate_all()` give you the starting and
ending positions of each match. These are particularly useful when
none of the other functions does exactly what you want. You can use
`str_locate()` to find the matching pattern, and `str_sub()` to
extract and/or modify it.
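
A minimal sketch of that locate-then-subset workflow, with an example vector and pattern chosen for illustration. `str_locate()` returns an integer matrix with `start` and `end` columns, and `NA` where nothing matches:

```r
library(stringr)

x <- c('apple', 'banana', 'pear')

# first match of 'an' in each string; a row of NA means no match
loc <- str_locate(x, 'an')
loc

# feed the positions back into str_sub() to pull out the matched text
found <- str_sub(x, loc[, 'start'], loc[, 'end'])
found

# str_locate_all() returns a list with one matrix per input string
all_loc <- str_locate_all('abababa', 'aba')[[1]]
nrow(all_loc)  # 2, since matches never overlap
```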

## __Other Types of Pattern__

When you use a pattern that’s a string, it’s automatically wrapped
into a call to `regex()`:

In [181]:
# the regular call
str_view(fruit, 'nana')

In [183]:
# is shorthand for
str_view(fruit, regex('nana'))

`ignore_case = TRUE` allows characters to match either their
uppercase or lowercase forms. This always uses the current
locale:

In [184]:
bananas <- c('banana', 'Banana', 'BANANA')
str_view(bananas, 'banana')

In [185]:
str_view(bananas, regex('banana', ignore_case = TRUE))

`multiline = TRUE` allows `^` and `$` to match the start and end of
each line rather than the start and end of the complete string:

In [186]:
x <- "Line 1\nLine 2\nLine 3"

In [187]:
str_extract_all(x, "^Line")[[1]]

In [189]:
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]

`comments = TRUE` allows you to use comments and white space
to make complex regular expressions more understandable.
Spaces are ignored, as is everything after `#`. To match a literal
space, you’ll need to escape it: `"\\ "`.

In [190]:
phone <- regex('
    \\(?     # optional opening parens
    (\\d{3}) # area code
    [)- ]?   # optional closing parens, dash, or space
    (\\d{3}) # another three numbers
    [ -]?    # optional space or dash
    (\\d{3}) # three more numbers
', comments = TRUE)

In [191]:
str_match('514-791-8141', phone)

| 0 | 1 | 2 | 3 |
|---|---|---|---|
| 514-791-814 | 514 | 791 | 814 |


`dotall = TRUE` allows `.` to match everything, including `\n`.
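
A short sketch of the effect, using a made-up two-line string:

```r
library(stringr)

# hypothetical input containing a newline between a and b
x <- "a\nb"

# by default . does not match the newline
default_hit <- str_detect(x, 'a.b')
default_hit  # FALSE

# with dotall = TRUE it does
dotall_hit <- str_detect(x, regex('a.b', dotall = TRUE))
dotall_hit  # TRUE
```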