In this notebook, we will cover topics related to regular expressions:

* [Basic matches](#Basic-matches)
* [Anchors](#Anchors)
* [Character classes](#Character-classes)
* [Alternatives](#Alternatives)

# Basic matches

In [1]:
library(tidyverse)
library(stringr)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.8
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



Last time we looked at the `str_c` function, which concatenates strings (and vectors of strings). The opposite of concatenating is splitting, a job done by the `str_split` function.

In [2]:
fruits <- c("an apple, a banana, a cherry", "a kiwifruit, a lemon, a mango")
(after_split <- str_split(fruits, ", "))

In [3]:
# str_split returns a list of vectors, one vector for each string you're splitting
# here we're just splitting one string so we get a list with one element that we extract using [[1]]
(after_split <- str_split("an apple, a banana, a cherry", ", ")[[1]])
# str_c will undo the split
str_c(after_split, collapse=", ")
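
Because `str_split` returns a list, it can help to inspect the shape of the result. The sketch below reuses the `fruits` vector from above; the names `pieces` and `piece_matrix` are made up for illustration. It also uses the `simplify = TRUE` argument of `str_split`, which returns a character matrix instead of a list.

```r
library(stringr)

fruits <- c("an apple, a banana, a cherry", "a kiwifruit, a lemon, a mango")

# Default: a list with one character vector per input string
pieces <- str_split(fruits, ", ")
lengths(pieces)   # number of pieces per input string

# simplify = TRUE: a character matrix with one row per input string
piece_matrix <- str_split(fruits, ", ", simplify = TRUE)
dim(piece_matrix)
```

The matrix form is convenient when every string splits into the same number of pieces; otherwise the list form avoids padding with empty strings.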

In [4]:
str_split('this/is/a/unix/style/file/path','/')

In [5]:
path <- 'this\\is\\a\\windows\\style\\file\\path'
writeLines(path)

this\is\a\windows\style\file\path


In [6]:
str_split(path, '\\')

ERROR: ignored

Why does the command not work? If you look up the documentation of `str_split()`, you will find that the pattern given for splitting is interpreted as a **regular expression**. A regular expression is a way to describe a systematic pattern in strings. The simplest pattern is simply a given, fixed string.

In [7]:
x <- c("statistics", "data science", "machine learning")
str_detect(x, "at") # search for the pattern 'at'

In [8]:
str_extract(x, "at") 

In [9]:
str_detect(x, ".i.") # . matches any single character except newline

In [10]:
str_extract(x, ".i.")

In [11]:
str_extract_all(x, ".i.") # extracts all matches, not just the first one

In [12]:
mystring = "A . string . with . periods ..."

In [13]:
str_detect(mystring, ".") # TRUE, but not because of the periods: an unescaped . matches any character

In [14]:
str_detect(mystring, "\.") # why does this not work?

ERROR: ignored

In [15]:
str_detect(mystring, "\\.") # why do we need double backslashes?

In [16]:
str_extract_all(mystring, "\\.")
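
One way to see why two backslashes are needed is to separate the two layers of escaping: R first turns the string literal into characters, and only then does the regular expression engine read those characters. The following is a minimal sketch of that idea; the variable name `pattern` is chosen here for illustration.

```r
library(stringr)

pattern <- "\\."      # R string literal: an escaped backslash, then a period
writeLines(pattern)   # prints the characters the regex engine actually sees
nchar(pattern)        # 2 characters: a backslash and a period

mystring <- "A . string . with . periods ..."
str_count(mystring, "\\.")   # counts literal periods only
str_count(mystring, ".")     # counts every character, since . is a wildcard
```

`writeLines()` is a handy check whenever a pattern misbehaves, because it shows the regular expression after R has processed the string escapes.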

In [17]:
mystring2 <- "A \\ string \\ with \\ backslashes \\ \\ \\"
writeLines(mystring2)

A \ string \ with \ backslashes \ \ \


In [18]:
str_extract_all(mystring2, "\\") # doesn't work

ERROR: ignored

In [19]:
str_extract_all(mystring2, "\\\\") # Gosh -- need 4 backslashes to match a single backslash!!!
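
If the notebook runs on R 4.0.0 or later (an assumption, since the R version is not shown here), raw string literals of the form `r"(...)"` remove one of the two escaping layers. The sketch below compares the ordinary and raw spellings of the same pattern; `escaped` and `raw` are illustrative names.

```r
library(stringr)

mystring2 <- "A \\ string \\ with \\ backslashes \\ \\ \\"

escaped <- "\\\\"     # ordinary literal: 4 typed backslashes, 2 stored
raw     <- r"(\\)"    # raw literal (R >= 4.0.0): 2 typed, 2 stored

identical(escaped, raw)     # both hold the same two-character regex
nchar(escaped)
str_count(mystring2, raw)   # number of literal backslashes matched
```

The regular expression still needs two backslashes to match one; the raw literal only spares the extra doubling required by R's string syntax.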

Now we know how to fix the problem with the `str_split()` code above.

In [20]:
writeLines(path)
str_split(path, '\\\\')

this\is\a\windows\style\file\path
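
When the delimiter is a plain string rather than a real pattern, another option is to wrap it in stringr's `fixed()`, which asks for a literal match instead of a regular expression. This is a sketch of that alternative, not a replacement for the approach above; `parts_fixed` and `parts_regex` are names introduced here.

```r
library(stringr)

path <- "this\\is\\a\\windows\\style\\file\\path"

# fixed() treats the pattern as a literal string, not a regex,
# so only the R string-level escaping is needed
parts_fixed <- str_split(path, fixed("\\"))[[1]]
parts_regex <- str_split(path, "\\\\")[[1]]

identical(parts_fixed, parts_regex)
parts_fixed
```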


In [21]:
x <- c("a.b.c.d", ".a.b.", "a.b.", ".1.1.1", ". . . ", "...abc")

We want to match the pattern:

`<period> <any char> <period> <any char> <period> <any char>`

The regular expression for this is:

`\..\..\..`

So the string representing the regular expression is:

`"\\..\\..\\.."`

In [22]:
re <- "\\..\\..\\.."
str_detect(x, re)
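
To see which part of each string the pattern matched, `str_extract` can be applied with the same regular expression. The sketch below repeats `x` and `re` so that it runs on its own; `matched` is an illustrative name.

```r
library(stringr)

x <- c("a.b.c.d", ".a.b.", "a.b.", ".1.1.1", ". . . ", "...abc")
re <- "\\..\\..\\.."

writeLines(re)                 # the regex as the engine sees it: \..\..\..
matched <- str_extract(x, re)  # NA where the pattern does not occur
matched
```

Note that `"...abc"` does not match even though it contains three periods: the pattern needs a period in every other position, and the periods here are adjacent.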

# Anchors

In [23]:
x <- c("statistics", "data science", "machine learning")
str_detect(x, "s")

In [24]:
str_detect(x, "^s") # ^ matches the beginning of a string

In [25]:
str_detect(x, "s$") # $ matches the ending of a string
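
The two anchors can also be combined to require that the pattern spans the entire string. The vector `words` below is made up for this illustration.

```r
library(stringr)

words <- c("apple", "apple pie", "pineapple")

contains <- str_detect(words, "apple")     # substring match anywhere
starts   <- str_detect(words, "^apple")    # starts with "apple"
ends     <- str_detect(words, "apple$")    # ends with "apple"
exact    <- str_detect(words, "^apple$")   # the whole string is "apple"

data.frame(words, contains, starts, ends, exact)
```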

In [26]:
y <- c("earn $$$", "cost is $10", "nothing to do with money!")
str_detect(y, "$") # not the right way to match a literal $ sign: $ is an anchor, so every string matches

In [27]:
str_detect(y, "\\$")

In [28]:
str_extract_all(y, "\\$")

# Character classes

These special patterns match characters in a class:

* `\d`: matches any digit.
* `\s`: matches any whitespace (e.g. space, tab, newline).
* `[abc]`: matches a, b, or c.
* `[^abc]`: matches anything *except* a, b, or c.

In [29]:
x <- c("1", "one", "23", "twothree", "r2d2")
str_detect(x, "\\d") # why double backslash?

In [30]:
x <- c("one", "two", "three", "four", "five")
vowel_re <- "[aeiou]"
str_extract_all(x, vowel_re)
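
The cells above exercise `\d` and `[abc]`. The other two classes from the list, `\s` and `[^abc]`, can be tried the same way; the example strings and variable names below are invented for this sketch.

```r
library(stringr)

s <- c("no_spaces", "has space", "tab\there")   # "\t" is a tab character
has_ws <- str_detect(s, "\\s")                  # any whitespace present?
has_ws

# [^aeiou] matches any single character that is not a lowercase vowel
non_vowels <- str_extract_all("three", "[^aeiou]")[[1]]
non_vowels
```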

# Alternatives

Alternative patterns can be matched using `|`.

In [31]:
color_re = "colo(r|ur)"
x <- c("color", "red colour", "coloured glass", "chair", "colored chair")
str_detect(x, color_re)
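
The parentheses in `colo(r|ur)` matter because `|` otherwise applies to everything on either side of it. A small sketch of the difference, using test strings chosen for illustration:

```r
library(stringr)

x <- c("color", "colour", "fur", "hour")

grouped   <- str_detect(x, "colo(r|ur)")  # "colo" followed by "r" or "ur"
ungrouped <- str_detect(x, "color|ur")    # "color", or else just "ur"

data.frame(x, grouped, ungrouped)
```

Without the group, the pattern reads as "`color` or `ur`", so any string containing `ur` matches.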

Suppose we want to match telephone numbers of the form:

* xxx-xxx-xxxx
* (xxx) xxx-xxxx

In [32]:
phone_re = "(\\d\\d\\d-|\\(\\d\\d\\d\\) )\\d\\d\\d-\\d\\d\\d\\d" # complicated because of all the double backslashes

In [33]:
n <- c("123-456-7890", "(123) 456-7890", "1234567890", "+1-123-456-7890")
str_detect(n, phone_re)
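
Since `str_detect` looks for the pattern anywhere in the string, the last number is detected through its trailing `123-456-7890`. If the intent is for the entire string to be a phone number (an assumption about the goal, not something stated above), the anchors from earlier can be added. `anchored_re` is a name introduced for this sketch.

```r
library(stringr)

phone_re <- "(\\d\\d\\d-|\\(\\d\\d\\d\\) )\\d\\d\\d-\\d\\d\\d\\d"
anchored_re <- str_c("^", phone_re, "$")   # require the whole string to match

n <- c("123-456-7890", "(123) 456-7890", "1234567890", "+1-123-456-7890")
unanchored <- str_detect(n, phone_re)
anchored   <- str_detect(n, anchored_re)

data.frame(n, unanchored, anchored)
```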