In this notebook, we will cover topics related to regular expressions:

* [Basic matches](#Basic-matches)
* [Anchors](#Anchors)
* [Character classes](#Character-classes)
* [Alternatives](#Alternatives)

# Basic matches

In [1]:
library(tidyverse)
library(stringr)

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats


In [2]:
x <- c("statistics", "data science", "machine learning")
str_view(x, "at") # search for the pattern 'at'

In [3]:
str_view(x, ".i.") # . matches any single character except newline

In [4]:
str_view_all(x, ".i.") # shows all matches, not just the first one
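
The highlighted matches can also be inspected as plain values. The following is a minimal sketch, assuming the stringr helpers `str_extract()`, `str_extract_all()`, and `str_count()`, which are not used elsewhere in this notebook. As far as I know, recent stringr releases (1.5.0 and later) make `str_view()` highlight every match and deprecate `str_view_all()`, so the widgets may look different depending on the installed version.

```r
library(stringr)

x <- c("statistics", "data science", "machine learning")

# First match of ".i." in each string
first_match <- str_extract(x, ".i.")
print(first_match)

# Every non-overlapping match of ".i." in each string (returns a list)
all_matches <- str_extract_all(x, ".i.")
print(all_matches)

# Number of non-overlapping matches per string
n_matches <- str_count(x, ".i.")
print(n_matches)
```

Matches do not overlap: after "tis" is found in "statistics", the search resumes at the next character, which is why the second match is "tic".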

In [5]:
mystring <- "A . string . with . periods ..."

In [6]:
str_view(mystring, ".") # matches the first character ("A"), not a period: . means "any character" in a regexp

In [7]:
str_view(mystring, "\\.") # why do we need double backslashes?

In [8]:
str_view_all(mystring, "\\.")
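
One way to see why the backslash is doubled: R processes escapes in the string literal first, so `"\\."` is a two-character string (a backslash followed by a period), and that two-character text is what the regexp engine receives. The sketch below checks this with `writeLines()` and `nchar()`, and compares `.` with `\.` using `str_extract()` and `str_count()` (illustrative choices, not part of the original cells).

```r
library(stringr)

mystring <- "A . string . with . periods ..."

# The string literal "\\." holds two characters: a backslash and a period
writeLines("\\.")
n_chars <- nchar("\\.")

# "." matches any character, so the first match is the first character
any_char <- str_extract(mystring, ".")

# "\\." matches a literal period
literal <- str_extract(mystring, "\\.")

# "." matches once per character; "\\." counts only the periods
n_any <- str_count(mystring, ".")
n_periods <- str_count(mystring, "\\.")
print(c(n_any, n_periods))
```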

In [9]:
mystring2 <- "A \\ string \\ with \\ backslashes \\ \\ \\"
writeLines(mystring2)

A \ string \ with \ backslashes \ \ \


In [10]:
str_view_all(mystring2, "\\") # fails: the regexp is a lone backslash, which is an incomplete escape

ERROR: Error in stri_locate_all_regex(string, pattern, omit_no_match = TRUE, : Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE)


In [11]:
str_view_all(mystring2, "\\\\") # Gosh -- need 4 backslashes to match a single backslash!!!
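
The four backslashes come from two layers of escaping: the R string literal `"\\\\"` holds two backslash characters, and the regexp `\\` then means one literal backslash. A rough sketch of that reasoning follows. The use of `fixed()` at the end is an addition for comparison, not something the notebook relies on; it asks stringr to treat the pattern as literal text and skip the regexp layer.

```r
library(stringr)

mystring2 <- "A \\ string \\ with \\ backslashes \\ \\ \\"

pattern <- "\\\\"

# The regexp engine receives two characters: \\
writeLines(pattern)
pattern_length <- nchar(pattern)

# Count the literal backslashes in mystring2 using the regexp
n_backslashes <- str_count(mystring2, pattern)

# With fixed(), a single-backslash string is enough: no regexp escaping
n_fixed <- str_count(mystring2, fixed("\\"))
print(c(n_backslashes, n_fixed))
```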

In [12]:
x <- c("a.b.c.d", ".a.b.", "a.b.", ".1.1.1", ". . . ", "...abc")

We want to match the pattern:

`<period> <any char> <period> <any char> <period> <any char>`

The regular expression for this is:

`\..\..\..`

So the string representing the regular expression is:

`"\\..\\..\\.."`

In [13]:
re <- "\\..\\..\\.."
str_view(x, re)
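
To see which strings contain the pattern without reading the widget, a small sketch with `str_detect()` (assumed here for illustration) returns one logical value per string:

```r
library(stringr)

x <- c("a.b.c.d", ".a.b.", "a.b.", ".1.1.1", ". . . ", "...abc")
re <- "\\..\\..\\.."

# TRUE where the string contains <period><any><period><any><period><any>
has_pattern <- str_detect(x, re)
print(setNames(has_pattern, x))
```

The strings ".a.b." and "a.b." are too short to supply a character after the third period, and "...abc" never has a period in the third and fifth positions of any candidate match.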

# Anchors

In [14]:
x <- c("statistics", "data science", "machine learning")
str_view(x, "s")

In [15]:
str_view(x, "^s") # ^ matches the beginning of a string

In [16]:
str_view(x, "s$") # $ matches the ending of a string

In [17]:
y <- c("earn $$$", "cost is $10", "nothing to do with money!")
str_view(y, "$") # not the right way to match a literal $ sign

In [18]:
str_view(y, "\\$")
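
The anchor examples can be summarised as logical vectors and counts. This is a minimal sketch using `str_detect()` and `str_count()`, which are assumptions of this example rather than functions used in the cells above.

```r
library(stringr)

x <- c("statistics", "data science", "machine learning")

# ^s: the string starts with "s"; s$: the string ends with "s"
starts_with_s <- str_detect(x, "^s")
ends_with_s <- str_detect(x, "s$")
print(rbind(starts_with_s, ends_with_s))

y <- c("earn $$$", "cost is $10", "nothing to do with money!")

# An escaped $ matches the literal dollar sign
dollar_counts <- str_count(y, "\\$")
print(dollar_counts)
```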

# Character classes

These special patterns match characters in a class:

* `\d`: matches any digit.
* `\s`: matches any whitespace (e.g. space, tab, newline).
* `[abc]`: matches a, b, or c.
* `[^abc]`: matches anything *except* a, b, or c.

In [19]:
x <- c("1", "one", "23", "twothree", "r2d2")
str_view(x, "\\d") # why double backslash?

In [20]:
x <- c("one", "two", "three", "four", "five")
vowel_re <- "[aeiou]"
str_view(x, vowel_re)
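
The four classes listed above can be exercised side by side. The sketch below is illustrative only: the helper functions `str_count()`, `str_detect()`, and `str_extract()` and the variable names are choices made for this example.

```r
library(stringr)

x <- c("1", "one", "23", "twothree", "r2d2")

# \d: number of digits in each string
digit_counts <- str_count(x, "\\d")
print(digit_counts)

# \s: does the string contain any whitespace?
has_space <- str_detect(c("statistics", "data science"), "\\s")
print(has_space)

words <- c("one", "two", "three", "four", "five")

# [aeiou]: first vowel; [^aeiou]: first character that is not a vowel
first_vowel <- str_extract(words, "[aeiou]")
first_non_vowel <- str_extract(words, "[^aeiou]")
print(rbind(first_vowel, first_non_vowel))
```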

# Alternatives

Alternative patterns can be matched using `|`.

In [21]:
color_re <- "colo(r|ur)"
x <- c("color", "red colour", "coloured glass", "chair", "colored chair")
str_view(x, color_re)
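
The parentheses limit how far the alternation reaches. As a hedged illustration (the string "hour" and the pattern `color|ur` are made up for this example), dropping them turns the pattern into "either `color` or `ur`":

```r
library(stringr)

color_re <- "colo(r|ur)"
x <- c("color", "red colour", "coloured glass", "chair", "colored chair")

# The matched text in each string, NA where there is no match
matched <- str_extract(x, color_re)
print(matched)

# Without parentheses, | splits the whole pattern: "color" or "ur"
loose <- str_extract("hour", "color|ur")
strict <- str_extract("hour", color_re)
print(c(loose, strict))
```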

Suppose we want to match telephone numbers of the form:

* xxx-xxx-xxxx
* (xxx) xxx-xxxx

In [22]:
phone_re <- "(\\d\\d\\d-|\\(\\d\\d\\d\\) )\\d\\d\\d-\\d\\d\\d\\d" # complicated because of all the double backslashes

In [23]:
n <- c("123-456-7890", "(123) 456-7890", "1234567890", "+1-123-456-7890")
str_view(n, phone_re)
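
Because the pattern is not anchored, it also matches a phone number embedded in a longer string, such as the last element of `n`. If only whole-string matches are wanted, one possible approach (a sketch, not necessarily what the notebook intends) is to wrap the pattern in the `^` and `$` anchors from the Anchors section. The name `strict_re` is hypothetical.

```r
library(stringr)

phone_re <- "(\\d\\d\\d-|\\(\\d\\d\\d\\) )\\d\\d\\d-\\d\\d\\d\\d"
n <- c("123-456-7890", "(123) 456-7890", "1234567890", "+1-123-456-7890")

# Unanchored: TRUE if a phone number appears anywhere in the string
found <- str_detect(n, phone_re)

# Anchored: TRUE only if the entire string is a phone number
strict_re <- paste0("^", phone_re, "$")
whole <- str_detect(n, strict_re)

print(rbind(found, whole))
```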