In [2]:
library(tidyverse)

There are four main families of functions in stringr:

* Character manipulation: these functions allow you to manipulate individual characters within the strings in character vectors.

* Whitespace tools to add, remove, and manipulate whitespace.

* Locale-sensitive operations, whose behaviour varies from locale to locale.

* Pattern matching functions. These recognise four engines of pattern description. The most common is regular expressions, but there are three other tools.

# Getting and setting individual characters

Get the length of a string with **`str_length()`**:

In [3]:
str_length('VN Pikachu')

In [5]:
players <- c('1234', '123456789')
str_length(players)

You can access individual characters using **`str_sub()`**.

It takes three arguments: a character vector, a start position and an end position. Each position can be either a positive integer, which counts from the left, or a negative integer, which counts from the right. The positions are inclusive and, if they fall outside the string, are silently truncated. Negative positions behave as in Python, but note that R positions are 1-based and the end position is included, whereas Python slices are 0-based and exclude the end.

In [22]:
s <- '12345'

In [23]:
str_sub(s, 3) #python: s[2:] (0-based)

In [20]:
str_sub(s, end = 3) #python s[:3] (0-based)

In [28]:
str_sub(s, 2, 4) #python s[1:4] (0-based)

In [27]:
str_sub(s, 2, -2) #python s[1:-1] (0-based)
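The truncation rule described above is easy to overlook, so here is a minimal sketch of what happens when a position falls outside the string. The variable names (`digits`, `past_end`, and so on) are only illustrative.

```r
library(stringr)

digits <- '12345'

# An end position past the string is silently truncated to the last character
past_end <- str_sub(digits, 4, 10)

# A negative start counts from the right: the last three characters
last_three <- str_sub(digits, -3)

# A start position past the end yields an empty string rather than an error
beyond <- str_sub(digits, 8, 10)

print(c(past_end, last_three, beyond))
```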

**`str_sub()`** is vectorised over the strings and over both positions:

In [12]:
#slice the first string from position 2 to 3 and the second string from position 3 to 5
str_sub(players, start = c(2, 3) , end = c(3, 5))

In [10]:
#a single value is recycled across the vector (like NumPy broadcasting)
str_sub(players, start = 1, end = c(3, 5))   #equivalent to start = c(1, 1)

In [11]:
str_sub(players, start = 3, end = 5) #equivalent to start = c(3, 3), end = c(5, 5)

You can also use **`str_sub()`** to modify strings:

In [29]:
s

In [31]:
str_sub(s, 2, 4) <- '-1'
s

In [32]:
#Vectorize
values <- c('00xx00', '11xx11')
str_sub(values, 3, 4) <- 'hidden information'

values

To duplicate individual strings, you can use **`str_dup()`**:

In [33]:
str_dup('01', 5)

In [34]:
#vectorize
str_dup(c('a', 'b'), c(5, 8))

# Whitespace

Three functions add, remove, or modify whitespace:

1. **`str_pad()`** pads a string to a fixed length by adding extra whitespace on the left, right, or both sides.

In [35]:
args(str_pad)

In [37]:
s <- '1'

In [40]:
#left padding with 0, the result will have width = 5
s %>% str_pad(width = 5, side = 'left', pad = '0')

In [42]:
#centre the string in a field of width = 5
s %>% str_pad(width = 5, side = 'both')

2. **`str_trim()`** removes leading and trailing whitespace:

In [43]:
args(str_trim)

In [44]:
value <- '  1   '

In [45]:
value %>% str_trim('left')

In [46]:
value %>% str_trim('right')

In [48]:
value %>% str_trim('both')

3. You can use **`str_wrap()`** to modify existing whitespace in order to wrap a paragraph of text, such that the length of each line is as similar as possible.

In [51]:
jabberwocky <- str_c(
  "`Twas brillig, and the slithy toves ",
  "did gyre and gimble in the wabe: ",
  "All mimsy were the borogoves, ",
  "and the mome raths outgrabe. "
)
cat(str_wrap(jabberwocky, width = 40))


`Twas brillig, and the slithy toves did
gyre and gimble in the wabe: All mimsy
were the borogoves, and the mome raths
outgrabe.
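As a small sketch of how `str_trim()` and `str_pad()` combine in practice, the following cleans up some untidy identifiers and pads them to a common width. The input vector `raw` is made up for illustration. Note that `str_pad()` never shortens a string, so values already longer than `width` pass through unchanged.

```r
library(stringr)

# Hypothetical identifiers with stray whitespace
raw <- c('  7 ', '42', ' 123456  ')

# Remove leading and trailing whitespace first
clean <- str_trim(raw)

# Then left-pad with zeros to a width of 5; '123456' is already longer
# than 5 characters and is returned as-is
padded <- str_pad(clean, width = 5, side = 'left', pad = '0')

print(padded)
```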

# Locale sensitive

A handful of stringr functions are locale-sensitive: they will perform differently in different regions of the world. These functions are case transformation functions:

In [52]:
str_to_upper('quactinh')

In [54]:
str_to_title('vn pikachu awesome')

In [57]:
str_to_lower('I like horses')
# Turkish has two sorts of i: with and without the dot
str_to_lower('I like horses', locale = "tr")

The locale always defaults to English to ensure that the default behaviour is identical across systems. Locales always include a two-letter ISO-639-1 language code (like “en” for English or “zh” for Chinese), and optionally an ISO-3166 country code (like “en_GB” vs “en_US”). You can see a complete list of available locales by running **`stringi::stri_locale_list()`**.
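The effect of the `locale` argument also shows up in the other direction. A minimal sketch, using the same Turkish example: upper-casing a dotted `i` and lower-casing an undotted `I` give different letters under English and Turkish rules. The variable names are illustrative only.

```r
library(stringr)

# Default (English) rules
upper_en <- str_to_upper('i')

# Turkish rules: dotted i upper-cases to dotted İ (U+0130),
# and undotted I lower-cases to dotless ı (U+0131)
upper_tr <- str_to_upper('i', locale = 'tr')
lower_tr <- str_to_lower('I', locale = 'tr')

print(c(upper_en, upper_tr, lower_tr))
```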

# Pattern Matching

The vast majority of stringr functions work with patterns. These are parameterised by the task they perform and the types of patterns they match.

### Tasks

Each pattern matching function has the same first two arguments, a character vector of strings to process and a single pattern to match. stringr provides pattern matching functions to:
* detect
* locate
* extract
* match
* replace
* split   

I’ll illustrate how they work with some short strings and regular expressions that match runs of digits:

**`str_detect()`** detects the presence or absence of a pattern and returns a logical vector

In [86]:
s <- 'a3233'
vec_str <- c('a3332 33323', 'afda', '22a3', '3232a3')

In [87]:
#Check if string contains at least 3 consecutive digits
s %>% str_detect('[0-9]{3,}')

**`str_subset()`** returns the elements of a character vector that match a regular expression

In [88]:
#Keep only the values that match the regex
vec_str %>% str_subset('[0-9]{3,}')
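`str_detect()` is vectorised as well, and `str_subset()` is essentially a shortcut for indexing with its result. The sketch below, which redefines `vec_str` so that it runs on its own, checks that the two approaches agree.

```r
library(stringr)

vec_str <- c('a3332 33323', 'afda', '22a3', '3232a3')
pattern <- '[0-9]{3,}'

# One logical value per element
has_run <- str_detect(vec_str, pattern)
print(has_run)

# Logical indexing and str_subset() select the same elements
by_index  <- vec_str[has_run]
by_subset <- str_subset(vec_str, pattern)
print(identical(by_index, by_subset))

# Because the result is logical, sum() counts the matching elements
print(sum(has_run))
```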

**`str_count()`** counts the number of matches:

In [90]:
#For each element, count the number of matches
vec_str %>% str_count('[0-9]{3,}')

**`str_locate()`** locates the first position of a pattern and returns a numeric matrix with columns start and end.<br> **`str_locate_all()`**  locates all matches, returning a list of numeric matrices

In [91]:
vec_str[1]

In [101]:
#return a matrix of 1 row
vec_str[1] %>% str_locate('\\d{3,}')

In [100]:
#return a matrix, if there are n matches, this matrix will have n rows
vec_str[1] %>% str_locate_all('\\d{3,}')

start,end
2,5
7,11


Both functions are vectorised over the input strings:

In [99]:
#return a matrix
vec_str %>% str_locate('\\d{3,}')

start,end
2,5
NA,NA
NA,NA
1,4


In [97]:
#Return a list, each element is a matrix
vec_str %>% str_locate_all('\\d{3,}')

start,end
2,5
7,11

start,end

start,end

start,end
1,4
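The positions returned by `str_locate()` can be fed straight back into `str_sub()`. As a rough sketch (with `vec_str` redefined so the snippet is self-contained), slicing with the `start` and `end` columns recovers the same text that `str_extract()` returns, with `NA` where nothing matched.

```r
library(stringr)

vec_str <- c('a3332 33323', 'afda', '22a3', '3232a3')

# Matrix with one row per string and columns 'start' and 'end'
loc <- str_locate(vec_str, '\\d{3,}')

# Slice each string at the located positions; NA positions give NA
via_locate <- str_sub(vec_str, loc[, 'start'], loc[, 'end'])

# Extracting the first match directly gives the same values
via_extract <- str_extract(vec_str, '\\d{3,}')

print(via_locate)
print(via_extract)
```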


**`str_extract()`** extracts text corresponding to the first match, returning a character vector.  
**`str_extract_all()`** extracts all matches and returns a list of character vectors.

In [102]:
vec_str[1]

In [104]:
vec_str[1] %>% str_extract('\\d{3,}')

In [107]:
vec_str[1] %>% str_extract_all('\\d{3,}')

In [110]:
#vectorize

#return a vector
vec_str %>% str_extract('\\d{3,}')

In [114]:
#return a list of character vectors
vec_str %>% str_extract_all('\\d{3,}')
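If a list is awkward to work with, `str_extract_all()` also accepts `simplify = TRUE`, which returns a character matrix instead, padding the shorter rows with empty strings. A minimal sketch, with `vec_str` redefined for self-containment:

```r
library(stringr)

vec_str <- c('a3332 33323', 'afda', '22a3', '3232a3')

# One row per input string, one column per match position;
# strings with fewer matches are padded with ""
m <- str_extract_all(vec_str, '\\d{3,}', simplify = TRUE)

print(dim(m))
print(m)
```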

**`str_match()`** extracts capture groups formed by `()` from the first match. It returns a character matrix with one column for the complete match and one column for each group.  
**`str_match_all()`** extracts capture groups from all matches and returns a list of character matrices. 

In [115]:
vec_str

In [118]:
vec_str %>% str_match('[a-z]+(\\d+)')

0,1
a3332,3332
NA,NA
a3,3
a3,3


In [119]:
vec_str %>% str_match_all('[a-z]+(\\d+)')

0,1
a3332,3332

0,1
a3,3

0,1
a3,3
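Capture groups become more useful when there are several of them. The following sketch parses made-up `key=value` strings (the `settings` vector is hypothetical): column 1 of the result is the complete match, and columns 2 and 3 hold the two groups.

```r
library(stringr)

# Hypothetical input: two well-formed settings and one that does not match
settings <- c('width=40', 'height=120', 'no value here')

# Group 1 captures the key, group 2 captures the numeric value
parts <- str_match(settings, '([a-z]+)=(\\d+)')

keys <- parts[, 2]
vals <- as.integer(parts[, 3])

print(parts)
print(keys)
print(vals)
```

Rows for strings that do not match are filled with `NA`, so they are easy to filter out afterwards.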


**`str_replace()`** replaces the first matched pattern and returns a character vector.   
**`str_replace_all()`** replaces all matches.

In [120]:
vec_str

In [123]:
vec_str %>% str_replace('\\d+', '?')

In [124]:
vec_str %>% str_replace_all('\\d+', '?')
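The replacement string can also refer back to capture groups with `\\1`, `\\2`, and so on. As an illustrative sketch (with `vec_str` redefined so it runs on its own), this swaps the first run of letters with the digits that follow it:

```r
library(stringr)

vec_str <- c('a3332 33323', 'afda', '22a3', '3232a3')

# \\1 is the letters, \\2 is the digits; emit them in reverse order.
# Strings without a letters-then-digits match are returned unchanged.
swapped <- str_replace(vec_str, '([a-z]+)(\\d+)', '\\2\\1')

print(swapped)
```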

**`str_split_fixed()`** splits a string into a fixed number of pieces based on a pattern and returns a character matrix.   
**`str_split()`** splits a string into a variable number of pieces and returns a list of character vectors.

In [126]:
values <- c('a  da  ada    e', 'a   b     c   d    e    f')

In [131]:
#split into at most 3 pieces (that is, at most 2 splits)
values %>% str_split_fixed('\\s+', n = 3)

0,1,2
a,da,ada    e
a,b,c   d    e    f


In [129]:
values %>% str_split('\\s+')
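When the number of pieces is not known in advance but a matrix is still more convenient than a list, `str_split()` accepts `simplify = TRUE`. A short sketch, redefining `values` so the snippet is self-contained:

```r
library(stringr)

values <- c('a  da  ada    e', 'a   b     c   d    e    f')

# The list form: one character vector per input string
n_pieces <- lengths(str_split(values, '\\s+'))
print(n_pieces)

# The matrix form: as many columns as the longest split,
# with shorter rows padded with ""
pieces <- str_split(values, '\\s+', simplify = TRUE)
print(pieces)
```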

# Engines

### Fixed Matches

**`fixed(x)`** only matches the exact sequence of bytes specified by `x`. This is a very limited “pattern”, but the restriction can make matching much faster. Beware using **`fixed()`** with non-English data. It is problematic because there are often multiple ways of representing the same character. For example, there are two ways to define “á”: either as a single character or as an “a” plus an accent:

In [132]:
a1 <- "\u00e1"
a2 <- "a\u0301"

In [133]:
a1

In [134]:
a2

In [135]:
a1 == a2

They render identically, but because they’re defined differently, `fixed()` doesn’t find a match. Instead, you can use `coll()`, explained below, to respect human character comparison rules:

In [139]:
a1 %>% str_detect(fixed(a2))

In [140]:
#coll(a2) is a pattern modifier, not a comparison: pass it to a matching function such as str_detect()
a1 %>% str_detect(coll(a2))
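Apart from speed, `fixed()` is also convenient when the text to find contains characters that have a special meaning in regular expressions. A minimal sketch with a made-up string `dotted`: as a regex, `.` matches any character, while `fixed('.')` matches only a literal dot.

```r
library(stringr)

dotted <- 'a.b.c'

# Interpreted as a regular expression, '.' matches every character
n_regex <- str_count(dotted, '.')

# Wrapped in fixed(), '.' matches only the literal dots
n_fixed <- str_count(dotted, fixed('.'))

print(c(n_regex, n_fixed))
```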

### Collation search

**`coll(x)`** looks for a match to x using human-language collation rules, and is particularly important if you want to do case insensitive matching. Collation rules differ around the world, so you’ll also need to supply a `locale` parameter.

In [141]:
i <- c("I", "İ", "i", "ı")

i

In [142]:
i %>% str_subset(coll("i", ignore_case = TRUE))

In [144]:
i %>% str_subset(coll("i", ignore_case = TRUE, locale = "tr"))

The downside of **`coll()`** is speed. Because the rules for recognising which characters are the same are complicated, `coll()` is relatively slow compared to `regex()` and `fixed()`. Note that while both `fixed()` and `regex()` have `ignore_case` arguments, they perform a much simpler comparison than `coll()`.

### Boundary

**`boundary()`** matches boundaries between characters, lines, sentences or words. It’s most useful with **`str_split()`**, but can be used with all pattern matching functions:

In [147]:
x <- "This is a sentence."

In [148]:
str_split(x, boundary("word"))

In [149]:
str_count(x, boundary("word"))
#> [1] 4

In [151]:
str_extract_all(x, boundary("word"))

By convention, `""` is treated as `boundary("character")`:

In [152]:
x %>% str_split('')

In [153]:
x %>% str_count('')
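To make the convention concrete, the sketch below checks that counting with `''` gives the same result as counting with `boundary('character')`, and that extracting on word boundaries drops the trailing full stop. The sentence is the same one used above, redefined so the snippet runs on its own.

```r
library(stringr)

x <- 'This is a sentence.'

# '' and boundary('character') count the same thing
n_empty    <- str_count(x, '')
n_boundary <- str_count(x, boundary('character'))
print(c(n_empty, n_boundary))

# Extracting on word boundaries returns the words without punctuation
words <- str_extract_all(x, boundary('word'))[[1]]
print(words)
```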