In this notebook, we will cover:

* [String basics and length](#String-basics-and-length)
* [Combining strings](#Combining-strings)
* [Subsetting strings](#Subsetting-strings)
* [Locales](#Locales)

In [1]:
library(tidyverse)

“running command 'timedatectl' had status 1”
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.8
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



# String basics and length

In [2]:
mystring <- '\'"\''
writeLines(mystring)
str_length(mystring)

'"'


In [3]:
(mystring <- "STATS 306")

In [4]:
(mystring2 <- 'STATS 306')

You can create strings using double quotes or single quotes; there is no difference. For consistency, however, you might want to stick to double quotes except when the string itself contains double quotes.
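
As a quick check of the claim that the two quoting styles are equivalent, here is a minimal sketch. The variable names `double_quoted` and `single_quoted` are made up for illustration and are not used elsewhere in the notebook.

```r
# Build the same text with each quoting style
double_quoted <- "STATS 306"
single_quoted <- 'STATS 306'

# identical() compares the two objects exactly
identical(double_quoted, single_quoted)  # TRUE

# R always prints strings with double quotes, whichever style created them
print(single_quoted)
```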

In [5]:
(mystring3 <- '"MLE" stands for "Maximum Likelihood Estimate"')

You can also create a string that contains double quotes while delimiting it with double quotes, but you will have to *escape* every inner double quote with a backslash!

In [6]:
(mystring3 <- "\"MLE\" stands for \"Maximum Likelihood Estimate\"")

This also means that if you actually want a backslash in your string, you need to escape it as well.

In [7]:
(mystring4 <- "\\ is the backslash character")

In [8]:
writeLines(mystring4)

\ is the backslash character


In [9]:
(mystrings <- c("\"", '"', '\'', "'", "\\", "\\/")) # note the use of c() to create a character vector

The printed representation of strings shows the escapes.

In [10]:
print(mystrings)

[1] "\""  "\""  "'"   "'"   "\\"  "\\/"


Use `writeLines()` to see the raw contents of the string. 

In [11]:
writeLines(mystrings)

"
"
'
'
\
\/


It is good to know about a few more escape sequences.

In [12]:
writeLines("First line\nSecond line") # newline

First line
Second line


In [13]:
writeLines("Text\n\tIndented Text\nText") # newlines and tab

Text
	Indented Text
Text


You can also print characters if you know their Unicode code point by using `\u`. For example, the copyright character has code point `00A9`. Wikipedia has [a complete list](https://en.wikipedia.org/wiki/List_of_Unicode_characters).

In [14]:
writeLines("\u0394 \u00A9")

Δ ©


Base R has string functions, but we will avoid them and use the ones in the `stringr` package instead. They all start with `str_`.

In [15]:
str_length(c("a", "character", "vector"))
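
One way to see what the `stringr` functions buy you is to compare how they and their rough base R counterparts treat missing values. The sketch below is only an illustration; the pairing of `nchar()` with `str_length()` and of `paste0()` with `str_c()` is our choice of comparison, not something prescribed by the course.

```r
library(stringr)

# Base R treats a bare NA as the two-character text "NA"
nchar(NA)         # 2
paste0("x", NA)   # "xNA"

# stringr keeps the value missing instead
str_length(NA)    # NA
str_c("x", NA)    # NA
```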

# Combining strings

In [16]:
str_c("Let us con", "catenate strings!")

In [17]:
(mystrings <- c("one", "two", "ten"))

In [18]:
str_c("*** ", mystrings , " ***") # each argument is expanded to the length of the longest

In [19]:
(mystrings_na <- c("one", "two", NA))

In [20]:
str_c("*** ", mystrings_na, " ***") # missingness is contagious!

In [21]:
str_c("*** ", str_replace_na(mystrings_na), " ***") # converts missing values to the string "NA"

In [22]:
# make sure you understand the difference between NA and "NA"
str_length(NA)
str_length("NA")

In [23]:
str_c("one", "two", "ten", sep = ", ") # can provide a separator

In [24]:
str_c(mystrings, sep = ", ") # why does this not combine the strings?

In [25]:
str_c(mystrings, collapse = ", ") # use collapse if the strings you want to combine are in a vector
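
The difference between `sep` and `collapse` can be sketched as follows: `sep` is placed between the *arguments* of `str_c()`, element by element, while `collapse` joins the elements of the resulting vector into a single string. The vector `x` below is a stand-in for `mystrings`.

```r
library(stringr)

x <- c("one", "two", "ten")

# Only one argument, so there is nothing for sep to separate: still 3 strings
str_c(x, sep = ", ")

# With two arguments, sep goes between them, element by element
str_c(x, "!", sep = ", ")    # "one, !" "two, !" "ten, !"

# collapse joins the elements of the result into one string
str_c(x, collapse = ", ")    # "one, two, ten"

# The two can be combined
str_c(x, "!", sep = "", collapse = " ")    # "one! two! ten!"
```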

# Subsetting strings

In [26]:
letters

In [27]:
(letters_str <- str_c(letters, collapse = ""))

In [28]:
str_sub(letters_str, 1, 10) # the substring from position 1 through 10 (both inclusive)

In [29]:
str_sub(letters_str, -10, -1) # negative numbers count from the end

In [30]:
str_sub(letters_str, 31, 40) # ranges entirely outside the string result in an empty string

In [31]:
str_sub(letters_str, 21, 30) # works, but the result is shorter than the requested range

You can change part of a string using the assignment form of `str_sub()`.

In [32]:
str_sub(letters_str, 1, 10) <- str_sub(letters_str, 1, 10) %>% str_to_upper()
letters_str

What do you think `str_to_upper()` does?

Have a look at the [stringr documentation](http://stringr.tidyverse.org/reference/index.html) to learn about other useful functions.

In [33]:
(my_string <- str_c(str_c(letters,collapse=""),str_c(LETTERS,collapse="")))
str_length(my_string)

In [34]:
f <- "first"
s <- "second"
t <- "third"
fth <- "fourth" 
vec1 <- c(f, s)
vec2 <- c(t, fth)
str_c(vec1, vec2)

In [35]:
(collapsed_letters <- str_c(letters, collapse=","))
(str_split(collapsed_letters, ","))

In [36]:
path_to_file <- "My Documents/movies/my movie"
writeLines(path_to_file)
str_split(path_to_file, "/")

My Documents/movies/my movie


In [37]:
(test_string <- str_replace_na(c("Michigan",NA,"Ohio")))

In [38]:
(test_string <- ifelse(test_string == "NA", NA, test_string))

# Locales

In [39]:
str_to_title("this could be the title of a book")

In [40]:
LETTERS
str_to_lower(LETTERS)

In [41]:
(LETTERS_str <- str_c(LETTERS, collapse = ""))
str_to_lower(LETTERS_str)

These three functions:

* `str_to_lower()`
* `str_to_upper()`
* `str_to_title()`

are <a href="https://en.wikipedia.org/wiki/Locale_(computer_software)">**locale**</a> sensitive. That is, their behavior depends on which language's rules are applied. A language code (e.g., `en` for English, `fr` for French, `es` for Spanish) can be passed to these functions as the `locale` argument to specify the locale. In `stringr`, this argument defaults to `"en"` rather than to the locale of the operating system, so results are consistent across machines.
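
Turkish is a commonly used illustration of locale-sensitive case conversion, because it has both a dotted and a dotless "i". The sketch below assumes the `tr` (Turkish) locale is available to `stringr`; the variable names are made up for this example.

```r
library(stringr)

# "i" and the dotless i (U+0131), written with an escape to avoid encoding issues
dotted_and_dotless <- c("i", "\u0131")

# English rules: both map to a plain capital I
upper_en <- str_to_upper(dotted_and_dotless, locale = "en")
writeLines(upper_en)

# Turkish rules: "i" maps to the dotted capital (U+0130), the dotless i maps to I
upper_tr <- str_to_upper(dotted_and_dotless, locale = "tr")
writeLines(upper_tr)
```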

Another function affected by locale is `str_sort()`

In [42]:
str_sort(letters, locale = "en")

In [43]:
str_sort(letters, locale = "haw") # vowels come before consonants in the Hawaiian alphabet

In [44]:
str_sort(c(LETTERS, letters))

In [45]:
(printable_chars <- str_split(intToUtf8(32:122), "")[[1]])

In [46]:
printable_chars %>% str_sort()

# WARNING

There is also a `str_sort()` function in the Kmisc package.

See https://www.rdocumentation.org/packages/Kmisc/versions/0.5.0/topics/str_sort

For that function:

> "Lower-case letters are, by default, 'larger' than upper-case letters."

but this is not the case for `str_sort()` in the `stringr` package. Only the `stringr` package will be used in 306.
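
To see how `stringr` orders mixed-case letters, here is a small sketch. The comparison with base R's `sort(method = "radix")`, which sorts by character code, is added only for contrast and is not part of the course material.

```r
library(stringr)

mixed_case <- c("b", "A", "a", "B")

# stringr with the English locale keeps each letter's two cases together
sorted_stringr <- str_sort(mixed_case, locale = "en")
sorted_stringr   # "a" "A" "b" "B"

# Sorting by character code puts all upper-case letters first
sorted_radix <- sort(mixed_case, method = "radix")
sorted_radix     # "A" "B" "a" "b"
```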