# String and text manipulation

## String concatenation

The `"paste"` function concatenates strings, separating the items with a space by default. The `"paste0"` variant uses no separator between items. In the example below we generate SQL strings using `"paste0"`:

In [5]:
locations <- c("London", "New York", "Hong Kong")
queries <- paste0("select * from customer_table where location='", locations, "';")
queries

Instead of generating one string per location, we can use the `"collapse"` argument to join the results into a single string:

In [7]:
paste0("select * from customer_table where location='", locations, "'", collapse = "; ")

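The difference between the separator and `"collapse"` is easiest to see side by side. The following is a minimal sketch; the `cities` vector and the `"city"` labels are made up for illustration and are not part of the examples above.

```r
# Hypothetical input, chosen only for illustration
cities <- c("London", "Paris")

# paste() separates its arguments with sep = " " by default
with_space <- paste("city:", cities)

# paste0() behaves like paste(..., sep = "")
no_space <- paste0("city:", cities)

# sep goes between the arguments; collapse joins the resulting elements
joined <- paste("city", cities, sep = "=", collapse = " & ")

print(with_space)  # "city: London" "city: Paris"
print(no_space)    # "city:London" "city:Paris"
print(joined)      # "city=London & city=Paris"
```

Without `collapse` the result has one element per city; with `collapse` it is always a single string.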
## Searching strings with grep

I have a set of file names:

In [8]:
# sample() is random, so the file names differ on every run
files <- paste0("file_", sample(1:100, 30), sample(c(".csv", ".tsv", ".txt", ".jpg"), 30, replace = TRUE))
files

I'd like to select only the `"csv"` files for analysis:

In [12]:
# "\\." matches a literal dot and "$" anchors the match to the end of the string
grep("\\.csv$", files, value = TRUE)

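By default `"grep"` returns the positions of the matching elements rather than the elements themselves, and the related `"grepl"` function returns one logical value per element. The sketch below shows the three forms on a small, fixed vector; `example_files` is a hypothetical stand-in for the random `files` vector, and the pattern `"\\.csv$"` requires a literal dot before the extension.

```r
# Hypothetical file names (fixed, so the result is reproducible)
example_files <- c("file_1.csv", "file_2.txt", "csv_notes.txt", "file_3.csv")

# Positions of the matching elements (the default behaviour of grep)
csv_index <- grep("\\.csv$", example_files)

# One TRUE/FALSE per element, convenient for subsetting
is_csv <- grepl("\\.csv$", example_files)

# Subsetting with the logical vector gives the same result as value = TRUE
csv_files <- example_files[is_csv]

print(csv_index)  # 1 4
print(is_csv)     # TRUE FALSE FALSE TRUE
print(csv_files)  # "file_1.csv" "file_3.csv"
```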
## String substitution using gsub

I'd like to replace the substring `"file"` with `"item"` in the file names:

In [14]:
gsub("file", "item", files)

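Two details of `"gsub"` are worth knowing: the related `"sub"` function replaces only the first match in each string, and the pattern is treated as a regular expression unless `fixed = TRUE` is set. A minimal sketch, using made-up strings rather than the `files` vector:

```r
# Hypothetical strings, chosen so that "file" appears twice in the first one
x <- c("file_file_1.csv", "file_2.txt")

# sub() replaces only the first match in each element
first_only <- sub("file", "item", x)

# gsub() replaces every match in each element
all_matches <- gsub("file", "item", x)

# As a regular expression, "." matches any character ...
regex_dot <- gsub(".", "-", "a.b.c")

# ... whereas fixed = TRUE treats the pattern as a literal string
literal_dot <- gsub(".", "-", "a.b.c", fixed = TRUE)

print(first_only)   # "item_file_1.csv" "item_2.txt"
print(all_matches)  # "item_item_1.csv" "item_2.txt"
print(regex_dot)    # "-----"
print(literal_dot)  # "a-b-c"
```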
Next I want the file names without their extensions. The `"strsplit"` function splits each string at the separator and returns a list, a type which we will come to later:

In [18]:
# Note we need to escape the "." because it is a special character in regular expressions
print(strsplit(files[1:5], "\\."))

[[1]]
[1] "file_14" "jpg"    

[[2]]
[1] "file_95" "jpg"    

[[3]]
[1] "file_90" "jpg"    

[[4]]
[1] "file_89" "tsv"    

[[5]]
[1] "file_31" "txt"    


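The list returned by `"strsplit"` still has to be reduced to the part before the dot. The sketch below shows a few possible ways of doing that; `example_files` is a hypothetical vector, and the approaches differ when a name contains more than one dot.

```r
# Hypothetical file names, including one with two dots
example_files <- c("file_14.jpg", "file_89.tsv", "archive.tar.gz")

# Split on a literal dot, then keep the first piece of each list element
parts <- strsplit(example_files, ".", fixed = TRUE)
stems_split <- vapply(parts, `[`, FUN.VALUE = character(1), 1)

# Alternatively, strip only the final extension with a regular expression
stems_sub <- sub("\\.[^.]*$", "", example_files)

# The tools package (shipped with R) offers a helper for the same task
stems_tools <- tools::file_path_sans_ext(example_files)

print(stems_split)  # "file_14" "file_89" "archive"
print(stems_sub)    # "file_14" "file_89" "archive.tar"
print(stems_tools)  # "file_14" "file_89" "archive.tar"
```

For names with a single dot, as in the `files` vector, all three approaches give the same result.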

## Other useful string operations

The `"casefold"` function converts a string to upper or lower case (see also `"toupper"` and `"tolower"`):

In [19]:
casefold("loWer", upper = FALSE)

In [21]:
# The number of characters in each item:
nchar(files)

In [22]:
# Selecting part of a string
substring("visualize this", 1, 5)

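The functions mentioned in this section have a few close relatives. The following sketch, using made-up inputs, shows the case-conversion functions next to each other and the difference between `"substr"` and the vectorised `"substring"`:

```r
word <- "visualize this"

# toupper() and tolower() are the direct equivalents of casefold()
upper_word <- toupper(word)
lower_word <- tolower("LoWer")
folded_up <- casefold("loWer", upper = TRUE)

# substr() needs both a start and a stop position
first_five <- substr(word, 1, 5)

# substring() defaults to the end of the string when stop is omitted ...
tail_part <- substring(word, 11)

# ... and accepts vectors of start and stop positions
pieces <- substring(word, c(1, 11), c(9, 14))

print(upper_word)  # "VISUALIZE THIS"
print(lower_word)  # "lower"
print(folded_up)   # "LOWER"
print(first_five)  # "visua"
print(tail_part)   # "this"
print(pieces)      # "visualize" "this"
```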
# Exercise 1.9

**Question 1**

You work in a UK marketing company and you would like to run models on four different `"customer_location"` values: London, Manchester, Edinburgh and Cardiff. Use the `"paste"` or `"paste0"` function to generate the SQL strings for extracting the data from your `"customer_table"`.

**Question 2**

Your boss likes the analysis you carried out and wants you to run models by `"customer_location"` and `"age_group"`: `"20-30"`, `"31-40"`, `"41-50"`, `"51-60"`. Use the `"paste"` or `"paste0"` function to generate the SQL statements and assign them to the `"sql_strings"` variable. Hint: you'll probably need the `"expand.grid"` function.

**Question 3**

It turns out that Cardiff was mislabelled and should be Swansea. Use the `"gsub"` function to fix this in `"sql_strings"`.

**Question 4**

Save `"sql_strings"` to an RDS file (`"data1.rds"`) using `"saveRDS"`. Now read the file back and assign it to `"string"` using the `"readRDS"` function. Use R help to look up the functions `"saveRDS"` and `"readRDS"`.

**Question 5**

Save the `"files"` vector and the `"sql_strings"` vector to a file `"data2.RData"`, using the `"save"` function. Assign the number `1` to the `"files"` and `"sql_strings"` variables. Now use the `"load"` function to load the saved file. What happened? What is the difference between the `"save"` and `"load"` functions and the `"saveRDS"` and `"readRDS"` functions?