# R
## Strings, Performance, Misc

## Base R String Functions
- R has limited support for text processing
    - If this is the main purpose of a project, think about using another language
- Just like other functions in R, the string functions operate on vectors
- Common string functions
    - strsplit
    - grep / grepl
    - nchar
    - toupper / tolower
    - substr

In [4]:
print(nchar(c("I'm a little teapot","short and stout")))
print(nchar(c("I'm a little teapot", 14)))
print(nchar("I the only string"))

[1] 19 15
[1] 19  2
[1] 17


In [5]:
str_vector <- c("I'm a little teapot","short and stout",14,
                FALSE)
print(toupper(str_vector))
print(tolower(str_vector))

[1] "I'M A LITTLE TEAPOT" "SHORT AND STOUT"     "14"                 
[4] "FALSE"              
[1] "i'm a little teapot" "short and stout"     "14"                 
[4] "false"              


## Substring in R
- `substr` and `substring` take in 3 arguments, any of which can be vectors
```R
    substr(strings, start, end)
    substring(strings, first, last)
```
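- `start` and `end` (or `first` and `last`) are recycled when they are shorter than the longest argument
    - `substr` always returns one result per input string, so extra positions are ignored
    - Only `substring` also recycles the strings, returning as many results as the longest argument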

In [6]:
print(substr("Hello World",3,5))

[1] "llo"


In [13]:
print(substr("Hello World",1:3,1:3))
print(substring("Hello World",1:3,1:3))
print(substring("Hello World",c(1,2,3),c(1,2,3)))
print(substring("Hello World",5:20,10))
print(substring("Hello World",4,10:15))

[1] "H"
[1] "H" "e" "l"
[1] "H" "e" "l"
 [1] "o Worl" " Worl"  "Worl"   "orl"    "rl"     "l"      ""       ""      
 [9] ""       ""       ""       ""       ""       ""       ""       ""      
[1] "lo Worl"  "lo World" "lo World" "lo World" "lo World" "lo World"


In [17]:
str_vector <- c("I'm a little teapot","short and stout",14,FALSE)
print(substr(str_vector,2,1000L))
cat("\n")
print(substr(str_vector,1:5,1000L))
cat("\n")
print(substring(str_vector,1:5,1000L))
cat("\n")
print(substring(str_vector,1:15,1000))

[1] "'m a little teapot" "hort and stout"     "4"                 
[4] "ALSE"              

[1] "I'm a little teapot" "hort and stout"      ""                   
[4] "SE"                 

[1] "I'm a little teapot" "hort and stout"      ""                   
[4] "SE"                  "a little teapot"    

 [1] "I'm a little teapot" "hort and stout"      ""                   
 [4] "SE"                  "a little teapot"     " and stout"         
 [7] ""                    ""                    "ttle teapot"        
[10] " stout"              ""                    ""                   
[13] " teapot"             "ut"                  ""                   


## Regex in R
- `strsplit`, `grep`, and `grepl` all accept regular expressions
    - By default, these are POSIX-style extended regular expressions
    - Pass `perl=TRUE` to use PCRE
- `grep` returns the indexes of the vector elements that matched
- `grepl` returns a logical vector indicating whether each element matched
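
The difference between the two return types is easiest to see side by side. This is a small sketch using only base R; the `sentences` vector is made up for illustration.

```r
sentences <- c("I am a string", "I am one too", "This also has spaces")

# grep returns the positions of the matching elements
match_idx <- grep("one", sentences)

# grepl returns one TRUE/FALSE per element, handy for subsetting
match_flags <- grepl("one", sentences)

# value = TRUE returns the matching strings instead of their positions
starts_with_i <- grep("^I\\b", sentences, perl = TRUE, value = TRUE)

# strsplit accepts a regular expression as the separator
pieces <- strsplit("a1b22c", "[0-9]+")

print(match_idx)
print(match_flags)
print(starts_with_i)
print(pieces)
```

Because `grepl` always returns a vector as long as its input, it is usually the safer choice inside `if` conditions and logical subsetting.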

In [20]:
strings_with_spaces <- c("I am a string",
                         "I am one too",
                         "This also has spaces")
print(strsplit(strings_with_spaces,split=' '))

[[1]]
[1] "I"      "am"     "a"      "string"

[[2]]
[1] "I"   "am"  "one" "too"

[[3]]
[1] "This"   "also"   "has"    "spaces"



In [22]:
strings_with_spaces <- c("I am a string",
                         "I am one too",
                         "This also has spaces")
print(strsplit(strings_with_spaces,split="\\s",perl=TRUE))

[[1]]
[1] "I"      "am"     "a"      "string"

[[2]]
[1] "I"   "am"  "one" "too"

[[3]]
[1] "This"   "also"   "has"    "spaces"



In [23]:
strings_with_spaces <- c("I am a string","I am one too","This also has spaces")
print(strsplit(strings_with_spaces,split="\\W",perl=TRUE))

[[1]]
[1] "I"      "am"     "a"      "string"

[[2]]
[1] "I"   "am"  "one" "too"

[[3]]
[1] "This"   "also"   "has"    "spaces"



In [26]:
strings_with_spaces <- c("I am a string",
                         "I am one too",
                         "This also has spaces")
idx <- grep('I',strings_with_spaces,perl=TRUE)
print(strings_with_spaces[idx])

[1] "I am a string" "I am one too" 


In [27]:
grep('I',strings_with_spaces,perl=TRUE,ignore.case=TRUE)

In [28]:
grep('\\bI\\b',strings_with_spaces,perl=TRUE,ignore.case=TRUE)

In [29]:
grepl('\\bI\\b',strings_with_spaces,perl=TRUE,ignore.case=TRUE)

## The stringr library
- stringr is built on top of another library, called stringi
- The aim is to 
    - improve consistency in function calls
    - make common string manipulation tasks easy
- Has robust multilingual support

In [30]:
library(stringr)

In [31]:
print(str_length(str_vector))

[1] 19 15  2  5


In [32]:
print(str_sort(str_vector))

[1] "14"                  "FALSE"               "I'm a little teapot"
[4] "short and stout"    


In [33]:
print(str_to_title(str_vector))

[1] "I'm A Little Teapot" "Short And Stout"     "14"                 
[4] "False"              


In [34]:
print(str_pad(str_vector,40))

[1] "                     I'm a little teapot"
[2] "                         short and stout"
[3] "                                      14"
[4] "                                   FALSE"


In [42]:
str_vector <- c("\n\rI am a string\t\t",
                         "I am one\ntoo",
                         "This also has spaces")
print(str_trim(str_pad(str_vector,40)))


[1] "I am a string"        "I am one\ntoo"        "This also has spaces"


In [43]:
str_c(str_vector,",")

In [44]:
str_c(str_vector,collapse=", ")

In [45]:
str_detect(str_vector,'o')

In [46]:
str_count(str_vector,'o')

## Directory Traversal in R
- Most scripting languages provide an easy way to iterate over files in a directory
    - This is known as globbing
    - It also allows wildcards to be used
- In R, the function is `Sys.glob` (note the uppercase)
    - Rather than returning an iterator, it returns a vector containing all the file names
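
Since the result is an ordinary character vector, it can be looped over or passed to any vectorized function. The sketch below works in a temporary directory with made-up file names so that it runs anywhere.

```r
# Set up a throwaway directory with a few empty files (hypothetical names)
demo_dir <- file.path(tempdir(), "glob-demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("a.html", "b.html", "notes.txt"))))

# The wildcard is expanded by Sys.glob; the result is a character vector
html_files <- Sys.glob(file.path(demo_dir, "*.html"))

# Iterate over the vector like any other
for (f in html_files) {
    cat(basename(f), "is", file.info(f)$size, "bytes\n")
}
```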

In [47]:
print(Sys.glob("*.html"))

 [1] "Abilene_Christian_University.flat.html"
 [2] "Abilene_Christian_University.html"     
 [3] "cc.html"                               
 [4] "index.html"                            
 [5] "Lecture00.html"                        
 [6] "Lecture01.html"                        
 [7] "Lecture02.html"                        
 [8] "Lecture03.html"                        
 [9] "Lecture04.html"                        
[10] "Lecture05.html"                        
[11] "Lecture06.html"                        
[12] "Lecture07.html"                        
[13] "Lecture08.html"                        
[14] "Lecture09.html"                        
[15] "Lecture10.html"                        
[16] "Lecture11.html"                        
[17] "Lecture12.html"                        
[18] "Lecture13.html"                        
[19] "uni_webs.html"                         
[20] "Untitled.html"                         


## The readr package
- As an alternative to the built-in data loading functions, some people use the readr package
    - I usually find the built-in functions good enough
- readr provides the `read_file` and `write_file` functions
    - `read_file` reads an entire file into a string, and `write_file` writes a string out to a file
    - This is possible in base R, but cumbersome, because `readChar` must be told how much to read, so you have to look up the size of the file first
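
For comparison, here is a minimal base R sketch of reading a whole file into one string. The temporary file and its contents are made up for illustration; the point is the extra `file.info` call needed before `readChar`.

```r
# Create a small demo file (hypothetical content), written in binary mode
# so the line endings are the same on every platform
path <- tempfile(fileext = ".txt")
text <- "first line\nsecond line\n"
writeBin(charToRaw(text), path)

# readChar must be told how many bytes to read, so look up the file size
n_bytes <- file.info(path)$size
contents <- readChar(path, n_bytes, useBytes = TRUE)

print(contents)
```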

In [48]:
library(readr)

In [54]:
contents <- read_file("index.html")

In [55]:
print(contents)

[1] "<!DOCTYPE html>\n<html lang=\"en\">\n  <head>\n    <meta charset=\"UTF-8\" />\n    <title>UMBC: An Honors University In Maryland</title>\n\n    <!-- Always force latest IE rendering engine (even in intranet) & Chrome Frame -->\n    <!-- <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge,chrome=1\"> -->\n\n    <!-- Sets the viewport width to the width of the device, so media queries work -->\n    <!-- NOTE: We're locking the max scale (which prevents zooming) to fix bugs\n         during orientation changes on devices.  Our styles should accomodate this though. -->\n<link rel=\"image_src\" href=\"http://www.umbc.edu/images/UMBC_fb_tmb.png\" />\n<meta name=\"description\" content=\"\" />\n    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1\">\n    <link rel=\"icon\" type=\"image/png\" href=\"images_homepage/icon.png\" />\n\n    <!-- Template Styles -->\n    <link rel=\"stylesheet\" type=\"text/css\" href=\"stylesheets/homep

In [56]:
print(str_extract_all(contents,'<a href=".*?">.*</a>'))

[[1]]
  [1] "<a href=\"#main-content\">Skip to Main Content</a>"                                                                                                                                                                                                                                                                                                                                                                                                   
  [2] "<a href=\"http://bit.ly/1EJ0VYN\">Due to inclement weather UMBC will be opening at noon on Monday, March 2.</a>"                                                                                                                                                                                                                                                                                                                                      
  [3] "<a href=\"http://about.umbc.edu/inclement-weather-emergency-closing-policy/\">Due to inclement 

## Performance in R
- R is commonly viewed as a slow language
    - Mostly because it is
- We can still optimize, and make sure to program in an idiomatic R style
    - Avoid for loops if you can use a vectorized function
    - S4 methods are slower than S3 methods, which are slower than direct function calls
    - Consider bytecode compilation
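
As a rough illustration of the first and last points, the sketch below compares a loop with its vectorized equivalent and compiles the loop version with `cmpfun` from the `compiler` package that ships with R. The function names are invented for this example, and the timings will vary by machine. Recent versions of R already JIT-compile functions, so the explicit compilation step may make little difference.

```r
# Loop version: fills the result one element at a time
squares_loop <- function(x) {
    out <- numeric(length(x))
    for (i in seq_along(x)) {
        out[i] <- x[i]^2
    }
    out
}

# Vectorized version: a single call that works on the whole vector
squares_vec <- function(x) x^2

# Explicit bytecode compilation of the loop version
squares_loop_cmp <- compiler::cmpfun(squares_loop)

x <- as.numeric(1:100000)
print(system.time(squares_loop(x)))
print(system.time(squares_loop_cmp(x)))
print(system.time(squares_vec(x)))
```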

## Profiling your code
- The `microbenchmark` library provides the `microbenchmark` function
    - Takes in several expressions, runs each one many times (100 by default), and prints timing statistics
- For line-by-line profiling, use the `profvis` package
    - Uses a web browser to show results
  

In [57]:
library(microbenchmark)
nums <- matrix(c(1:5000),nrow=100)
print(
    microbenchmark(
    colMeans(nums),
    apply(nums,2,mean)
    )
)

Unit: microseconds
                 expr     min       lq      mean   median       uq     max
       colMeans(nums)  11.152  12.7680  15.84969  15.9755  17.4075  62.653
 apply(nums, 2, mean) 379.873 413.9685 445.47625 447.4590 465.7440 566.426
 neval cld
   100  a 
   100   b


In [59]:
## Needs to be run in RStudio
library(profvis)
print(
    profvis(
        {
    nums <- matrix(c(1:50000),nrow=100)  
    apply(nums,2,mean)
}
    ))

ERROR: Error in parse_rprof(prof_output, expr_source): No parsing data available. Maybe your function was too fast?


## Parallelism
- Because of its functional design, R is well suited to parallelization
- The library `parallel` provides multicore versions of `lapply` and `mapply`:
    - `mclapply`
    - `mcmapply`
    
```R
    mclapply(X, FUN, mc.cores=N_CORES)
    mcmapply(FUN, X, Y, mc.cores=N_CORES)
```
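
A small concrete sketch of both functions follows. The inputs are made up. The core count is chosen conservatively because forking is not available on Windows, where `mc.cores` must be 1.

```r
library(parallel)

# Forking is unavailable on Windows, so fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# mclapply: like lapply, one input vector, returns a list
word_lengths <- mclapply(c("asdf", "ghhjk", "qerwet"), nchar,
                         mc.cores = n_cores)

# mcmapply: like mapply, the function comes first and is applied
# element-wise across several input vectors
pair_sums <- mcmapply(function(x, y) x + y, 1:3, 4:6,
                      mc.cores = n_cores)

print(unlist(word_lengths))
print(pair_sums)
```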

In [60]:
library(parallel)
print(detectCores())

[1] 24


In [61]:
seed_strings <- c("asdf","ghhjk",'qerwet',
                  'uopi','zxcv','asdgf')
lots_of_strings <- rep(seed_strings,20000)
print(
    microbenchmark(
        lapply(lots_of_strings,str_length),
        mclapply(lots_of_strings,str_length,mc.cores=12)
    )
)

Unit: milliseconds
                                                 expr       min        lq
                  lapply(lots_of_strings, str_length) 190.71839 210.76317
 mclapply(lots_of_strings, str_length, mc.cores = 12)  77.60858  89.12278
     mean   median       uq      max neval cld
 234.1033 222.2979 252.8531 347.1181   100   b
 113.6842 105.3007 124.2989 221.5851   100  a 


In [62]:
cl <- makeCluster(12)
print(
    microbenchmark(
        colMeans(nums),
        apply(nums,2,mean),
        parCapply(cl,nums,mean)
        )
    )
stopCluster(cl)

Unit: microseconds
                      expr       min        lq       mean     median        uq
            colMeans(nums)    65.955    77.095   108.2953   119.7595   131.979
      apply(nums, 2, mean)  3678.633  3925.211  5379.1383  5670.4275  6425.707
 parCapply(cl, nums, mean) 10723.515 46655.732 47832.5941 47941.8125 49219.075
       max neval cld
   147.574   100 a  
  8962.301   100  b 
 56599.978   100   c


## Presenting Data
- R is often used in the analysis phase of research
    - Especially to produce nice graphics
- Packages exist that allow papers to be written alongside the R code that produces their results
    - knitr is a very popular one

## knitr
- `knitr` allows a document to be written in 
    - R Markdown
    - HTML
    - LaTeX
- R code is set off in these documents using conventions specific to each format
- The code is executed and its results are inserted inline in the output document
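
To make the R Markdown convention concrete, the sketch below builds a tiny document as a character vector and knits it in memory through the `text` argument of `knit`. The document content is invented for illustration, and the call is skipped if knitr is not installed.

```r
# A tiny R Markdown document, built line by line; `fence` holds the three
# backticks that open and close a code chunk
fence <- strrep("`", 3)
rmd <- c(
    "Some ordinary text.",
    "",
    paste0(fence, "{r}"),
    "x <- 1:10",
    "mean(x)",
    fence,
    "",
    "Six times seven is `r 6 * 7`."
)

out <- NULL
if (requireNamespace("knitr", quietly = TRUE)) {
    # text= knits from a character vector instead of a file on disk
    out <- knitr::knit(text = rmd, quiet = TRUE)
    cat(out, sep = "\n")
}
```

In the knitted Markdown, the chunk is followed by its printed result and the inline expression is replaced by its value.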

In [63]:
library(knitr)
knit('005-latex.Rtex')



processing file: 005-latex.Rtex


  |......                                                           |   9%
  ordinary text without R code

  |............                                                     |  18%
label: setup (with options) 
List of 1
 $ include: logi FALSE

  |..................                                               |  27%
  ordinary text without R code

  |........................                                         |  36%
label: unnamed-chunk-1
  |..............................                                   |  45%
  ordinary text without R code

  |...................................                              |  55%
label: my-cache (with options) 
List of 1
 $ cache: logi TRUE

  |.........................................                        |  64%
   inline R code fragments

  |...............................................                  |  73%
label: cairo-scatter (with options) 
List of 4
 $ dev       : chr "cairo_pdf"
 $ fig.width : num 5
 $ fig.height: num 5
 $ out.

output file: 005-latex.tex



## Loading Libraries from Non Default Locations
- By default, R installs packages to, and loads them from, a system location that often requires sudo access to write to
- You can change where libraries are installed by passing the `lib` parameter to `install.packages`
- There are several ways to tell R where to look for libraries, including the `lib.loc` parameter of the `library` function
    - The most consistent way is to set the environment variable `R_LIBS_USER` in your shell before starting R
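
Within a running session, the search path can also be inspected and extended with `.libPaths`. The sketch below uses a temporary directory as a stand-in for a personal library. The `install.packages` and `library` calls are left commented out because they need network access, and the package name in them is only an example.

```r
# A stand-in for a personal library directory (hypothetical location)
user_lib <- file.path(tempdir(), "r-libs")
dir.create(user_lib, recursive = TRUE, showWarnings = FALSE)

# Put the new directory at the front of the library search path
old_paths <- .libPaths()
.libPaths(c(user_lib, old_paths))
print(.libPaths())

# Install into, and load from, that directory explicitly:
# install.packages("stringr", lib = user_lib)
# library(stringr, lib.loc = user_lib)
```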