# CEB 35300, Phylogenetic Comparative Methods 2018
## Andrew Hipp, ahipp@mortonarb.org
### Session 0 (prior to class): Getting started in <code>R</code>

#### Environment -- the R session
R is a programming environment, organized as work sessions. Each work session has within it a host of objects---this includes variables and functions---enclosed within environments. Environments when you first start up R include all the packages that are included in the R base distribution, plus any that you specify to be loaded when you start up R. Start up R on your machine, and take a look at what's in there when you first start it.

In [1]:
search()

You see that there is something called '.GlobalEnv'. This is the environment in which you are currently working. The other entries are packages that were attached when R started up. Whenever you type a function or variable name, as we did when we called the function "search", R looks through these environments in the order listed, beginning with '.GlobalEnv', and stops at the first match. If we were to create a new function called "search", it would live in the global environment, and R would find that function before it found the function search that is in the base package. Take a look:

In [2]:
search <- function(anyName = 'Otis') print(paste('Hello. I am looking for', 
                                           anyName, '-- can you help me?'))
search()

[1] "Hello. I am looking for Otis -- can you help me?"


Hold it! We have masked 'search': our new function now hides the original. Let's figure out where the original 'search' came from using '?search'. We find that it is from the base package, so we can access the object in a couple of ways:

In [3]:
get('search', envir = baseenv())() ## this is the most general approach; get() can also retrieve objects from non-package environments
base::search() ## this does the same thing using the package::name shortcut

Realizing now what we've done, we can remove the object we've created, no harm done.

In [4]:
rm(search)
search()

As you can see, we haven't removed the original function from the base environment, only the one we created. Whew!
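If you want to check where a name resolves without calling it, base R's "find" function reports every attached environment that holds an object of that name, in search order. The following is a minimal sketch, assuming a fresh session run at the top level; the masking function and the variable names (whereBefore and so on) are throwaways made up for illustration.

```r
# Where does 'search' live before we mask it?
whereBefore <- find("search")
whereBefore                            # "package:base" in a fresh session

# Mask it with a throwaway function in the global environment
search <- function() "I am the masking copy"
whereMasked <- find("search")
whereMasked                            # ".GlobalEnv" is now listed first

# The base version is still reachable through the :: operator
head(base::search(), 1)                # ".GlobalEnv"

# Remove the mask; the base function was never touched
rm(search)
whereAfter <- find("search")
whereAfter
```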

When I begin a work session or a new project, I typically clear out any objects left over from the last time I was working in R, and I move into a new folder. The folder matters most at the end of your work session, when you want to save the workspace and history. While you can navigate to a new folder at any time, I try to do so at the beginning of my work session so that I don't inadvertently overwrite what I was working on last time, and so that any files I write (e.g., PDFs, tab-delimited files, error logs or anything else) are stored in a single working folder.

In [5]:
ls() # this lists what is in the workspace at the time that I start up... 
        # it may be nothing, or all sorts of stuff.

In [6]:
rm(list = ls()) # this deletes EVERYTHING in the workspace. 
                # This command is ruthless and will not ask you
                # if you want to proceed.
ls() # this lists what's in your workspace now; nothing

In [7]:
getwd() # this command tells me what directory I'm in...
dir() # ... and this one tells me what is in that directory.

In [8]:
#setwd(choose.dir()) # To change directories interactively, you can use this 
                    # command, but in Windows only; if you are not using
                    # Windows, type the path in quotes between the 
                    # parentheses of "setwd".
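On any operating system you can also build the path yourself and hand it to "setwd". Here is a small sketch, assuming you are happy to practice in a scratch folder; the folder name 'pcm-session0' is made up for illustration, and I put it under the session's temporary directory so nothing of yours gets touched.

```r
# A hypothetical project folder under the session's temporary directory
projectDir <- file.path(tempdir(), "pcm-session0")
dir.create(projectDir, showWarnings = FALSE)

# setwd() returns the previous working directory invisibly, so keep it
oldDir <- setwd(projectDir)
getwd()   # we are now inside the new folder
dir()     # a newly created folder has nothing in it

# Go back to where we started
setwd(oldDir)
```

For real work you would replace the temporary path with a folder of your own, e.g. one folder per project.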

I then attach the packages that I know or suspect that I'll need. If I don't do it right away, there's no great problem: packages can be attached any time, and you can navigate to a new folder at any time. 

In [9]:
library(ape) # here I'm attaching the Analysis of Phylogenetics and Evolution package;
             # if I didn't have that package installed on my machine, 
             # I would have to use install.packages('ape') before attaching it
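If you are not sure whether a package is installed, you can ask before attaching it. This is one possible sketch, not the only way to do it; the variable names pkg and pkgInstalled are mine. It only reports a missing package rather than installing it, so it is safe to run offline.

```r
pkg <- "ape"

# requireNamespace() returns TRUE or FALSE instead of stopping with an error
pkgInstalled <- requireNamespace(pkg, quietly = TRUE)

if (pkgInstalled) {
  # character.only = TRUE lets library() take the name from a variable
  library(pkg, character.only = TRUE)
} else {
  message("Package '", pkg, "' is not installed; run install.packages('",
          pkg, "') and try again.")
}
```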

### Assignment and data types
#### numeric vectors -- assignment of vectors, and operating on vectors
One of the oft-quoted principles of R is that everything is an object. What this means is that almost everything you do---typing a number or character, performing math, calling a function---produces an object that can be acted upon in some way. It can be printed, added to, passed into another function or assigned to a variable within the workspace. There are some exceptions. For example, the "save" function is not called because it produces a useful object within the workspace, but because it has the external effect (often called a side-effect) of saving an object to the disk. The same is true of 'print'. But aside from these exceptions, learning to think in terms of objects is essential to getting around at all in R.
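A few lines make the idea concrete. This is only an illustrative sketch with made-up variable names: the result of a function call or a bit of arithmetic is an object you can store and reuse, and even 'print', which we call for its side-effect, quietly hands its argument back.

```r
x <- sqrt(16)    # the result of a function call is an object...
class(x)         # ...and it has a class: "numeric"
y <- x + 1       # arithmetic produces a new object as well
y                # 5

# print() is called for its side-effect of displaying something,
# but it also returns its argument invisibly, so this assignment works
z <- print("hello")
z                # "hello"
```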

Objects all have a class, which can be accessed using the "class" function. Numeric vectors are among the most commonly used objects in our work. In the following, you can see some of the ways in which numeric vectors are created and manipulated.

In [10]:
1:10 # prints 1:10, doesn't assign it to anything
class(1:10)

In [11]:
aa <- 1:10 # assign values using the <- or...
aa = 1:10  # ... a single = sign.
class(aa)

In [12]:
# the "c" function is very important in R: it is used to combine (concatenate) values into a vector.
bb <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
class(bb) # note that these two are not of the same class; why not?

cc <- seq(from = 1, to = 10, by = 1) # produces the same values as 1:10, but seq() is more flexible;
                                        # you can set by = 0.1 or by = 2 if you want different intervals

In [13]:
identical(aa, bb) # a logical test: is aa identical to bb?
                    # this test fails, because the classes of aa and bb are different 
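The difference comes down to how the numbers are stored. Here is a short sketch using fresh variable names (ints and dbls are mine, so nothing above is overwritten) that shows the two storage types and a few ways of comparing across them.

```r
ints <- 1:10                               # the colon operator yields integers
dbls <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)   # bare numbers are stored as doubles

typeof(ints)                       # "integer"
typeof(dbls)                       # "double"

identical(ints, dbls)              # FALSE: same values, different storage type
identical(ints, as.integer(dbls))  # TRUE once the types agree
isTRUE(all.equal(ints, dbls))      # TRUE: all.equal() compares the values

# An L suffix asks for an integer explicitly
typeof(c(1L, 2L, 3L))              # "integer"
```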

In [14]:
aa == bb # a logical test: is each element of aa equal to its related element in bb?

In [15]:
all(aa == bb) # another logical test: do ALL of the elements of aa equal their related elements in bb?

In [16]:
aa = bb  # assignment: set aa to be identical to bb
aa <- bb # assignment

In [17]:
aa[2] # look at the second element of aa
aa[2] <- 15 # change the second element of aa

In [18]:
aa == bb
print(paste('multiplying:', aa * bb))
print(paste('adding:', aa + bb))
aa * 1.1
as.integer(aa * 1.1)

 [1] "multiplying: 1"   "multiplying: 30"  "multiplying: 9"   "multiplying: 16" 
 [5] "multiplying: 25"  "multiplying: 36"  "multiplying: 49"  "multiplying: 64" 
 [9] "multiplying: 81"  "multiplying: 100"
 [1] "adding: 2"  "adding: 17" "adding: 6"  "adding: 8"  "adding: 10"
 [6] "adding: 12" "adding: 14" "adding: 16" "adding: 18" "adding: 20"


In [19]:
names(aa) <- names(bb) <- c('Quercus', 'Carex', 'Acer', 'Daphnia',
                            'Granite', 'Pan', 'Peromyscus', 'Microtus', 'Shmoo', 'Caenorhabditis')
aa
message('You can subset aa using a number or a name indexing the position of the data you want')
aa['Quercus']

You can subset aa using a number or a name indexing the position of the data you want
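Names, positions and logical tests can all be used inside the square brackets. A small sketch with a made-up three-taxon vector (the name traits and its values are for illustration only):

```r
traits <- c(Quercus = 1, Carex = 15, Acer = 3)

traits[2]                      # by position: Carex
traits["Carex"]                # by name: the same element
traits[c("Acer", "Quercus")]   # several names at once, in the order requested
traits[traits > 2]             # by logical test: Carex and Acer
names(traits)[traits > 2]      # just the names of the elements that pass
```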


### Matrices and data frames: start thinking about trait data
Trait data may be one-dimensional, in which case we can store them in named numeric vectors (look at aa above). But more often they are two-dimensional, tabular data, with individuals or taxa as the rows and traits as the columns. If your trait data are all of one data class---e.g., all numeric or all character data---a matrix is a very natural way of storing these data. 

In [20]:
dd <- matrix(1:12, 4, 3) # create a 4-row by 3-column matrix, filled by column with the numbers 1:12 
dd <- matrix(1:12, 4, 3, byrow = TRUE) # create a 4-row by 3-column matrix, filled by row with the numbers 1:12 
dd <- matrix(NA, 5, 3, dimnames = list(c('harpo', 'groucho', 'zeppo', 'chico', 'gummo'), 
                                       c('instrument', 'hair', 'birthOrder'))) # create an empty matrix (cells filled with NA)
                                                                                # and name the rows and columns using the
                                                                                # dimnames argument

In [21]:
dd[1, 1] <- 'harp' # place a single value in row 1, column 1; this makes the whole matrix character data
dd[, 3] <- c(2, 3, 5, 1, 4) # fill column 3 with birth order; in a character matrix these are stored as strings
dd[1,1] # look at the value in row 1, column 1

In [22]:
dd['harpo', 1] # look at the value in row 1, column 1

In [23]:
dd['harpo', 'instrument'] # look at the value in row 1, column 1
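Because a matrix holds only one data class, putting a single string into a numeric matrix converts every cell. A tiny sketch with a made-up matrix m:

```r
m <- matrix(1:6, nrow = 2)   # an integer matrix
typeof(m)                    # "integer"

m[1, 1] <- "harp"            # one character value...
typeof(m)                    # ..."character": every cell was coerced
m[2, 3]                      # "6", now a string rather than a number
```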

Often, however, your data will be composed of traits of different classes: one continuous, one an ordered categorical trait, one binary. In this case, you are better off using a data.frame object. If you read in your data using 'read.csv', 'read.delim', or 'read.table', you will end up with a data.frame object. The following lines explore some of the properties of data.frames and their relationship to matrices.

In [24]:
dd2 <- as.data.frame(dd)
dd[, 1]

In [25]:
dd2[, 1]

In [26]:
dimnames(dd2)

In [27]:
row.names(dd2)

In [28]:
names(dd2)

One of the oddities of data.frame objects is that, in versions of R before 4.0.0, they default to storing character data as factors (as in the output shown here); since R 4.0.0, character columns stay character unless you ask otherwise. Factors are more often than not a bother, and working with them can have odd side-effects. When I read in data from an external file using "read.csv" or "read.delim", I almost always use the "as.is = TRUE" argument to prevent R from coercing my data to factors. In the following lines, I convert a few columns to non-factor classes. I also demonstrate the use of $ and [[]] to index columns in a data.frame; $ cannot be used on a matrix, and [[]] on a matrix picks out a single cell rather than a column.

In [29]:
dd2$birthOrder

In [30]:
as.integer(dd2$birthOrder)

In [31]:
dd2$birthOrder <- as.integer(as.character(dd2$birthOrder)) # go through character so that a factor gives its labels, not its level codes
dd2$instrument <- as.character(dd2$instrument)
for(i in 1:3) print(class(dd2[[i]]))

[1] "character"
[1] "factor"
[1] "integer"
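Here is one of those odd side-effects, in a sketch that behaves the same in any version of R because the factor is created explicitly (the vector f is made up for illustration). Converting a factor straight to integer returns its internal level codes, not the numbers you see printed.

```r
f <- factor(c("10", "20", "5"))

levels(f)                      # "10" "20" "5": levels are sorted as text
as.integer(f)                  # 1 2 3: the level codes, not the values
as.integer(as.character(f))    # 10 20 5: go through character to get the values
```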


A data.frame object can be indexed just like a matrix, as well:

In [32]:
marxFavs <- c('groucho', 'harpo', 'chico')
print(dd2[marxFavs, ])

        instrument hair birthOrder
groucho       <NA> <NA>          3
harpo         harp <NA>          2
chico         <NA> <NA>          1


### Lists: the most flexible data class
While the cells of a matrix must all be of one data class, the columns of a data.frame may be of different data classes. The data.frame, it turns out, is just a special kind of list. Lists are collections of any kind of object, indexed using $ and [[]] or [], like a data.frame. In our line of work, the 'phylo' class is simply a list that packages together a matrix, one or more numeric vectors, and character vectors to define a phylogenetic tree. 

In [33]:
as.list(dd2)
ee <- c(dd2, list(justABunchOfLetters = letters[1:16], anotherBunchOfLetters = LETTERS[18:2]))
ee$anotherBunchOfLetters

In [34]:
ee[3:4]

In [35]:
ee[[3]]

In [36]:
ee[3]

In [37]:
class(ee[[3]])

In [38]:
class(ee[3])

In [39]:
class(ee[3:4])
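To see what "a list with a class" looks like for a tree, here is a hand-built sketch of the three-tip tree ((Quercus,Carex),Acer). I am assuming the usual element names that ape uses for 'phylo' objects (edge, edge.length, tip.label, Nnode); the branch lengths are made up, and in practice you would let ape build the object for you, e.g. with read.tree.

```r
tr <- list(
  # each row of edge is one branch: parent node, then child node
  # tips are numbered 1:3, the root is node 4, the other internal node is 5
  edge = matrix(c(4L, 5L,
                  5L, 1L,
                  5L, 2L,
                  4L, 3L), ncol = 2, byrow = TRUE),
  edge.length = c(1, 1, 1, 2),                # one (made-up) length per branch
  tip.label = c("Quercus", "Carex", "Acer"),  # names of tips 1, 2, 3
  Nnode = 2L                                  # number of internal nodes
)
class(tr) <- "phylo"

is.list(tr)          # TRUE: underneath, it is still a list
names(tr)            # the pieces that make up the tree
tr$tip.label         # index with $ ...
tr[["edge"]]         # ... or with [[]], just as for any other list
```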

### How to find data
A small dataset is easy to inspect by eye, but if you want to find or modify things on the fly, especially in larger objects, you need to be able to search through your objects. The "grep" and "match" functions are greatly helpful in this regard.

In [40]:
grep('A', ee$anotherBunchOfLetters)

In [41]:
grep('rp', dd2$instrument)

In [42]:
grep('rp', dd2$instrument, value = TRUE)

In [43]:
match(marxFavs, row.names(dd2)) # positions of marxFavs among the row names of dd2
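A few close relatives of "grep" and "match" are worth knowing. The sketch below uses two made-up character vectors (brothers and favs) so that it runs on its own: "match" gives positions, "%in%" gives a TRUE/FALSE answer for each element, and "grepl" is the logical version of "grep".

```r
brothers <- c("harpo", "groucho", "zeppo", "chico", "gummo")
favs <- c("groucho", "harpo", "chico")

match(favs, brothers)       # 2 1 4: where each favorite sits in brothers
favs %in% brothers          # TRUE TRUE TRUE: is each favorite present at all?
which(brothers %in% favs)   # 1 2 4: which brothers are favorites
grepl("^g", brothers)       # TRUE only for names starting with "g"
brothers[grepl("^g", brothers)]
```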

### At the end of the day
Provided you have navigated to a new folder, you can end your session using "q()" or "quit()", and you'll be given the option to save your workspace. Do so: the next time you start R in that folder, you'll find all your objects and history (up to 512 lines, unless you raise the limit by setting the R_HISTSIZE environment variable, e.g., using "Sys.setenv(R_HISTSIZE = x)", where x is the number of lines you'd like to be saved in the history file). You can later double-click the workspace file if you are in Windows, or simply launch R from within the folder you've left (in Linux and, I think, Mac), to restore the work session. Alternatively, you can reload it manually using "load(x)", where x is the name of your workspace file. You can also save any objects you like from the workspace using "save"; see the documentation for these functions.

In [44]:
# save()
# load()
# rm()
# quit(); q()
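Here is a minimal round trip, sketched with a made-up object (myTraits) and files written to the session's temporary directory so nothing in your working folder is touched. "save" and "load" restore objects under their original names; "saveRDS" and "readRDS" store a single object and let you choose the name when you read it back.

```r
myTraits <- c(Quercus = 1, Carex = 15, Acer = 3)

# save() / load(): the object comes back under its original name
wsFile <- file.path(tempdir(), "session0.RData")
save(myTraits, file = wsFile)
rm(myTraits)
load(wsFile)
myTraits

# saveRDS() / readRDS(): one object per file, named on the way back in
rdsFile <- file.path(tempdir(), "myTraits.rds")
saveRDS(myTraits, rdsFile)
restored <- readRDS(rdsFile)
identical(restored, myTraits)   # TRUE
```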

## Websites for additional information and examples
http://kembellab.ca/r-workshop/biodivR/SK_Biodiversity_R.html

http://www.mpcm-evolution.org/practice/online-practical-material-chapter-4/chapter-4-1-codes-used-produce-twelve-figures-chapter