-
Notifications
You must be signed in to change notification settings - Fork 8
/
amg.Rmd
8952 lines (5868 loc) · 491 KB
/
amg.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Applied Population Genetics"
author: "Rodney J. Dyer"
description: "A textbook for the use of R in spatial genetic data analysis."
cover-image: "media/TitlePage_small.png"
site: bookdown::bookdown_site
output: bookdown::gitbook
documentclass: book
bibliography: packages.bib
biblio-style: apalike
link-citations: yes
github-repo: "dyerlab/AMG"
---
# Applied Population Genetics {- .white_letters}
<a href="http://dyerlab.org">
```{r echo=FALSE}
knitr::include_graphics("media/APG-Title-Wide.png")
```
</a>
Build Date: **`r date()`**
The content of this text is modified, added to, and changed continually. You are welcome to use the content of this book in its present form and you can provide feedback, comments, or suggestions for additions by contacting me at [rjdyer@vcu.edu](mailto://rjdyer@vcu.edu). This work will continue to be hosted online and continually updated as new approaches and applications become available.
```{r echo=FALSE, message=FALSE, warning=FALSE}
library(ggplot2)
library(knitr)
theme_set( theme_bw(base_size=14) )
knitr::opts_chunk$set( warning=FALSE, message=FALSE,error = FALSE )
```
## Logistics {-}
© 2017 by R.J. Dyer.
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. Under this license, authorized individuals may copy, distribute, display and perform the work and make derivative works and remixes based on this text only if they give the original author credit (attribution). You are also free to distribute derivative works only under a license identical ("not more restrictive") to the license that governs this original work.
Dr. Rodney Dyer is an Associate Professor and Director for the Center for Environmental Studies at Virginia Commonwealth University in Richmond, Virginia, USA. His research focuses on genetic connectivity and structure and how the environment influences both. More information on his research can be found at [http://dyerlab.org](http://dyerlab.org).
## Motivation {-}
This is, most of all, not a book about R. This is also not a "Population Genetics in R" textbook. It is a book about how we do population genetic analyses, for which R is a tool that allows us to reach beyond the limitations of point-and-click interfaces. As a field, Population Genetics has a broad set of textbooks describing the underlying theory. As a student, I cut my teeth on the texts of Hartl (1981), Hartl & Clark (1997) and have used other great texts such as Hamilton (2011) and Hedrick (2009) in the classroom to teach the subject for the last decade. In late 2015, there are a host of texts available to the student of population genetics—amazon lists 150 different books under the search term “Population Genetics textbook"—why do another one? What I have found is that while the theory behind this discipline has been well developed, its application has been largely neglected.
As a new graduate student, fresh out of my first population genetics course, I felt armed with the understanding of how microevolutionary processes influence the distribution of alleles within and among populations. What I wasn't prepared for was sitting in front of the computer, looking at a few thousand individuals assayed for several different loci and actually ‘doing' population genetics. All of those textbooks provide me with what is expected and the theory behind it, though often fall short on teaching me how I could apply those inferences to data I actually collect. If you are a theoretical population geneticist, those texts and your ability to integrate mathematical equations will provide you a research lifetime of work. However, if you are practitioner who uses population genetic tools to answer conservation, management, or ecologically inspired questions, the evolutionary expectations of population genetic processes will most likely not be as important as directly estimating inbreeding, exploring ongoing connectivity, or determining genetic granularity of existing populations. This is where this textbook is focusing, a seemingly uninhabited niche in the knowledge ecosystem of graduate level population genetics.
This text was developed out of a graduate course in Population Genetics that I've been teaching at Virginia Commonwealth University since 2005. This texts uses R and many additional libraries available within the R ecosystem to illustrate how to perform specific types of analyses and what kind of biological inferences we can gain from them. In the process, we cover materials that are commonly needed in the application of population genetic analysis such as spatial autocorrelation, paternity analysis, and the use of permutation while at the same time highlighting logistical challenges commonly encountered in analyzing real data such as incomplete sampling, missing data, and rarefaction.
## Typography {-}
This text is designed primarily as an electronic publication (ePub) and has dynamical content included within. If you are reading a static copy (as PDF or printed text), you are not getting the full experience of what this book has been designed to deliver. Much of the text is devoted to quantitative analysis of data in R, and as such I've typeset the R components differently from the flowing text. Text intended to be input into R is typeset as a fixed width font and colored using the default color scheme found in RStudio (http://rstudio.org). Here are two input examples.
```{r}
value <- 20
name <- "Perdedor"
```
This text is amenable to copy-and-paste action so you can perform the same calculations on your computer as are done in the text. When R returns an answer to an analysis or prints the contents of a variable out, the results are also typeset in fixed width font but each line is prefixed by two hash marks.
```{r}
rnorm(10)
```
The lines with hashes are not something you are going to copy-and-paste into your R session, it is what R is giving you. Inline code (e.g., code inserted into a sentence in a descriptive context such as discussing the `rnorm()` function in the previous example) is typeset similarly though not necessarily intended to be cut-and-pasted into your session.
Throughout the ePub, there are also dynamical content. Some of this content may be condensed text put into a scrolling text-box. Here the notion is to provide direct access to information without unnecessarily adding length to the overall text. Here is an example, showing the documentation associated with the R `help.start()` function within a scrolling text box.
<div class="scrollingbox"><pre>
help.start {utils} R Documentation
Hypertext Documentation
Description
Start the hypertext (currently HTML) version of R's online documentation.
Usage
help.start(update = FALSE, gui = "irrelevant", browser = getOption("browser"), remote = NULL)
Arguments
update - logical: should this attempt to update the package index to reflect the currently available packages. (Not attempted if remote is non-NULL.)
gui - just for compatibility with S-PLUS.
browser - the name of the program to be used as hypertext browser. It should be in the PATH, or a full path specified. Alternatively, it can be an R function which will be called with a URL as its only argument. This option is normally unset on Windows, when the file-association mechanism will be used.
remote - A character string giving a valid URL for the ‘R_HOME' directory on a remote location.
Details
Unless remote is specified this requires the HTTP server to be available (it will be started if possible: see startDynamicHelp).
One of the links on the index page is the HTML package index, ‘R.home("docs")/html/packages.html', which can be remade by make.packages.html(). For local operation, the HTTP server will remake a temporary version of this list when the link is first clicked, and each time thereafter check if updating is needed (if .libPaths has changed or any of the directories has been changed). This can be slow, and using update = TRUE will ensure that the packages list is updated before launching the index page.
Argument remote can be used to point to HTML help published by another R installation: it will typically only show packages from the main library of that installation.
See Also
help() for on- and off-line help in other formats.
browseURL for how the help file is displayed.
RSiteSearch to access an on-line search of R resources.
Examples
help.start()
## Not run:
## the 'remote' arg can be tested by
help.start(remote = paste0("file://", R.home()))
## End(Not run)
</pre></div>
In addition to shoving static content into scrolling windows, longer R scripts will also be inserted into these widgets so that space can be saved while preserving syntax highlighting and code format.
## Interactive Content {-}
Where possible, I have included interactive content in the text. Examples include dynamical plots such as embedding google map objects in the page, network structure that can be manipulated in the browser, and widgets that show how a process influences popualtion genetic features by allowing you to direclty manipulate the parameters in the model.
## Dedication {-}
I would like to dedicate this text and the motivations and inspirations that underly its creation, to my thesis advisor, Dr. Victoria L. Sork. She has an uncanny ability to see beyond the questions that we know how to answer and to help us focus on the answers that we are really trying to find.
<!--chapter:end:index.rmd-->
# The R Ecosystem {.imageChapter}
<div class="chapter_image"><img src="chapter_images/ch_farm.jpg"></div>
> R is an open-source, community driven platform available for a wide variety of computer operating systems and is becoming a de facto standard in modern biological analyses. It is not a point-and-click interface, rather it is a rich analytical ecosystem that will make your research and scientific life more pleasurable.
## Getting R Configured
The grammar of the R language was derived from another system called S-Plus. S-Plus was a proprietary analysis platform developed by AT&T and licenses were sold for its use, mostly in industry and education. Given the closed nature of the S-Plus platform, R was developed with the goal of creating an interpreter that could read grammar similar to S-Plus but be available to the larger research community. The use of R has increased dramatically due to its open nature and the ability of people to share code solutions with relatively little barriers.
The main repository for R is located at the [CRAN Repository](http://cran.r-project.org), which is where you can download the latest version. It is in your best interests to make sure you update the underlying R system, changes are made continually (perhaps despite the outward appearance of the website).
```{r echo=FALSE}
knitr::include_graphics("./media/CRAN.png")
```
The current version of this book uses version 3.2. To get the correct version, open the page and there should be a link at the top for your particular computing platform. Download the appropriate version and install it following the instructions appropriate for your computer.
### Packages
The base R system comes with enough functionality to get you going. By default, there is only a limited amount of functionality in R, which is a great thing. You only load and use the packages that you intend to use. There are just too many packages to have them all loaded into memory at all times and there is such a broad range of packages, it is not likely you'd need more than a small fraction of the packages during the course of all your analyses.
Once you have a package, you can tell R that you intend to use it by either
```{r eval=FALSE}
library(package_name)
```
or
```{r eval=FALSE}
library(package_name)
```
They are approximately equivalent, differing in only that the second one actually returns a TRUE or FALSE indicating the presence of that library on you machine. If you do not want to be mocked by other users of R, I would recommend the `library` version---there are situations where `require` will not do what you think you want it to do (even though it is a verb and should probably be the correct one to use grammatically).
There are, at present, a few different places you can get packages. The packages can either be downloaded from these online repositories and installed into R or you can install them from within R itself. Again, I'll prefer the latter as it is a bit more straightforward.
### CRAN
The main repository for packages is hosted by the r-project page itself. There are packages with solutions to analyses ranging from Approximate Bayesian Computation to Zhang & Pilon's approaches to characterizing climatic trends. The list of these packages is large and ever growing. It can be found on CRAN under the packages menu. To install a package from this repository, you use the function
```{r eval=FALSE}
install.packages("thePackageName")
```
You can see that R went to the CRAN mirror site (I use the rstudio one), downloaded the package, look for particular dependencies that that package may have and download them as well for you. It should install these packages for you and give you an affirmative message indicating it had done so.
At times, there are some packages that are not available in binary form for all computer systems (the rgdal package is one that comes to mind for which we will provide a work around later) and the packages need to be compiled. This means that you need to have some additional tools and/or libraries on your machine to take the code, compile it, and link it together to make the package. In these cases, the internet and the documentation that the developer provide are key to your sanity.
### GitHub
There are an increasing number of projects that are being developed either in the open source community or shared among a set of developers. These projects are often hosted on http://www.github.com where you can get access to the latest code updates. The packages that I develop are hosted on Github at (http://github.com/dyerlab) and only pushed to CRAN when I make major changes.
```{r echo=FALSE}
knitr::include_graphics("./media/DyerlabGithub.png")
```
To install packages from Github you need to install the devtools library from CRAN first
```{r eval=FALSE}
install.packages("devtools")
library(devtools)
```
Then you need to use the devtools function `install_github()` to go grab it. To do so you need two separate pieces of information, the name of the developer who is creating the repository and the name of the repository it is contained within. For the gstudio package, the develop is `dyerlab` and the repository name is `gstudio`. If you are comfortable with git, you can also check out certain branches of development if you like with optional arguments to the `install_github()` function. Here is how you would install both the `gstudio` and `popgraph` libraries I use in this book.
```{r eval=FALSE}
install_github("dyerlab/gstudio")
install_github("dyerlab/popgraph")
```
If you are starting to work with R and intend to compile packages, there are some tools that you will need to have on your machine. For Windows machines, there is an rtools download EXE that you can get. On OSX, you need to download the developers tools from Apple (available on the Application Store for free). In either case, the developer (if they care about their code being used by others) should provide sufficient documentation. If you are working on Windows, I do not know how that system (and never have used it for any prolonged period of time) so you'll have to look to the internet on where to find compilers and other developer tools.
### Bioconductor
The last main location to find packages is on the Bioconductor site, a collection of software for bioinformatic analyses. To install from the bioconductor site you need to download their own installer as:
```{r eval=FALSE}
source("http://bioconductor.org/biocLite.R")
biocLite()
```
And then to install packages you use
```{r eval=FALSE}
biocLite(c("GenomicFeatures", "AnnotationDbi"))
```
There are no libraries used in this text from bioconductor as this text is more focused on marker-based population genetics and not sequences, annotations, etc.
### Troublesome Packages
Some packages provide a unique set of problems for getting them onto your computer. There are a variety of reasons for this and Google is your friend (though it would be easier if R was named something more unique than the 18^^th^^ letter of the alphabet...).
#### RGDAL & RGEOS {#rgdal-rgeos-packages}
Every time I upgrade in any significant way, two R libraries seem to raise their ugly heads and scream like a spoiled child—rgdal and rgeos. Why do these two have to be SOOOO much of a pain? Why can't we have a auto build of a binary with all the options in it for OSX? Who knows? I always feel like I get the fuzzy end of the lollipop with these two. Here is my latest approach for getting them going.
First you have to make sure you have the latest GDAL libraries. I used to get mine from [Kyngchaos](http://www.kyngchaos.com/software/frameworks), just download the framework, install it, and then do some kind of long R CMD INSTALL dance, which seems to no longer work for me. I also tried installing from Ripley's repository and found that (a) It was a version older than the one I already had on my machine, and (b) you can't install from that repository, there is a malformed header and the install.packages() function just barfs.
Time to try something new. I typically stay away from the various installer frameworks out there on OSX to keep everything in Frameworks. But this time, I used MacPorts. You can find the latest version here. Here is how I got it to help me out.
- Download XCode from Apple, it is both free and has a lot of tools that make your life as a programmer easier. It is a big package and you'll have to install the command line developer tools as well. You will be prompted on how to do this.
- Downloaded the version of macports for your OS, I'm currently on 10.11 and installed it with little problems. It takes a bit of time at the end of the installation because it is downloading a lot of information. Be patient.
- In the terminal, I updated it `sudo ports -v selfupdate` and again be patient, it is going to grab a lot of stuff from the internet.
- I then used it to install `gdal` as a unix library (rather than as a framework so it won't be located in /Library/Frameworks) by sudo ports install `gdal`. There were a lot of dependencies for this one so it took a while.
- I then had R install rgdal as `install.packages( rgdal, type="source")`
Worked like a charm.
## Libraries Used in Text
This work requires several libraries that you may need to get from either CRAN or GitHub. The following if you run the following code, you should be up-to-date on the necessary packages used throughout this text.
```{r echo=FALSE, warning=FALSE}
files <- list.files(".",pattern=".rmd")
libraries <- c("gstudio","popgraph","ggplot2")
for( file in files ) {
suppressWarnings(s <- system( paste("grep -E -o 'library\\(\\w+\\)'",file), intern=TRUE))
if( length(s) > 0 ) {
s <- strsplit(s,"(",fixed=TRUE)
for( item in s ){
if( length(item) == 2){
library <- strsplit(item[2],")",fixed=TRUE)[[1]]
libraries <- c(libraries,library)
}
}
}
}
pkgs <- sort( unique(libraries) )
idx <- which( pkgs == "package_name" )
pkgs <- pkgs[-idx]
pkgs_df <- data.frame(Name = pkgs, Title = NA)
for(i in seq_along(pkgs)){
f = system.file(package = pkgs[i], "DESCRIPTION")
if( nchar(f)> 1) {
# Title is always on 3rd line
title = readLines(f)
title = title[grep("Title: ", title)]
pkgs_df$Title[i] = gsub("Title: ", "", title)
}
}
knitr::kable(pkgs_df,caption = "R packages used in the examples shown in this book.")
# make sure these libraries are already installed on this machine
inst_pkgs <- installed.packages()
to_install <- setdiff( pkgs, inst_pkgs )
if( length(to_install))
install.packages(to_install, repos="https://cran.rstudio.org")
# See if maxent.jar is installed in the dismo java package.
path <- system.file("java", package="dismo")
files <- list.files(system.file("java", package="dismo"))
if( !("maxent.jar" %in% files) )
system( paste("cp ./spatial_data/maxent.jar", path) )
```
## The RStudio Environment
By itself, R can be used in the terminal or as the basic interface the installer provides. While both of these methods are sufficient for working in R, they are less than optimal (IMHO). I believe one of the most important tools you can use is a good IDE as it helps you organize and work with your data in ways that focus more on the outcome and less on the implementation. I've spent a lot of time in both Emacs and Vim and while they may be tools for the geek elite, RStudio has made me much more productive in terms of output per unit time.
```{r echo=FALSE}
knitr::include_graphics("./media/RStudioIDE.png")
```
The RStudio IDE can be can be downloaded directly from (http://rstudio.org) and comes in varieties for desktops and servers. The packages are easy to install and contain all the instructions you need. If you run your own server, you can install it and do your analyses via a web interface that looks identical to the desktop client (in fact it is more similar than you perhaps know).
The interface has four panes, two of which (the source code editor and the terminal output) you will work with the most often. The other panes have information on the history of your project, plot output, packages, help, and a file browser. You can type in R commands in the terminal window and receive responses directly, just as normal. However, you can also type in your code into a script file (files ending in *.R) and run them directly. You can also create Markdown and LaTeX documents embedding your code in with the verbiage, which is evaluated any time you make a html, rtf, docx, or pdf output. All the code from R for this book was written in RMarkdown and parsed as an html document. If you are a serious user of R, you can use LaTeX or RMarkdown to make documents, slides, presentations, posters, etc. It is a versatile tool and one worth looking into, especially as it pertains to reproducible research (something we all need to strive for).
<!--chapter:end:r_ecosystem.rmd-->
# Data Types {.imageChapter}
<div class="chapter_image"><img src="chapter_images/ch_milipedes.jpg"></div>
> The data we work with comes in many forms—integers, stratum, categories, genotypes, etc.—all of which we need to be able to work with in our analyses. In this chapter, the basic data types we will commonly use in population genetic analyses. This section covers some of the basic types of data we will use in R. These include numbers, character, factors, and logical data types. We will also introduce the locus object from the gstudio library and see how it is just another data type that we can manipulate in R.
The very first hurdle you need to get over is the oddness in the way in which R assigns values to variables.
```{r eval=FALSE}
variable <- value
```
Yes that is a less-than and dash character. This is the assignment operator that historically has been used and it is the one that I will stick with. In some cases you can use the ‘=' to assign variables instead but then it takes away the R-ness of R itself. For decision making, the equality operator (e.g., is this equal to that) is the double equals sign ‘=='. We will get into that below where we talk about logical types and later in decision making.
If you are unaware of what type a particular variable may be, you can always use the `type()` function and R will tell you.
```{r eval=FALSE}
class( variable )
```
R also has a pretty good help system built into itself. You can get help for any function by typing a question mark in front of the function name. This is a particularly awesome features because at the end of the help file, there is often examples of its usage, which are priceless. Here is the documentation for the ‘help' function as given by:
```{r eval=FALSE}
?help
```
There are also package vignettes available (for most packages you download) that provide additional information on the routines, data sets, and other items included in these packages. You can get a list of vignettes currently installed on your machine by:
```{r eval=FALSE}
vignette()
```
and vignettes for a particular package by passing the package name as an argument to the function itself.
## Numeric Data Types
The quantitative measurements we make are often numeric, in that they can be represented as as a number with a decimal component (think weight, height, latitude, soil moisture, ear wax viscosity, etc.). The most basic type of data in R, is the numeric type and represents both integers and floating point numbers (n.b., there is a strict integer data type but it is often only needed when interfacing with other C libraries and can for what we are doing be disregarded).
Assigning a value to a variable is easy
```{r}
x <- 3
x
```
By default, R automatically outputs whole numbers numbers within decimal values appropriately.
```{r}
y <- 22/7
y
```
If there is a mix of whole numbers and numbers with decimals together in a container such as
```{r}
c(x,y)
```
then both are shown with decimals. The `c()` part here is a function that combines several data objects together into a vector and is very useful. In fact, the use of vectors are are central to working in R and functions almost all the functions we use on individual variables can also be applied to vectors.
A word of caution should be made about numeric data types on any computer. Consider the following example.
```{r}
x <- .3 / 3
x
```
which is exactly what we'd expect. However, the way in which computers store decimal numbers plays off our notion of significant digits pretty well. Look what happens when I print out x but carry out the number of decimal places.
```{r}
print(x, digits=20)
```
Not quite 0.1 is it? Not that far away from it but not exact. That is a general problem, not one that R has any more claim to than any other language and/or implementation. Does this matter much, probably not in the realm of the kinds of things we do in population genetics, it is just something that you should be aware of.
You can make random sets of numeric data by using using functions describing various distributions. For example, some random numbers from the normal distribution are:
```{r}
rnorm(10)
```
from the normal distribution with designated mean and standard deviation:
```{r}
rnorm(10,mean=42,sd=12)
```
A poisson distribution with mean 2:
```{r}
rpois(10,lambda = 2)
```
and the $\chi^2$ distribution with 1 degree of freedom:
```{r}
rchisq(10, df=1)
```
There are several more distributions that if you need to access random numbers, quantiles, probability densities, and cumulative density values are available.
### Coercion to Numeric
All data types have the potential ability to take another variable and coerce it into their type. Some combinations make sense, and some do not. For example, if you load in a CSV data file using read_csv(), and at some point a stray non-numeric character was inserted into one of the cells on your spreadsheet, R will interpret the entire column as a character type rather than as a numeric type. This can be a very frustrating thing, spreadsheets should generally be considered evil as they do all kinds of stuff behind the scenes and make your life less awesome.
Here is an example of coercion of some data that is initially defined as a set of characters
```{r}
x <- c("42","99")
x
```
and is coerced into a numeric type using the as.numeric() function.
```{r}
y <- as.numeric( x )
y
```
It is a built-in feature of the data types in R that they all have (or should have if someone is producing a new data type and is being courteous to their users) an `as.X()` function. This is where the data type decides if the values asked to be coerced are reasonable or if you need to be reminded that what you are asking is not possible. Here is an example where I try to coerce a non-numeric variable into a number.
```{r}
x <- "The night is dark and full of terrors..."
as.numeric( x )
```
By default, the result should be NA (missing data/non-applicable) if you ask for things that are not possible.
## Characters
A collection of letters, number, and or punctuation is represented as a character data type. These are enclosed in either single or double quotes and are considered a single entity. For example, my name can be represented as:
```{r}
prof <- "Rodney J. Dyer"
prof
```
In R, character variables are considered to be a single entity, that is the entire prof variable is a single unit, not a collection of characters. This is in part due to the way in which vectors of variables are constructed in the language. For example, if you are looking at the length of the variable I assigned my name to you see
```{r}
length(prof)
```
which shows that there is only one ‘character' variable. If, as is often the case, you are interested in knowing how many characters are in the variable prof, then you use the
```{r}
nchar(prof)
```
function instead. This returns the number of characters (even the non-printing ones like tabs and spaces.
```{r}
nchar(" \t ")
```
As all other data types, you can define a vector of character values using the `c()` function.
```{r}
x <- "I am"
y <- "not"
z <- 'a looser'
terms <- c(x,y,z)
terms
```
And looking at the `length()` and `nchar()` of this you can see how these operations differ.
```{r}
length(terms)
nchar(terms)
```
### Concatenation of Characters
Another common use of characters is concatenating them into single sequences. Here we use the function `paste()` and can set the separators (or characters that are inserted between entities when we collapse vectors). Here is an example, entirely fictional and only provided for instructional purposes only.
```{r}
paste(terms, collapse=" ")
```
```{r}
paste(x,z)
```
```{r}
paste(x,z,sep=" not ")
```
### Coercion to Characters
A character data type is often the most basal type of data you can work with. For example, consider the case where you have named sample locations. These can be kept as a character data type or as a factor (see below). There are benefits and drawbacks to each representation of the same data (see below). By default (as of the version of R I am currently using when writing this book), if you use a function like read_table() to load in an external file, columns of character data will be treated as factors. This can be good behavior if all you are doing is loading in data and running an analysis, or it can be a total pain in the backside if you are doing more manipulative analyses.
Here is an example of coercing a numeric type into a character type using the `as.character()` function.
```{r}
x <- 42
x
```
```{r}
y <- as.character(x)
y
```
## Factors
A factor is a categorical data type. If you are coming from SAS, these are class variables. If you are not, then perhaps you can think of them as mutually exclusive classifications. For example, an sample may be assigned to one particular locale, one particular region, and one particular species. Across all the data you may have several species, regions, and locales. These are finite, and defined, sets of categories. One of the more common headaches encountered by people new to R is working with factor types and trying to add categories that are not already defined.
Since factors are categorical, it is in your best interest to make sure you label them in as descriptive as a fashion as possible. You are not saving space or cutting down on computational time to take shortcuts and label the locale for Rancho Santa Maria as RSN or pop3d or 5. Our computers are fast and large enough, and our programmers are cleaver enough, to not have to rename our populations in numeric format to make them work (hello STRUCTURE I'm calling you out here). The only thing you have to loose by adopting a reasonable naming scheme is confusion in your output.
To define a factor type, you use the function `factor()` and pass it a vector of values.
```{r}
region <- c("North","North","South","East","East","South","West","West","West")
region <- factor( region )
region
```
When you print out the values, it shows you all the levels present for the factor. If you have levels that are not present in your data set, when you define it, you can tell R to consider additional levels of this factor by passing the optional levels= argument as:
```{r}
region <- factor( region, levels=c("North","South","East","West","Central"))
region
```
If you try to add a data point to a factor list that does not have the factor that you are adding, it will give you an error (or ‘barf' as I like to say).
```{r}
region[1] <- "Bob"
```
Now, I have to admit that the Error message in its entirety, with its `“[<-.factor`(`*tmp*`, 1, value = "Bob")"` part is, perhaps, not the most informative. Agreed. However, the “invalid factor level" does tell you something useful. Unfortunately, the programmers that put in the error handling system in R did not quite adhere to the spirit of the “fail loudly" mantra. It is something you will have to get good at. Google is your friend, and if you post a questions to (http://stackoverflow.org) or the R user list without doing serious homework, put on your asbestos shorts!
Unfortunately, the error above changed the first element of the region vector to NA (missing data). I'll turn it back before we move too much further.
```{r}
region[1] <- "North"
```
Factors in R can be either unordered (as say locale may be since locale A is not `>`, `=`, or `<` locale B) or they may be ordered categories as in `Small < Medium < Large < X-Large`. When you create the factor, you need to indicate if it is an ordered type (by default it is not). If the factors are ordered in some way, you can also create an ordination on the data. If you do not pass a levels= option to the `factors()` function, it will take the order in which they occur in data you pass to it. If you want to specify an order for the factors specifically, pass the optional `levels=` and they will be ordinated in the order given there.
```{r}
region <- factor( region, ordered=TRUE, levels = c("West", "North", "South", "East") )
region
```
### Missing Levels in Factors
There are times when you have a subset of data that do not have all the potential categories.
```{r}
subregion <- region[ 3:9 ]
subregion
```
```{r}
table( subregion )
```
## Logical Types
A logical type is either TRUE or FALSE, there is no in-between. It is common to use these types in making decisions (see if-else decisions) to check a specific condition being satisfied. To define logical variables you can either use the TRUE or FALSE directly
```{r}
canThrow <- c(FALSE, TRUE, FALSE, FALSE, FALSE)
canThrow
```
or can implement some logical condition
```{r}
stable <- c( "RGIII" == 0, nchar("Marshawn") == 8)
stable
```
on the variables. Notice here how each of the items is actually evaluated as to determine the truth of each expression. In the first case, the character is not equal to zero and in the second, the number of characters (what `nchar()` does) is indeed equal to 8 for the character string “Marshawn".
It is common to use logical types to serve as indices for vectors. Say for example, you have a vector of data that you want to select some subset from.
```{r}
data <- rnorm(20)
data
```
Perhaps you are on interested in the non-negative values
```{r}
data[ data > 0 ]
```
If you look at the condition being passed to as the index
```{r}
data > 0
```
you see that individually, each value in the data vector is being evaluated as a logical value, satisfying the condition that it is strictly greater than zero. When you pass that as indices to a vector it only shows the indices that are `TRUE`.
You can coerce a value into a logical if you understand the rules. Numeric types that equal 0 (zero) are `FALSE`, always. Any non-zero value is considered `TRUE`. Here I use the modulus operator, `%%`, which provides the remainder of a division.
```{r}
1:20 %% 2
```
which used as indices give us
```{r}
data[ (1:20 %% 2) > 0 ]
```
You can get as complicated in the creation of indices as you like, even using logical operators such as OR and AND. I leave that as an example for you to play with.
<!--chapter:end:datatypes.rmd-->
# Data Containers {.imageChapter}
<div class="chapter_image"><img src="chapter_images/ch_turtle.jpg"></div>
We almost never work with a single datum^[The word *data* is plural, datum is singular], rather we keep lots of data. Moreover, the kinds of data are often heterogeneous, including categorical (Populations, Regions), continuous (coordinates, rainfall, elevation), imagry (hyperspectral, LiDAR), and perhaps even genetic. R has a very rich set of containers into which we can stuff our data as we work with it. Here these container types are examined and the restrictions and benefits associated with each type are explained.
## Vectors
We have already seen several examples of several vectors in action (see the introduction to Numeric data types for example). A vector of objects is simply a collection of them, often created using the `c()` function (*c* for combine). Vectorized data is restricted to having homogeneous data types---you cannot mix character and numeric types in the same vector. If you try to mix types, R will either coerce your data into a reasonable type
```{r}
x <- c(1,2,3)
x
y <- c(TRUE,TRUE,FALSE)
y
z <- c("I","am","not","a","looser")
z
```
or coearce them into one type that is amenable to all the types of data that you have given it. In this example, a Logical, Character, Constant, and Function are combined resulting in a vector output of type Character.
```{r}
w <- c(TRUE, "1", pi, ls())
w
class(w)
```
Accessing elements within a vector are done using the square bracket `[]` notation. All indices (for vectors and matrices) start at 1 (not zero as is the case for some languages). Getting and setting the components within a vector are accomplished using numeric indices with the assignment operators just like we do for variables containing a single value.
```{r}
x
x[1] <- 2
x[3] <- 1
x
x[2]
```
A common type of vector is that of a sequences. We use sequences all the time, to iterate through a list, to counting generations, etc. There are a few ways to generate sequences, depending upon the step sequence. For a sequence of whole numbers, the easiest is through the use of the colon operator.
```{r}
x <- 1:6
x
```
This provides a nice shorthand for getting the values X:Y from X to Y, inclusive. It is also possible to go backwards using this operator, counting down from X to Y as in:
```{r}
x <- 5:2
x
```
The only constraint here is that we are limited to a step size of 1.0. It is possible to use non-integers as the bounds, it will just count up by 1.0 each time.
```{r}
x <- 3.2:8.4
x
```
If you are interested in making a sequence with a step other than 1.0, you can use the `seq()` function. If you do not provide a step value, it defaults to 1.0.
```{r}
y <- seq(1,6)
y
```
But if you do, it will use that instead.
```{r}
z <- seq(1,20,by=2)
z
```
It is also possible to create a vector of objects as repetitions using the `rep()` (for repeat) function.
```{r}
rep("Beetlejuice",3)
```
If you pass a vector of items to `rep()`, it can repeat these as either a vector being repeated (the default value)
```{r}
x <- c("No","Free","Lunch")
rep(x,time=3)
```
or as each item in the vector repeated.
```{r}
rep(x,each=3)
```
## Matrices
A matrix is a 2- or higher dimensional container, most commonly used to store numeric data types. There are some libraries that use matrices in more than two dimensions (rows and columns and sheets), though you will not run across them too often. Here I restrict myself to only 2-dimensional matrices.
You can define a matrix by giving it a set of values and an indication of the number of rows and columns you want. The easiest matrix to try is one with empty values:
```{r}
matrix(nrow=2, ncol=2)
```
Perhaps more useful is one that is pre-populated with values.
```{r}
matrix(1:4, nrow=2 )
```
Notice that here, there were four entries and I only specified the number of rows required. By default the ‘filling-in' of the matrix will proceed down column (*by-column*). In this example, we have the first column with the first two entries and the last two entries down the second column. If you want it to fill by row, you can pass the optional argument
```{r}
matrix(1:4, nrow=2, byrow=TRUE)
```
and it will fill *by-row*.
When filling matrices, the default size and the size of the data being added to the matrix are critical. For example, I can create a matrix as:
```{r}
Y <- matrix(c(1,2,3,4,5,6),ncol=2,byrow=TRUE)
Y
```
or
```{r}
X <- matrix(c(1,2,3,4,5,6),nrow=2)
X
```
and both produce a similar matrix, only transposed.
```{r}
X == t(Y)
```
In the example above, the number of rows (or columns) was a clean multiple of the number of entries. However, if it is not, R will fill in values.
```{r}
X <- matrix(c(1,2,3,4,5,6),ncol=4, byrow=TRUE)
```
Notice how you get a warning from the interpreter. But that does not stop it from filling in the remaining slots by starting over in the sequence of numbers you passed to it.
```{r}
X
```
The dimensionality of a matrix (and `data.frame` as we will see shortly) is returned by the `dim()` function. This will provide the number of rows and columns as a vector.
```{r}
dim(X)
```
Accessing elements to retrieve or set their values within a matrix is done using the square brackets just like for a vector but you need to give `[row,col]` indices. Again, these are 1-based so that
```{r}
X[1,3]
```
is the entry in the 1st row and 3rd column.
You can also use ‘slices' through a matrix to get the rows
```{r}
X[1,]
```
or columns
```{r}
X[,3]
```
of data. Here you just omit the index for the entity you want to span. Notice that when you grab a slice, even if it is a column, is given as a vector.
```{r}
length(X[,3])
```
You can grab a sub-matrix using slices if you give a range (or sequence) of indices.
```{r}
X[,2:3]
```
If you ask for values from a matrix that exceed its dimensions, R will give you an error.
```{r,eval=FALSE}
X[1,8]
```
```
## Error in X[1, 8] : subscript out of bounds
## Calls: <Anonymous> ... handle -> withCallingHandlers -> withVisible -> eval -> eval
## Execution halted
```
There are a few cool extensions of the `rep()` function that can be used to create matrices as well. They are optional values that can be passed to the function.
* `times=x`: This is the default option that was occupied by the ‘3' in the example above and represents the number of times that first argument will be repeated.
* `each=x` This will take each element in the first argument are repeat them `each` times.
* `length.out=x`: This make the result equal in length to `x`.
In combination, these can be quite helpful. Here is an example using numeric sequences in which it is necessary to find the index of all entries in a 3x2 matrix. To make the indices, I bind two columns together using `cbind()`. There is a matching row binding function, denoted as `rbind()` (perhaps not so surprisingly). What is returned is a matrix
```{r}
indices <- cbind( rep(1:2, each=3), rep(1:3,times=2), rep(5,length.out=6) )
indices
```
## Lists
A list is a type of vector but is indexed by ‘keys' rather than by numeric indices. Moreover, lists can contain heterogeneous types of data (e.g., values of different `class`), which is not possible in a vector type. For example, consider the list
```{r}
theList <- list( x=seq(2,40, by=2), dog=LETTERS[1:5], hasStyle=logical(5) )
summary(theList)
```
which is defined with a numeric, a character, and a logical component. Each of these entries can be different in length as well as type. Once defined, the entries may be observed as:
```{r}
theList
```
Once created, you can add variables to the list using the $-operator followed by the name of the key for the new entry.
```{r}
theList$my_favoriate_number <- 2.9 + 3i
```
or use double brackets and the name of the variable as a character string.
```{r}
theList[["lotto numbers"]] <- rpois(7,lambda=42)
```
The keys currently in the list are given by the `names()` function
```{r}
names(theList)
```
Getting and setting values within a list are done the same way using either the `$`-operator
```{r}
theList$x
theList$x[2] <- 42
theList$x
```
or the double brackets
```{r}
theList[["x"]]
```
or using a numeric index, but that numeric index is looks to the results of `names()` to figure out which key to use.
```{r}
theList[[2]]
```
The use of the double brackets in essence provides a direct link to the variable in the list whose name is second in the `names()` function (*dog* in this case). If you want to access elements within that variable, then you add a second set of brackets on after the double ones.
```{r}
theList[[1]][3]
```
This deviates from the matrix approach as well as from how we access entries in a `data.frame` (described next). It is not a single square bracket with two indices, that gives you an error:
```{r eval=FALSE}
theList[1,3]
```
```
## Error in theList[1, 3] : incorrect number of dimensions
## Calls: <Anonymous> ... handle -> withCallingHandlers -> withVisible -> eval -> eval
## Execution halted
```
List are rather robust objects that allow you to store a wide variety of data types (including nested lists). Once you get the indexing scheme down, it they will provide nice solutions for many of your computational needs.
## Data Frames
The `data.frame` is the default data container in R. It is analogous to both a spreadsheet, at least in the way that I have used spreadsheets in the past, as well as a database. If you consider a single spreadsheet containing measurements and observations from your research, you may have many columns of data, each of which may be a different kind of data. There may be `factors` representing designations such as species, regions, populations, sex, flower color, etc. Other columns may contain numeric data types for items such as latitude, longitude, dbh, and nectar sugar content. You may also have specialized columns such as dates collected, genetic loci, and any other information you may be collecting.
On a spreadsheet, each column has a unified data type, either quantified with a value or as a missing value, `NA`, in each row. Rows typically represent the sampling unit, perhaps individual or site, along which all of these various items have been measured or determined. A `data.frame` is similar to this, at least conceptually. You define a `data.frame` by designating the columns of data to be used. You do not need to define all of them, more can be added later. The values passed can be sequences, collections of values, or computed parameters. For example:
```{r}
df <- data.frame( ID=1:5, Names=c("Bob","Alice","Vicki","John","Sarah"), Score=100 - rpois(5,lambda=10))
df
```
You can see that each column is a unified type of data and each row is equivalent to a record. Additional data columns may be added to an existing data.frame as:
```{r}
df$Passed_Class <- c(TRUE,TRUE,TRUE,FALSE,TRUE)
```
Since we may have many (thousands?) of rows of observations, a `summary()` of the data.frame can provide a more compact description.
```{r}
summary(df)
```
We can add columns of data to the data.frame after the fact using the `$`-operator to indicate the column name. Depending upon the data type, the summary will provide an overview of what is there.
### Indexing Data Frames
You can access individual items within a `data.frame` by numeric index such as:
```{r}
df[1,3]
```
You can slide indices along rows (which return a new `data.frame` for you)
```{r}
df[1,]
```
or along columns (which give you a vector of data)
```{r}
df[,3]
```
or use the `$`-operator as you did for the list data type to get direct access to a either all the data or a specific subset therein.
```{r}
df$Names[3]
```
Indices are ordered just like for matrices, rows first then columns. You can also pass a set of indices such as:
```{r}
df[1:3,]
```
It is also possible to use logical operators as indices. Here I select only those names in the data.frame whose score was >90 and they passed popgen.
```{r}
df$Names[df$Score > 90 & df$Passed_Class==TRUE]
```
This is why `data.frame` objects are very database like. They can contain lots of data and you can extract from them subsets that you need to work on. This is a VERY important feature, one that is vital for reproducible research. Keep you data in one and only one place.
<!--chapter:end:r_containers.rmd-->
# Programming {.imageChapter}