amg.Rmd

--- 
title: "Applied Population Genetics"
author: "Rodney J. Dyer"
description: "A textbook for the use of R in spatial genetic data analysis."
cover-image: "media/TitlePage_small.png"
site: bookdown::bookdown_site
output: bookdown::gitbook
documentclass: book
bibliography: packages.bib
biblio-style: apalike
link-citations: yes
github-repo: "dyerlab/AMG"
---

# Applied Population Genetics {- .white_letters}

<a href="http://dyerlab.org">
```{r echo=FALSE}
knitr::include_graphics("media/APG-Title-Wide.png")
```
</a>

Build Date: **`r date()`**

The content of this text is modified, added to, and changed continually.  You are welcome to use the content of this book in its present form and you can provide feedback, comments, or suggestions for additions by contacting me at [rjdyer@vcu.edu](mailto://rjdyer@vcu.edu).  This work will continue to be hosted online and continually updated as new approaches and applications become available.

```{r echo=FALSE, message=FALSE, warning=FALSE}
library(ggplot2)
 library(knitr)
theme_set( theme_bw(base_size=14) )
knitr::opts_chunk$set( warning=FALSE, message=FALSE,error = FALSE )
```


## Logistics {-}

&copy; 2017 by R.J. Dyer.

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.  Under this license, authorized individuals may copy, distribute, display and perform the work and make derivative works and remixes based on this text only if they give the original author credit (attribution).  You are also free to distribute derivative works only under a license identical ("not more restrictive") to the license that governs this original work. 

Dr. Rodney Dyer is an Associate Professor and Director for the Center for Environmental Studies at Virginia Commonwealth University in Richmond, Virginia, USA.  His research focuses on genetic connectivity and structure and how the environment influences both.  More information on his research can be found at [http://dyerlab.org](http://dyerlab.org).


## Motivation {-}

This is, most of all, not a book about R.  This is also not a "Population Genetics in R" textbook.  It is a book about how we do population genetic analyses, for which R is a tool that allows us to reach beyond the limitations of point-and-click interfaces. As a field, Population Genetics has a broad set of textbooks describing the underlying theory.  As a student, I cut my teeth on the texts of Hartl (1981), Hartl & Clark (1997) and have used other great texts such as Hamilton (2011) and Hedrick (2009) in the classroom to teach the subject for the last decade.  In late 2015, there are a host of texts available to the student of population genetics—amazon lists 150 different books under the search term “Population Genetics textbook"—why do another one?  What I have found is that while the theory behind this discipline has been well developed, its application has been largely neglected.  

As a new graduate student, fresh out of my first population genetics course, I felt armed with the understanding of how microevolutionary processes influence the distribution of alleles within and among populations.  What I wasn't prepared for was sitting in front of the computer, looking at a few thousand individuals assayed for several different loci and actually ‘doing' population genetics.  All of those textbooks provide me with what is expected and the theory behind it, though often fall short on teaching me how I could apply those inferences to data I actually collect.  If you are a theoretical population geneticist, those texts and your ability to integrate mathematical equations will provide you a research lifetime of work.  However, if you are practitioner who uses population genetic tools to answer conservation, management, or ecologically inspired questions, the evolutionary expectations of population genetic processes will most likely not be as important as directly estimating inbreeding, exploring ongoing connectivity, or determining genetic granularity of existing populations.  This is where this textbook is focusing, a seemingly uninhabited niche in the knowledge ecosystem of graduate level population genetics.

This text was developed out of a graduate course in Population Genetics that I've been teaching at Virginia Commonwealth University since 2005.  This texts uses R and many additional libraries available within the R ecosystem to illustrate how to perform specific types of analyses and what kind of biological inferences we can gain from them.  In the process, we cover materials that are commonly needed in the application of population genetic analysis such as spatial autocorrelation, paternity analysis, and the use of permutation while at the same time highlighting logistical challenges commonly encountered in analyzing real data such as incomplete sampling, missing data, and rarefaction.

## Typography {-}

This text is designed primarily as an electronic publication (ePub) and has dynamical content included within.  If you are reading a static copy (as PDF or printed text), you are not getting the full experience of what this book has been designed to deliver.  Much of the text is devoted to quantitative analysis of data in R, and as such I've typeset the R components differently from the flowing text.  Text intended to be input into R is typeset as a fixed width font and colored using the default color scheme found in RStudio (http://rstudio.org).  Here are two input examples.

```{r}
value <- 20
name <- "Perdedor"
```


This text is amenable to copy-and-paste action so you can perform the same calculations on your computer as are done in the text.  When R returns an answer to an analysis or prints the contents of a variable out, the results are also typeset in fixed width font but each line is prefixed by two hash marks.  


```{r}
rnorm(10)
```

The lines with hashes are not something you are going to copy-and-paste into your R session, it is what R is giving you.  Inline code (e.g., code inserted into a sentence in a descriptive context such as discussing the `rnorm()` function in the previous example) is typeset similarly though not necessarily intended to be cut-and-pasted into your session.  

Throughout the ePub, there are also dynamical content.  Some of this content may be condensed text put into a scrolling text-box.  Here the notion is to provide direct access to information without unnecessarily adding length to the overall text.  Here is an example, showing the documentation associated with the R `help.start()` function within a scrolling text box.

<div class="scrollingbox"><pre>
help.start {utils}  											R Documentation

Hypertext Documentation

Description

Start the hypertext (currently HTML) version of R's online documentation.

Usage

help.start(update = FALSE, gui = "irrelevant", browser = getOption("browser"), remote = NULL)

Arguments

update - logical: should this attempt to update the package index to reflect the currently available packages. (Not attempted if remote is non-NULL.)
gui - just for compatibility with S-PLUS.
browser - the name of the program to be used as hypertext browser. It should be in the PATH, or a full path specified. Alternatively, it can be an R function which will be called with a URL as its only argument. This option is normally unset on Windows, when the file-association mechanism will be used.
remote - A character string giving a valid URL for the ‘R_HOME' directory on a remote location.

Details

Unless remote is specified this requires the HTTP server to be available (it will be started if possible: see startDynamicHelp).

One of the links on the index page is the HTML package index, ‘R.home("docs")/html/packages.html', which can be remade by make.packages.html(). For local operation, the HTTP server will remake a temporary version of this list when the link is first clicked, and each time thereafter check if updating is needed (if .libPaths has changed or any of the directories has been changed). This can be slow, and using update = TRUE will ensure that the packages list is updated before launching the index page.

Argument remote can be used to point to HTML help published by another R installation: it will typically only show packages from the main library of that installation.

See Also

help() for on- and off-line help in other formats.
browseURL for how the help file is displayed.
RSiteSearch to access an on-line search of R resources.

Examples

help.start()
## Not run: 
## the 'remote' arg can be tested by
help.start(remote = paste0("file://", R.home()))

## End(Not run)
</pre></div>


In addition to shoving static content into scrolling windows, longer R scripts will also be inserted into these widgets so that space can be saved while preserving syntax highlighting and code format.


## Interactive Content {-}

Where possible, I have included interactive content in the text.  Examples include dynamical plots such as embedding google map objects in the page, network structure that can be manipulated in the browser, and widgets that show how a process influences popualtion genetic features by allowing you to direclty manipulate the parameters in the model.  


## Dedication {-}

I would like to dedicate this text and the motivations and inspirations that underly its creation, to my thesis advisor, Dr. Victoria L. Sork.  She has an uncanny ability to see beyond the questions that we know how to answer and to help us focus on the answers that we are really trying to find.


<!--chapter:end:index.rmd-->

# The R Ecosystem {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_farm.jpg"></div>

> R is an open-source, community driven platform available for a wide variety of computer operating systems and is becoming a de facto standard in modern biological analyses.  It is not a point-and-click interface, rather it is a rich analytical ecosystem that will make your research and scientific life more pleasurable.

## Getting R Configured

The grammar of the R language was derived from another system called S-Plus.  S-Plus was a proprietary analysis platform developed by AT&T and licenses were sold for its use, mostly in industry and education.  Given the closed nature of the S-Plus platform, R was developed with the goal of creating an interpreter that could read grammar similar to S-Plus but be available to the larger research community.   The use of R has increased dramatically due to its open nature and the ability of people to share code solutions with relatively little barriers.

The main repository for R is located at the [CRAN Repository](http://cran.r-project.org), which is where you can download the latest version.  It is in your best interests to make sure you update the underlying R system, changes are made continually (perhaps despite the outward appearance of the website).  

```{r echo=FALSE}
knitr::include_graphics("./media/CRAN.png")
```

The current version of this book uses version 3.2.  To get the correct version, open the page and there should be a link at the top for your particular computing platform. Download the appropriate version and install it following the instructions appropriate for your computer.

### Packages

The base R system comes with enough functionality to get you going.  By default, there is only a limited amount of functionality in R, which is a great thing.  You only load and use the packages that you intend to use.  There are just too many packages to have them all loaded into memory at all times and there is such a broad range of packages, it is not likely you'd need more than a small fraction of the packages during the course of all your analyses.
Once you have a package, you can tell R that you intend to use it by either

```{r eval=FALSE}
library(package_name)
```

or

```{r eval=FALSE}
 library(package_name)
```

They are approximately equivalent, differing in only that the second one actually returns a TRUE or FALSE indicating the presence of that library on you machine.  If you do not want to be mocked by other users of R, I would recommend the `library` version---there are situations where `require` will not do what you think you want it to do (even though it is a verb and should probably be the correct one to use grammatically).  

There are, at present, a few different places you can get packages. The packages can either be downloaded from these online repositories and installed into R or you can install them from within R itself.  Again, I'll prefer the latter as it is a bit more straightforward.   

### CRAN

The main repository for packages is hosted by the r-project page itself.  There are packages with solutions to analyses ranging from Approximate Bayesian Computation to Zhang & Pilon's approaches to characterizing climatic trends.  The list of these packages is large and ever growing.  It can be found on CRAN under the packages menu. To install a package from this repository, you use the function 

```{r eval=FALSE}
install.packages("thePackageName") 
```

You can see that R went to the CRAN mirror site (I use the rstudio one), downloaded the package, look for particular dependencies that that package may have and download them as well for you.  It should install these packages for you and give you an affirmative message indicating it had done so.  

At times, there are some packages that are not available in binary form for all computer systems (the rgdal package is one that comes to mind for which we will provide a work around later) and the packages need to be compiled.  This means that you need to have some additional tools and/or libraries on your machine to take the code, compile it, and link it together to make the package.  In these cases, the internet and the documentation that the developer provide are key to your sanity.

### GitHub

There are an increasing number of projects that are being developed either in the open source community or shared among a set of developers.  These projects are often hosted on http://www.github.com where you can get access to the latest code updates.  The packages that I develop are hosted on Github at (http://github.com/dyerlab) and only pushed to CRAN when I make major changes. 

```{r echo=FALSE}
knitr::include_graphics("./media/DyerlabGithub.png")
```


To install packages from Github you need to install the devtools library from CRAN first

```{r eval=FALSE}
install.packages("devtools")
library(devtools)
```

Then you need to use the devtools function `install_github()` to go grab it.  To do so you need two separate pieces of information, the name of the developer who is creating the repository and the name of the repository it is contained within.  For the gstudio package, the develop is `dyerlab` and the repository name is `gstudio`. If you are comfortable with git, you can also check out certain branches of development if you like with optional arguments to the `install_github()` function.  Here is how you would install both the `gstudio` and `popgraph` libraries I use in this book.

```{r eval=FALSE}
install_github("dyerlab/gstudio")
install_github("dyerlab/popgraph")
```

If you are starting to work with R and intend to compile packages, there are some tools that you will need to have on your machine.  For Windows machines, there is an rtools download EXE that you can get.  On OSX, you need to download the developers tools from Apple (available on the Application Store for free).  In either case, the developer (if they care about their code being used by others) should provide sufficient documentation.  If you are working on Windows, I do not know how that system (and never have used it for any prolonged period of time) so you'll have to look to the internet on where to find compilers and other developer tools.

### Bioconductor

The last main location to find packages is on the Bioconductor site, a collection of software for bioinformatic analyses.  To install from the bioconductor site you need to download their own installer as:

```{r eval=FALSE}
source("http://bioconductor.org/biocLite.R")
biocLite()
```

And then to install packages you use 

```{r eval=FALSE}
biocLite(c("GenomicFeatures", "AnnotationDbi"))
```


There are no libraries used in this text from bioconductor as this text is more focused on marker-based population genetics and not sequences, annotations, etc. 

### Troublesome Packages

Some packages provide a unique set of problems for getting them onto your computer.  There are a variety of reasons for this and Google is your friend (though it would be easier if R was named something more unique than the 18^^th^^ letter of the alphabet...). 

#### RGDAL & RGEOS {#rgdal-rgeos-packages}

Every time I upgrade in any significant way, two R libraries seem to raise their ugly heads and scream like a spoiled child—rgdal and rgeos.  Why do these two have to be SOOOO much of a pain? Why can't we have a auto build of a binary with all the options in it for OSX? Who knows? I always feel like I get the fuzzy end of the lollipop with these two. Here is my latest approach for getting them going.

First you have to make sure you have the latest GDAL libraries. I used to get mine from [Kyngchaos](http://www.kyngchaos.com/software/frameworks), just download the framework, install it, and then do some kind of long R CMD INSTALL dance, which seems to no longer work for me. I also tried installing from Ripley's repository and found that (a) It was a version older than the one I already had on my machine, and (b) you can't install from that repository, there is a malformed header and the install.packages() function just barfs.

Time to try something new. I typically stay away from the various installer frameworks out there on OSX to keep everything in Frameworks. But this time, I used MacPorts. You can find the latest version here. Here is how I got it to help me out.  

- Download XCode from Apple, it is both free and has a lot of tools that make your life as a programmer easier.  It is a big package and you'll have to install the command line developer tools as well.  You will be prompted on how to do this.
- Downloaded the version of macports for your OS, I'm currently on 10.11 and installed it with little problems.  It takes a bit of time at the end of the installation because it is downloading a lot of information.  Be patient. 
- In the terminal, I updated it `sudo ports -v selfupdate` and again be patient, it is going to grab a lot of stuff from the internet.
- I then used it to install `gdal` as a unix library (rather than as a framework so it won't be located in /Library/Frameworks) by sudo ports install `gdal`. There were a lot of dependencies for this one so it took a while.
- I then had R install rgdal as `install.packages( rgdal, type="source")`

Worked like a charm.


## Libraries Used in Text

This work requires several libraries that you may need to get from either CRAN or GitHub. The following if you run the following code, you should be up-to-date on the necessary packages used throughout this text.

```{r echo=FALSE, warning=FALSE}
files <- list.files(".",pattern=".rmd")
libraries <- c("gstudio","popgraph","ggplot2")
for( file in files ) {
  suppressWarnings(s <- system( paste("grep -E -o 'library\\(\\w+\\)'",file), intern=TRUE))
  if( length(s) > 0 ) {
    s <- strsplit(s,"(",fixed=TRUE)
    for( item in s ){
      if( length(item) == 2){
        library <- strsplit(item[2],")",fixed=TRUE)[[1]]
        libraries <- c(libraries,library)
      }
    }
  }
}
pkgs <- sort( unique(libraries) )
idx <- which( pkgs == "package_name" )
pkgs <- pkgs[-idx]
pkgs_df <- data.frame(Name = pkgs, Title = NA)
for(i in seq_along(pkgs)){
  f = system.file(package = pkgs[i], "DESCRIPTION")
  if( nchar(f)> 1) {
    # Title is always on 3rd line
    title = readLines(f)
    title = title[grep("Title: ", title)]
    pkgs_df$Title[i] = gsub("Title: ", "", title)    
  }
}
knitr::kable(pkgs_df,caption = "R packages used in the examples shown in this book.")

# make sure these libraries are already installed on this machine
inst_pkgs <- installed.packages()
to_install <- setdiff( pkgs, inst_pkgs )
if( length(to_install))
  install.packages(to_install, repos="https://cran.rstudio.org")


# See if maxent.jar is installed in the dismo java package.
path <- system.file("java", package="dismo")
files <- list.files(system.file("java", package="dismo"))
if( !("maxent.jar" %in% files) )
  system( paste("cp ./spatial_data/maxent.jar", path) )

```


## The RStudio Environment

By itself, R can be used in the terminal or as the basic interface the installer provides.  While both of these methods are sufficient for working in R, they are less than optimal (IMHO).  I believe one of the most important tools you can use is a good IDE as it helps you organize and work with your data in ways that focus more on the outcome and less on the implementation.  I've spent a lot of time in both Emacs and Vim and while they may be tools for the geek elite, RStudio has made me much more productive in terms of output per unit time.

```{r echo=FALSE}
knitr::include_graphics("./media/RStudioIDE.png")
```

The RStudio IDE can be can be downloaded directly from (http://rstudio.org) and comes in varieties for desktops and servers.  The packages are easy to install and contain all the instructions you need.  If you run your own server, you can install it and do your analyses via a web interface that looks identical to the desktop client (in fact it is more similar than you perhaps know).  

The interface has four panes, two of which (the source code editor and the terminal output) you will work with the most often.  The other panes have information on the history of your project, plot output, packages, help, and a file browser.  You can type in R commands in the terminal window and receive responses directly, just as normal.  However, you can also type in your code into a script file (files ending in *.R) and run them directly.  You can also create Markdown and LaTeX documents embedding your code in with the verbiage, which is evaluated any time you make a html, rtf, docx, or pdf output.  All the code from R for this book was written in RMarkdown and parsed as an html document.  If you are a serious user of R, you can use LaTeX or RMarkdown to make documents, slides, presentations, posters, etc.  It is a versatile tool and one worth looking into, especially as it pertains to reproducible research (something we all need to strive for).


<!--chapter:end:r_ecosystem.rmd-->

# Data Types {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_milipedes.jpg"></div>

> The data we work with comes in many forms—integers, stratum, categories, genotypes, etc.—all of which we need to be able to work with in our analyses. In this chapter, the basic data types we will commonly use in population genetic analyses.  This section covers some of the basic types of data we will use in R.  These include numbers, character, factors, and logical data types.  We will also introduce the locus object from the gstudio library and see how it is just another data type that we can manipulate in R.  

The very first hurdle you need to get over is the oddness in the way in which R assigns values to variables.

```{r eval=FALSE}
variable <- value
```

Yes that is a less-than and dash character.  This is the assignment operator that historically has been used and it is the one that I will stick with.  In some cases you can use the ‘=' to assign variables instead but then it takes away the R-ness of R itself.  For decision making, the equality operator (e.g., is this equal to that) is the double equals sign ‘=='.  We will get into that below where we talk about logical types and later in decision making.

If you are unaware of what type a particular variable may be, you can always use the `type()` function and R will tell you.

```{r eval=FALSE}
class( variable )
```

R also has a pretty good help system built into itself.  You can get help for any function by typing a question mark in front of the function name.  This is a particularly awesome features because at the end of the help file, there is often examples of its usage, which are priceless.  Here is the documentation for the ‘help' function as given by:

```{r eval=FALSE}
?help
```

There are also package vignettes available (for most packages you download) that provide additional information on the routines, data sets, and other items included in these packages. You can get a list of vignettes currently installed on your machine by:

```{r eval=FALSE}
vignette()
```

and vignettes for a particular package by passing the package name as an argument to the function itself.

## Numeric Data Types

The quantitative measurements we make are often numeric, in that they can be represented as as a number with a decimal component (think weight, height, latitude, soil moisture, ear wax viscosity, etc.).  The most basic type of data in R, is the numeric type and represents both integers and floating point numbers (n.b., there is a strict integer data type but it is often only needed when interfacing with other C libraries and can for what we are doing be disregarded).  

Assigning a value to a variable is easy

```{r}
x <- 3
x
```

By default, R automatically outputs whole numbers numbers within decimal values appropriately.

```{r}
y <- 22/7
y
```

If there is a mix of whole numbers and numbers with decimals together in a container such as

```{r}
c(x,y)
```

then both are shown with decimals. The `c()` part here is a function that combines several data objects together into a vector and is very useful.  In fact, the use of vectors are are central to working in R and functions almost all the functions we use on individual variables can also be applied to vectors.

A word of caution should be made about numeric data types on any computer.  Consider the following example.

```{r}
x <- .3 / 3
x
```

which is exactly what we'd expect.  However, the way in which computers store decimal numbers plays off our notion of significant digits pretty well.  Look what happens when I print out x but carry out the number of decimal places.

```{r}
print(x, digits=20)
```

Not quite 0.1 is it?  Not that far away from it but not exact.  That is a general problem, not one that R has any more claim to than any other language and/or implementation.  Does this matter much, probably not in the realm of the kinds of things we do in population genetics, it is just something that you should be aware of.
You can make random sets of numeric data by using using functions describing various distributions.  For example, some random numbers from the normal distribution are:

```{r}
rnorm(10)
```

from the normal distribution with designated mean and standard deviation:

```{r}
rnorm(10,mean=42,sd=12)
```

A poisson distribution with mean 2:

```{r}
rpois(10,lambda = 2)
```

and the $\chi^2$ distribution with 1 degree of freedom:

```{r}
rchisq(10, df=1)
```

There are several more distributions that if you need to access random numbers, quantiles, probability densities, and cumulative density values are available.

### Coercion to Numeric

All data types have the potential ability to take another variable and coerce it into their type.  Some combinations make sense, and some do not.  For example, if you load in a CSV data file using read_csv(), and at some point a stray non-numeric character was inserted into one of the cells on your spreadsheet, R will interpret the entire column as a character type rather than as a numeric type.  This can be a very frustrating thing, spreadsheets should generally be considered evil as they do all kinds of stuff behind the scenes and make your life less awesome.

Here is an example of coercion of some data that is initially defined as a set of characters

```{r}
x <- c("42","99")
x
```

and is coerced into a numeric type using the as.numeric() function. 

```{r}
y <- as.numeric( x )
y
```


It is a built-in feature of the data types in R that they all have (or should have if someone is producing a new data type and is being courteous to their users) an `as.X()` function.  This is where the data type decides if the values asked to be coerced are reasonable or if you need to be reminded that what you are asking is not possible.  Here is an example where I try to coerce a non-numeric variable into a number.

```{r}
x <- "The night is dark and full of terrors..."
as.numeric( x )
```


By default, the result should be NA (missing data/non-applicable) if you ask for things that are not possible.

## Characters

A collection of letters, number, and or punctuation is represented as a character data type.  These are enclosed in either single or double quotes and are considered a single entity.  For example, my name can be represented as:

```{r}
prof <- "Rodney J. Dyer"
prof

```

In R, character variables are considered to be a single entity, that is the entire prof variable is a single unit, not a collection of characters.  This is in part due to the way in which vectors of variables are constructed in the language.  For example, if you are looking at the length of the variable I assigned my name to you see

```{r}
length(prof)
```

which shows that there is only one ‘character' variable.  If, as is often the case, you are interested in knowing how many characters are in the variable prof, then you use the 

```{r}
nchar(prof)
```


function instead.  This returns the number of characters (even the non-printing ones like tabs and spaces.

```{r}
nchar(" \t ")
```

As all other data types, you can define a vector of character values using the `c()` function.

```{r}
x <- "I am"
y <- "not"
z <- 'a looser'
terms <- c(x,y,z)
terms
```

And looking at the `length()` and `nchar()` of this you can see how these operations differ.
```{r}
length(terms)
nchar(terms)
```


### Concatenation of Characters

Another common use of characters is concatenating them into single sequences.  Here we use the function `paste()` and can set the separators (or characters that are inserted between entities when we collapse vectors).  Here is an example, entirely fictional and only provided for instructional purposes only.

```{r}
paste(terms, collapse=" ")
```


```{r}
paste(x,z)
```


```{r}
paste(x,z,sep=" not ")
```

### Coercion to Characters

A character data type is often the most basal type of data you can work with.  For example, consider the case where you have named sample locations.  These can be kept as a character data type or as a factor (see below).  There are benefits and drawbacks to each representation of the same data (see below).  By default (as of the version of R I am currently using when writing this book), if you use a function like read_table() to load in an external file, columns of character data will be treated as factors.  This can be good behavior if all you are doing is loading in data and running an analysis, or it can be a total pain in the backside if you are doing more manipulative analyses.

Here is an example of coercing a numeric type into a character type using the `as.character()` function.  

```{r}
x <- 42
x
```

```{r}
y <- as.character(x)
y
```

## Factors

A factor is a categorical data type.  If you are coming from SAS, these are class variables.  If you are not, then perhaps you can think of them as mutually exclusive classifications.  For example, an sample may be assigned to one particular locale, one particular region, and one particular species.  Across all the data you may have several species, regions, and locales.  These are finite, and defined, sets of categories. One of the more common headaches encountered by people new to R is working with factor types and trying to add categories that are not already defined.

Since factors are categorical, it is in your best interest to make sure you label them in as descriptive as a fashion as possible.  You are not saving space or cutting down on computational time to take shortcuts and label the locale for Rancho Santa Maria as RSN or pop3d or 5.  Our computers are fast and large enough, and our programmers are cleaver enough, to not have to rename our populations in numeric format to make them work (hello STRUCTURE I'm calling you out here).  The only thing you have to loose by adopting a reasonable naming scheme is confusion in your output.

To define a factor type, you use the function `factor()` and pass it a vector of values.

```{r}
region <- c("North","North","South","East","East","South","West","West","West")
region <- factor( region )
region
```


When you print out the values, it shows you all the levels present for the factor.  If you have levels that are not present in your data set, when you define it, you can tell R to consider additional levels of this factor by passing the optional levels= argument as:

```{r}
region <- factor( region, levels=c("North","South","East","West","Central"))
region
```

If you try to add a data point to a factor list that does not have the factor that you are adding, it will give you an error (or ‘barf' as I like to say).

```{r}
region[1] <- "Bob"
```

Now, I have to admit that the Error message in its entirety, with its `“[<-.factor`(`*tmp*`, 1, value = "Bob")"` part is, perhaps, not the most informative.  Agreed.  However, the “invalid factor level" does tell you something useful.  Unfortunately, the programmers that put in the error handling system in R did not quite adhere to the spirit of the “fail loudly" mantra.  It is something you will have to get good at.  Google is your friend, and if you post a questions to (http://stackoverflow.org) or the R user list without doing serious homework, put on your asbestos shorts!

Unfortunately, the error above changed the first element of the region vector to NA (missing data). I'll turn it back before we move too much further.

```{r}
region[1] <- "North"
```

Factors in R can be either unordered (as say locale may be since locale A is not `>`, `=`, or `<` locale B) or they may be ordered categories as in `Small < Medium < Large < X-Large`.  When you create the factor, you need to indicate if it is an ordered type (by default it is not).   If the factors are ordered in some way, you can also create an ordination on the data.  If you do not pass a levels= option to the `factors()` function, it will take the order in which they occur in data you pass to it.  If you want to specify an order for the factors specifically, pass the optional `levels=` and they will be ordinated in the order given there.

```{r}
region <- factor( region, ordered=TRUE, levels = c("West", "North", "South", "East") )
region
```

### Missing Levels in Factors

There are times when you have a subset of data that do not have all the potential categories. 

```{r}
subregion <- region[ 3:9 ]
subregion
```

```{r}
table( subregion )
```


## Logical Types

A logical type is either TRUE or FALSE, there is no in-between.  It is common to use these types in making decisions (see if-else decisions) to check a specific condition being satisfied.  To define logical variables you can either use the TRUE or FALSE directly

```{r}
canThrow <- c(FALSE, TRUE, FALSE, FALSE, FALSE)
canThrow
```

or can implement some logical condition

```{r}
stable <- c( "RGIII" == 0, nchar("Marshawn") == 8)
stable
```

on the variables.  Notice here how each of the items is actually evaluated as to determine the truth of each expression.  In the first case, the character is not equal to zero and in the second, the number of characters (what `nchar()` does) is indeed equal to 8 for the character string “Marshawn".

It is common to use logical types to serve as indices for vectors.  Say for example, you have a vector of data that you want to select some subset from.

```{r}
data <- rnorm(20)
data
```


Perhaps you are on interested in the non-negative values

```{r}
data[ data > 0 ]
```

If you look at the condition being passed to as the index

```{r}
data > 0
```

you see that individually, each value in the data vector is being evaluated as a logical value, satisfying the condition that it is strictly greater than zero.  When you pass that as indices to a vector it only shows the indices that are `TRUE`.  

You can coerce a value into a logical if you understand the rules.  Numeric types that equal 0 (zero) are `FALSE`, always.  Any non-zero value is considered `TRUE`.  Here I use the modulus operator, `%%`, which provides the remainder of a division. 

```{r}
1:20 %% 2
```

which used as indices give us

```{r}
data[ (1:20 %% 2) > 0 ]
```

You can get as complicated in the creation of indices as you like, even using logical operators such as OR and AND. I leave that as an example for you to play with.  


<!--chapter:end:datatypes.rmd-->

# Data Containers {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_turtle.jpg"></div>

We almost never work with a single datum^[The word *data* is plural, datum is singular], rather we keep lots of data.  Moreover, the kinds of data are often heterogeneous, including categorical (Populations, Regions),  continuous (coordinates, rainfall, elevation), imagry (hyperspectral, LiDAR), and perhaps even genetic.  R has a very rich set of containers into which we can stuff our data as we work with it.  Here these container types are examined and the restrictions and benefits associated with each type are explained.

## Vectors

We have already seen several examples of several vectors in action (see the introduction to Numeric data types for example).  A vector of objects is simply a collection of them, often created using the `c()` function (*c* for combine).  Vectorized data is restricted to having homogeneous data types---you cannot mix character and numeric types in the same vector.  If you try to mix types, R will either coerce your data into a reasonable type

```{r}
x <- c(1,2,3)
x
y <- c(TRUE,TRUE,FALSE)
y
z <- c("I","am","not","a","looser")
z
```


or coearce them into one type that is amenable to all the types of data that you have given it.  In this example, a Logical, Character, Constant, and Function are combined resulting in a vector output of type Character.

```{r}
w <- c(TRUE, "1", pi, ls())
w
class(w)
```

Accessing elements within a vector are done using the square bracket `[]` notation.  All indices (for vectors and matrices) start at 1 (not zero as is the case for some languages).  Getting and setting the components within a vector are accomplished using numeric indices with the assignment operators just like we do for variables containing a single value.

```{r}
x
x[1] <- 2
x[3] <- 1
x
x[2]
```

A common type of vector is that of a sequences.  We use sequences all the time, to iterate through a list, to counting generations, etc.  There are a few ways to generate sequences, depending upon the step sequence.  For a sequence of whole numbers, the easiest is through the use of the colon operator.

```{r}
x <- 1:6
x
```

This provides a nice shorthand for getting the values X:Y from X to Y, inclusive.  It is also possible to go backwards using this operator, counting down from X to Y as in:

```{r}
x <- 5:2
x
```

The only constraint here is that we are limited to a step size of 1.0.  It is possible to use non-integers as the bounds, it will just count up by 1.0 each time.

```{r}
x <- 3.2:8.4
x
```

If you are interested in making a sequence with a step other than 1.0, you can use the `seq()` function.  If you do not provide a step value, it defaults to 1.0.

```{r}
y <- seq(1,6)
y
```

But if you do, it will use that instead.

```{r}
z <- seq(1,20,by=2)
z
```

It is also possible to create a vector of objects as repetitions using the `rep()` (for repeat) function.

```{r}
rep("Beetlejuice",3)
```

If you pass a vector of items to `rep()`, it can repeat these as either a vector being repeated (the default value)

```{r}
x <- c("No","Free","Lunch")
rep(x,time=3)
```

or as each item in the vector repeated.

```{r}
rep(x,each=3)
```


## Matrices 

A matrix is a 2- or higher dimensional container, most commonly used to store numeric data types.  There are some libraries that use matrices in more than two dimensions (rows and columns and sheets), though you will not run across them too often.  Here I restrict myself to only 2-dimensional matrices.

You can define a matrix by giving it a set of values and an indication of the number of rows and columns you want.  The easiest matrix to try is one with empty values:

```{r}
matrix(nrow=2, ncol=2)
```

Perhaps more useful is one that is pre-populated with values.

```{r}
matrix(1:4, nrow=2 )
```

Notice that here, there were four entries and I only specified the number of rows required.  By default the ‘filling-in' of the matrix will proceed down column (*by-column*).  In this example, we have the first column with the first two entries and the last two entries down the second column.  If you want it to fill by row, you can pass the optional argument

```{r}
matrix(1:4, nrow=2, byrow=TRUE)
```

and it will fill *by-row*.

When filling matrices, the default size and the size of the data being added to the matrix are critical.  For example, I can create a matrix as:

```{r}
Y <- matrix(c(1,2,3,4,5,6),ncol=2,byrow=TRUE)
Y
```

or 

```{r}
X <- matrix(c(1,2,3,4,5,6),nrow=2)
X
```

and both produce a similar matrix, only transposed.

```{r}
X == t(Y)
```

In the example above, the number of rows (or columns) was a clean multiple of the number of entries.  However, if it is not, R will fill in values.

```{r}
X <- matrix(c(1,2,3,4,5,6),ncol=4, byrow=TRUE)
```

Notice how you get a warning from the interpreter.  But that does not stop it from filling in the remaining slots by starting over in the sequence of numbers you passed to it.

```{r}
X
```

The dimensionality of a matrix (and `data.frame` as we will see shortly) is returned by the `dim()` function.  This will provide the number of rows and columns as a vector.

```{r}
dim(X)
```

Accessing elements to retrieve or set their values within a matrix is done using the square brackets just like for a vector but you need to give `[row,col]` indices.  Again, these are 1-based so that 

```{r}
X[1,3]
```

is the entry in the 1st row and 3rd column.

You can also use ‘slices' through a matrix to get the rows

```{r}
X[1,]
```

or columns

```{r}
X[,3]
```

of data. Here you just omit the index for the entity you want to span.  Notice that when you grab a slice, even if it is a column, is given as a vector.  

```{r}
length(X[,3])
```

You can grab a sub-matrix using slices if you give a range (or sequence) of indices.

```{r}
X[,2:3]
```

If you ask for values from a matrix that exceed its dimensions, R will give you an error.
 
```{r,eval=FALSE}
X[1,8]
```

```
## Error in X[1, 8] : subscript out of bounds
## Calls: <Anonymous> ... handle -> withCallingHandlers -> withVisible -> eval -> eval
## Execution halted
```

There are a few cool extensions of the `rep()` function that can be used to create matrices as well.  They are optional values that can be passed to the function. 

* `times=x`: This is the default option that was occupied by the ‘3' in the example above and represents the number of times that first argument will be repeated.  
* `each=x` This will take each element in the first argument are repeat them `each` times.   
* `length.out=x`: This make the result equal in length to `x`.  

In combination, these can be quite helpful.  Here is an example using numeric sequences in which it is necessary to find the index of all entries in a 3x2 matrix.  To make the indices, I bind two columns together using `cbind()`.  There is a matching row binding function, denoted as `rbind()` (perhaps not so surprisingly).  What is returned is a matrix

```{r}
indices <- cbind( rep(1:2, each=3), rep(1:3,times=2), rep(5,length.out=6)  )
indices
```


## Lists

A list is a type of vector but is indexed by ‘keys' rather than by numeric indices.  Moreover, lists can contain heterogeneous types of data (e.g., values of different `class`), which is not possible in a vector type.  For example, consider the list

```{r}
theList <- list( x=seq(2,40, by=2), dog=LETTERS[1:5], hasStyle=logical(5) )
summary(theList)
```

which is defined with a numeric, a character, and a logical component.  Each of these entries can be different in length as well as type.  Once defined, the entries may be observed as:

```{r}
theList
```

Once created, you can add variables to the list using the $-operator followed by the name of the key for the new entry.

```{r}
theList$my_favoriate_number <- 2.9 + 3i
```

or use double brackets and the name of the variable as a character string.

```{r}
theList[["lotto numbers"]] <- rpois(7,lambda=42)
```

The keys currently in the list are given by the `names()` function

```{r}
names(theList)
```

Getting and setting values within a list are done the same way using either the `$`-operator

```{r}
theList$x
theList$x[2] <- 42
theList$x
```

or the double brackets

```{r}
theList[["x"]]
```

or using a numeric index, but that numeric index is looks to the results of `names()` to figure out which key to use.

```{r}
theList[[2]]
```

The use of the double brackets in essence provides a direct link to the variable in the list whose name is second in the `names()` function (*dog* in this case).  If you want to access elements within that variable, then you add a second set of brackets on after the double ones.  

```{r}
theList[[1]][3]
```

This deviates from the matrix approach as well as from how we access entries in a `data.frame` (described next).  It is not a single square bracket with two indices, that gives you an error:

```{r eval=FALSE}
theList[1,3] 
```

```
## Error in theList[1, 3] : incorrect number of dimensions
## Calls: <Anonymous> ... handle -> withCallingHandlers -> withVisible -> eval -> eval
## Execution halted
```

List are rather robust objects that allow you to store a wide variety of data types (including nested lists).  Once you get the indexing scheme down, it they will provide nice solutions for many of your computational needs.

## Data Frames

The `data.frame` is the default data container in R.   It is analogous to both a spreadsheet, at least in the way that I have used spreadsheets in the past, as well as a database.  If you consider a single spreadsheet containing measurements and observations from your research, you may have many columns of data, each of which may be a different kind of data.  There may be `factors` representing designations such as species, regions, populations, sex, flower color, etc.  Other columns may contain numeric data types for items such as latitude, longitude, dbh, and nectar sugar content.  You may also have specialized columns such as dates collected, genetic loci, and any other information you may be collecting.  

On a spreadsheet, each column has a unified data type, either quantified with a value or as a missing value, `NA`, in each row.  Rows typically represent the sampling unit, perhaps individual or site, along which all of these various items have been measured or determined.  A `data.frame` is similar to this, at least conceptually.  You define a `data.frame` by designating the columns of data to be used.  You do not need to define all of them, more can be added later.  The values passed can be sequences, collections of values, or computed parameters.  For example:

```{r}
df <- data.frame( ID=1:5, Names=c("Bob","Alice","Vicki","John","Sarah"), Score=100 - rpois(5,lambda=10))
df
```

You can see that each column is a unified type of data and each row is equivalent to a record.  Additional data columns may be added to an existing data.frame as:

```{r}
df$Passed_Class <- c(TRUE,TRUE,TRUE,FALSE,TRUE)
```

Since we may have many (thousands?) of rows of observations, a `summary()` of the data.frame can provide a more compact description.

```{r}
summary(df)
```


We can add columns of data to the data.frame after the fact using the `$`-operator to indicate the column name. Depending upon the data type, the summary will provide an overview of what is there.

### Indexing Data Frames

You can access individual items within a `data.frame` by numeric index such as:

```{r}
df[1,3]
```

You can slide indices along rows (which return a new `data.frame` for you)

```{r}
df[1,]
```

or along columns (which give you a vector of data)

```{r}
df[,3]
```

or use the `$`-operator as you did for the list data type to get direct access to a either all the data or a specific subset therein.

```{r}
df$Names[3]
```

Indices are ordered just like for matrices, rows first then columns.  You can also pass a set of indices such as:

```{r}
df[1:3,]
```

It is also possible to use logical operators as indices.  Here I select only those names in the data.frame whose score was >90 and they passed popgen.

```{r}
df$Names[df$Score > 90 & df$Passed_Class==TRUE]
```


This is why `data.frame` objects are very database like.  They can contain lots of data and you can extract from them subsets that you need to work on.  This is a VERY important feature, one that is vital for reproducible research.  Keep you data in one and only one place.


<!--chapter:end:r_containers.rmd-->

# Programming {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_beach_ponies.jpg"></div>

One of the strengths of R as an analysis platform is that it is a language rather than a program.  With programs, such as SPSS & JMP, you are limited by the functionality that the designers thought would be necessary to meet the broadest audience.  In R, you can rely upon simple functions or you can create entire analysis and simulation programs de novo.  To do this, we need to dig into flow control and decision making processes, both of which you need for doing more in-depth programming.

## Function Writing

Here we look at how to create an R function.  Writing small functions like this is a huge benefit to you as an analyst and this is a great place to start.  A function in R is defined as:

```{r eval=FALSE}
function_name <- function( arguments ) { Stuff you want the function to do }
```

You define a function name for whatever you like and assign it the stuff to the right.  In R, the function named `function()` is a special one, it tells R that you are about to create a little routine and you want that set of code to be available to you for later use under the name of whatever you named it.  This allows a tremendous amount of flexibility as you develop your own set of routines and analyses for your work.  The part that actually does stuff is after the function call.  It may be that the function that you create does need some data (those are the arguments) or you may not need any input in to the function (in which you pass no arguments). It all depends upon what you are creating.

The key to understanding functions is that they are encapsulations of code—a shortcut for a sequence of instructions if you will not have to type over and over again.  The less typing you do, the lower the probability that you will have errors (and all code has errors).  

Here is an example of some code that I'm going to develop into a function.  This function will allow me to determine if one genotype could possibly be the offspring of the other genotype.  

```{r message=FALSE, warning=FALSE}
library(gstudio)
loc1 <- locus( c(128,130) )
loc2 <- locus( c(128,128) )
cat( loc1, loc2 )
```


We start out with two loci, a `128:130` heterozygote and a `128:128` homozygote.  These may represent repeat motifs at a microsatellite locus or some other co-dominant genotype.  First, I'll break the locus into a vector of genotypes.

```{r}
off.alleles <- alleles( loc1 )
off.alleles
mom.alleles <- alleles( loc2 )
mom.alleles
```


To be a valid potential offspring there should be at least one of the alleles in the parent that matches the allele in the offspring.  The `intersect()` function returns the set of values common to both vectors.

```{r}
shared <- intersect( off.alleles, mom.alleles )
shared
```


If it has at least one of the alleles present (it could have both if parent and offspring are both the same heterozygote) then you cannot exclude this individual as a potential offspring.  If there are no alleles in common, then the value returned is an empty vector.

```{r}
loc3 <- locus( c(132,132))
dad.alleles <- alleles( loc3 )
intersect( mom.alleles, dad.alleles )
```

This logic can be shoved into a function.  You have to wrap it into a set of curly brackets.  I use the length of the result from the intersect() to return from the function.  Potential values for 

```{r}
potential_offspring <- function( parent, offspring ) {
  off <- alleles( offspring )
  par <- alleles( loc2 )
  shared <- intersect( off, par )
  return( length( shared ) > 0 )
}
```
Now, you can call this function anytime you need, just passing it two genotypes.  If they can be offspring it returns TRUE, as in the comparison between 128:130 and 128:128 genotypes.

```{r}
potential_offspring(loc1, loc2)
```
And it returns FALSE for the comparison between 128:128 and 132:132.
```{r}
potential_offspring(loc2, loc3)
```

## Variable Scope

There is a lot more information on writing functions and we will get into that as we progress through the text.  However, it is important that I bring this up now.  The value assigned to a variable is defined by its scope.  Consider the following code
```{r}
x <- 10
```

and the function defined as 

```{r}
do_it <- function( x ) {
  x <- x + 10
  return( x )
}
```

When I call the function, the variable x that is the argument of the function is not the same variable that is in the environment that I assigned a value of 10.  The x in the function argument is what we call “local to that function" in that within the curly brackets that follow (and any number of curly brackets nested within those, the value of x is given whatever was passed to the function.

## Decision Making

We interact with our data in many ways and introspection of the values we have in the variables we are working with are of prime importance.  Decision making in your code is where you evaluate your data and make a choice of outcomes based upon some criteria.  Here is some example data that we can use as we explore the basics of `if()`, `if(){} else{}`, and `if(){} elif(){} else{}` coding patterns.

### The if Pattern  

The most basic version of decision making is asking a single question and if the answer is TRUE then do something.  The `if(){}` function does this and has the form

```{r eval=FALSE}
if( CRITERIA ) {
	DO_SOMETHING
}
```

You pass a logical statement (or something that can be coerced into a logical type) to the function as the CRITERIA and if it evaluates to `TRUE`, then the contents of the `DO_SOMETHING` are executed.  If the value of `CRITERIA` is not `TRUE` the `DO_SOMETHING` is skipped entirely—it is not even seen by the interpreter.  

Here we can test this out using the loci defined above along with the is_heterozygote() function.  This function takes one or more locus objects and returns TRUE/FALSE if they are or are not a heterozygote.

```{r}
is_heterozygote( c(loc1, loc2) )
```

If we shove that function into the `if()` parameters we can use its evaluation of the heterozygous state of the locus to do something interesting, say tell us it is a heterozygote—it is admittedly a contrived example, but hey you try to make easy examples, it is not easy.

```{r}
if( is_heterozygote(loc1) ){
  print("It's a het!")
}
```

If the `is_heterozygote()` function returns a value of FALSE, then the contents of the `if()` function (the stuff within the curly brackets is skipped entirely.

```{r}
if( is_heterozygote(loc2) ){
  print("It's a het!")
}
```

Notice, there was no indication of any of that code inside the curly brackets.
The if-else Pattern
If there are more than on thing you want to potentially do when making a decision, you can add an else clause after the if pattern.  Here if is_heterozygote() returns FALSE, the contents of the else{} clause will be executed.  Here is the heterozygote example

```{r}
if( is_heterozygote(loc1) ) {
  cat(loc1, "is a heterozygote")
} else {
  cat(loc1, "is a homozygote")
}
```

and the homozygote one

```{r}
if( is_heterozygote(loc2) ) {
  cat(loc2, "is a heterozygote")
} else {
  cat(loc2, "is a homozygote")
}
```

There is a slightly shorter version of this that is available for the lazy programmer and lets be honest, all programmers are lazy and the more you can accomplish with fewer strokes on the keyboard the better (this is how we got emacs and vim).  I generally don't teach the shortcuts up front, but this one is short and readily apparent so it may be more helpful than confusing.  The `ifelse()` function has three parts, the condition, the result if `TRUE`, and the result if `FALSE`.

```{r}
ans <- ifelse( is_heterozygote( c(loc1, loc2)) , "heterozygote", "Not")
ans
```

So iterating through the x vector, the condition `x>0` is evaluated and if TRUE the sqrt() of the value is returned, else the NA is given. It is compact and easy to use so you may run into it often.

## The `if-else` Pattern

It is possible to test many conditions in a single sequence by stringing together else-if conditions.  The point that is important here is that the first condition that evaluates to TRUE will be executed and all remaining ones will be skipped, even if they also are logically TRUE.  This means that it is important to figure out the proper order of asking your conditions.  Here is an example function that determines if none, one, or both of the genotypes passed to it are heterozygotes.  By default, I step through every one of the potential options of available on this comparison.    
1. The first is a heterozygote and the second one isn't
2. The first one isn't and the second one is
3. Both are heterozygotes
4. The last state (both are not) 

Here is the function.

```{r}
which_is_het <- function( A, B) {
  if( is_heterozygote(A) & !is_heterozygote(B) ) {
    print("First is heterozygote")
  } else if( !is_heterozygote(A) & is_heterozygote(B) ){
    print("Second is heterozygote")
  } else if( is_heterozygote(A) & is_heterozygote(B) ){
    print("Both are heterozygotes")
  } else {
    print( "Neither are heterozygotes")
  }
}
```

It is possible that the order of these CRITERIA could be changed, the important thing to remember is that the sequence of if - else if - else if etc. will terminate the very first time one of the CRITERIA is evaluated to be `TRUE`.

## Flow Control

Flow control is the process of iterating across objects and perhaps doing operations on those objects.  The R language has several mechanisms that you can use to control the flow of a script or bit of code.  

### The for() Loop

```{r}
x <- c(3,8,5,4,6)
x
```

You can iterate through this vector using a for() loop.  This is a simple function that has the form:

```{r eval=FALSE}
for( SOME_SEQUENCE ){
  DO_SOMETHING
}
```


Where the `SOME_SEQUENCE` component is a sequence of values either specified OR calculated and the `DO_SOMETHING` is the thing you want to do with each of the values in the sequence.  Usually, there is a variable defined in the `SOME_SEQUENCE` component and the value of that variable is used.  Here are a few examples. The first goes through the existing vector directly and assigns (in sequential order) the entries of ‘x' to the variable val. We can then do whatever we want with the value in val (though if we change it, nothing happens to the original x vector).

```{r}
for( val in x ){
  print(val)
}
```

We can also specify a sequence directly and then use it as an index.  Here I use an index variable named i to take on the integer seqeunce equal in length to the length of the original `x` variable.  Then I can iterate through the original vector and use that index variable to grab the value I want.

```{r}
for( i in 1:length(x)){
  print( x[i] )
}
```


Both give us the same output, namely a way to go through the variable `x`. However, there may be a need to use the latter approach in your calculations.  For example, perhaps I want to do some other operation on the values.  In this very contrived example that follows, I want to perform operations on the values in `x` depending on if they are even or odd.  For the odd ones, I add the corresponding value in `y` and if not I subtract it.  Sure, this is totally contrived and I cannot think of a reason why I would be doing this, but if I need to know what index (row, column or whatever) an entry is during the iteration process, then I need to use this approach over the `for( val in x)` approach.

```{r}
y <- 1:5
for( i in 1:length(x)){
  if( x[i] %% 2)
    print( x[i] + y[i])
  else
    print( x[i] - y[i] )
}
```

### Short Circuiting the Loop

It is possible to short circuit the looping process using the keywords next and break, though in my programming style, I consider their use in my source files as evidence of inelegant code.  That said, you may need them on occasion.  

The next keyword basically stops all commands after that during the current iteration of the loop.  It does not terminate the loop itself, it just stops the commands that follow it this time through.  Here is an example that uses the modulus operator, `%%` (e.g., the remainder after division), to print out only those numbers that are divisible by three.

```{r}
for( i in 1:20 ){
  if( i %% 3 )
    next
  cat("The value of i =",i,"\n")
}
```

The use of break to exit the loop entirely is perhaps more commonly encountered.  When this keyword is encountered, the loop terminates immediately, as if it reached the send of the sequence.  

```{r}
for( i in 1:10){
  if( i > 2 )
    break
  cat("The value of i=",i,"\n")
}
```


<!--chapter:end:r_programming.rmd-->

# Graphics {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_fallleaves.jpg"></div>

One of the most critical features of data analysis is the ability to present your results in a logical and meaningful fashion.  R has built-in functions that can provide you graphical output that will suffice for your understanding and interpretations.  However, there are also third-party packages that make some truly beautiful output.  In this section, both built-in graphics and graphical output from the ggplot2 library are explained and highlighted.  

For this section we are going to use the venerable iris dataset that was used by Anderson (1935) and Fisher (1936). These data are measurements of four morphological variables (Sepal Length, Sepal Width, Petal Length, and Petal Width) measured on fifty individual iris plants from three recognized species.  Here is a summary of this data set.

```{r}
summary(iris)
```

I will also provide examples using two different plotting approaches. R has a robust set of built-in graphical routines that you can use.  However, the development community has also provided several additional graphics libraries available for creating output.  The one I prefer is `ggplot2` written by Hadley Wickham.  His approach in designing this library a philosophy of graphics depicted in Leland Wilkson's (2005) book *The Grammar of Graphics*. As I understand it, the idea is that a graphical display consists of several layers of information. These layers may include:

* The underlying data.  
* Mapping of the data onto one or more axes.  
* Geometric representations of data as points, lines, and/or areas.  
* Transformations of the axes into different coordinate spaces (e.g., cartesian, polar, etc.) or the data onto different scales (e.g., logrithmic)  
* Specification of subplots.

In the normal plotting routines discussed before, configuration of these layers were specified as arguments passed to the plotting function (`plot()`, `boxplot()`, etc.). The ggplot2 library takes a different approach, allowing you to specify these components separately and literally add them together like components of a linear model.

## Univariate Data

Univariate data can represent either counts (e.g., integers) of items or vectors of data that have decimal components (e.g., frequency distributions, etc.).  If the sampling units are discrete, then a the `barplot()` function can make a nice visual representation.  Here is an example using some fictions data.

```{r}
x <- c(2, 3, 6, 3, 2, 4)
names(x) <- c("Bob", "Alice", "Jane", "Fred", "Barney", "Lucy")
x
```

From this, a barplot can be constructed where the x-axis has discrete entities for each name and the y-axis represents the magnitude of whatever it is we are measuring.

```{r}
barplot(x, xlab = "People", ylab = "Thinn-A-Ma-Jigs")
```

Notice here that I included labels for both the x- and y-axes.  You do this by inserting these arguments into the parameters passed to the `barplot()` (it is the same for all built-in plotting as we will see below).

To use `ggplot` for displaying this, you have to present the data in a slightly different way.  There are shortcuts using the `qplot()` function but I prefer to use the more verbose approach.  Here is how we would plot the same output using ggplot.

```{r warning=FALSE,message=FALSE}
library(ggplot2)
df <- data.frame(Thinn_A_Ma_Jigs=x,names=names(x))
ggplot( df, aes(x=names,y=Thinn_A_Ma_Jigs)) + geom_bar(stat="identity")
```

There are a couple of things to point out about this.   

1. This is a compound statement, the plotting commands are literally 'added together'.
2. The `ggplot()` function specifies a `data.frame` from which your data will be extracted.
3. The `aes()` component within the `ggplot()` function describes the aesthetics of the plot.  This is how you designate which variable columns in the `data.frame` will be on the x- and y- axes, colors, plot shapes, categories, etc.
4. To the basic description of the data and aesthetics, a geometry is added, that of a barplot.  If necessary, you can pass additional information to the `geom_bar()` function.  In this case, I had to tell it that you use the raw data `stat='identity'` instead of trying to summarize many observations.  
5. The axis labels are taken directly from the names of the columns in the `data.frame`.

If the data on the x-axis is not discrete but measured on a single variable, then the `hist()` function can take the data and categorize it into groups based upon the density of the underlying data.

```{r}
h <- hist(iris$Sepal.Length)
```

The `hist()` function itself returns a bit of information that may be of interest.  It is not necessary that you capture these data, R will ignore it if you do not assign it to a variable.  What is of interest though is that it returns a `list()` object with the following keys:

```{r}
names(h)
```

The data within these keys can be accessed directly as:

```{r}
h
```

and provide the basic information on how to construct the histogram as shown below.  The values defined for the break locations for each of the bars and the midpoints are the default ones for the `hist()` function.  If you wish, you can provide the number of breaks in the data, the location of the breaks, including the lowest and highest terms, plot as a count versus a frequency, etc.  

The corresponding `ggplot` approach is

```{r}
ggplot( iris, aes(x=Sepal.Length)) + geom_bar()
```

which also takes a few liberties with the default binning parameters.  

By default, both the R and `ggplot` approches make a stab at creating enough of the meta data around the plot to make it somewhat useful.  However, it is helpful if we add a bit to that and create graphics that suck just a little bit less. There are many options that we can add to the standard plot command to make it more informative and/or appropriate for your use.  The most common ones include:

* `xlab` Change the label on the x-axis  
* `ylab` Change the lable on the y-axis 
* `main` Change the title of the graph
* `col` Either a single value, or a vector of values, indicating the color to be plot.
* `xlim/ylim` The limits for the x- and y-axes.

Additional options are given in tabular form in the next section.  

Here is another plot of the same data but spiffied up a bit. 

```{r}
x <- hist( iris$Sepal.Length,xlab="Sepal Length (cm)", 
           ylab="Count", main="", col="lightblue")
```

A few things to notice:

1. The main title over the top of the image was visually removed by assigning an empty characters string to it.   
2. The color parameter refers to the fill color of the bars, not the border color.

In `ggplot` we add those additional components either to the geometry where it is defined (e.g., the color in the geom_bar) or to the overall plot (as in the addition of labels on the axes).

```{r}
ggplot( iris, aes(x=Sepal.Length)) + geom_bar(fill="lightblue") + 
  xlab("Sepal Length (cm)") + ylab("Count")
```


## Plotting Parameters

There are several parameters common to the various plotting functions used in the basic R plotting functions.  Here is a short table of the most commonly used ones.

Command  |  Usage           | Description
---------|------------------|-------------------------------------------------
bg       | bg="white"       | Sets the background color for the entire figure.
bty      | bty="n"	        | Sets the style of the box type around the graph. Useful values are "o" for complete box (the default), "l", "7", "c", "u", "]" which will make a box with sides around the plot area resembling the upper case version of these letters, and "n" for no box.
cex      | cex=2            | Magnifies the default font size by the corresponding factor.
cex.axis | cex.axis=2       | Magnifies the font size on the axes.
col      | col="blue"       | Sets the plot color for the points, lines, etc.
fg       | fg="blue"        | Sets the foreground color for the image.
lty      | lty=1            | Sets the type of line as 0-none, 1-solid, 2-dashed, 3-dotted, etc.
lwd      | lwd=3            | Specifies the width of the line.
main     | main="title"     | Sets the title to be displayed over the top of the graph.
mfrow    | mfrow=c(2,2)     | Creates a ‘matrix' of plots in a single figure.
pch      | pch=16	          | Sets the type of symbol to be used in a scatter plot.
sub      | sub="subtitle"   | Sets the subtitle under the main title on the graph.
type     | type="l"         | Specifies the type of the graph to be shown using a generic plot() command.  Types are “p" for points (default), “l" for lines, and “b" for both.
xlab     | xlab="Size (m)"  | Sets the label attached to the x-axis.
ylab     | ylab="Frequency" | Sets the label attached to the y-axis.


## Density Plots

Another way of looking at the distribution of univariate data is through the use of the `density()` function. This function takes a vector of values and creates a probability density function of it returning an object of class "density". 

```{r}
d <- density( iris$Sepal.Length) 
attributes(d)
```

The `density()` function takes the univariate data and fits its density (internally represented by `d$y`) along the range of values (in `d$x`).

A plot can be produced using d as the variables as follows (with a bit of extra plotting parameters as depicted in the previous table).

```{r}
plot( d, col = "red", lwd = 2, 
      xlab = "Value of X", ylab = "Frequency", 
      bty = "n", main = "Density Plot")
```

The corresponding `ggplot` approach is:

```{r}
ggplot( iris, aes(x=Sepal.Length)) + geom_density()
```


## Overlaying Plots

The data in the previous density plot represents the sepal lengths across all three iris species.  It may be more useful to plot them as the density of each species instead of combined.  Overlaying plots is a pretty easy feature built into the default R plotting functionalities.

Lets look at the data and see if the mean length differs across species. I do this using the `by()` command, which takes three parameters; the first parameter is the raw data you want to examine, the second one is the way you would like to partition that data, and the third one is the function you want to call on those groupings of raw data.  This general form

```{r eval=FALSE}
by( data, grouping, function )
```

mixing both data and functions to be used, is pretty common in R and you will see it over and over again.  The generic call below asks to take the sepal data and partition it by species and estimate the mean.

```{r}
by( iris$Sepal.Length, iris$Species, mean)
```

So there may be differences.  Lets pull the data out and create density plots for each.

```{r}
d.setosa <- iris$Sepal.Length[ iris$Species=="setosa" ]
d.versicolor <- iris$Sepal.Length[ iris$Species=="versicolor" ]
d.virginica <- iris$Sepal.Length[ iris$Species=="virginica" ]
d.se <- density( d.setosa )
d.ve <- density( d.versicolor )
d.vi <- density( d.virginica )
```

I can now plot the densities independently.  After the first `plot()` function is called, you can add to it by using `lines()` or `points()` function calls.  They will overlay the subsequent plots over the first one.  One of the things you need to be careful of is that you need to make sure the x- and y-axes are properly scaled so that subsequent calls to `points()` or `lines()` does not plot stuff outside the boundaries of your initial plot.

Here I plot the setosa data first, specify the `xlim` (limits of the x-axis), set the color, and labels.  On subsequent plotting calls, I do not specify labels but set alternative colors.  Then a nice legend is placed on the graph, the coordinates of which are specified on the values of the x- and y-axis (I also dislike the boxes around graphic so I remove them as well with `bty="n"`).

```{r}
plot(d.se,xlim=c(4,8),col="red", lwd=2, bty="n", 
     xlab="Sepal Length (cm)", main="Sepal Lengths")
lines( d.ve, xlim=c(4,8), col="green",lwd=2, bty="n")
lines( d.vi, xlim=c(4,8), col="blue", lwd=2, bty="n")
legend( 6.5,1.1,c("I. setosa", "I. versicolor", "I. virginica"), 
        col=c("red","green","blue"), lwd=2,bty="n")
```

A lot of that background material is unnecessary using `geom_density()` because we can specify to the plotting commands that the data are partitioned by the values contained in the `Species` column of the `data.frame`. This allows us to make this plot

```{r}
ggplot(iris, aes(Sepal.Length,color=Species)) + geom_density()
```

or a correpsonding one using `fill=Species` instead (I set the alpha transparency to allow you to see the plots in the background).

```{r}
ggplot(iris, aes(Sepal.Length,fill=Species)) + geom_density( alpha=0.8)
```

We can use the `barplot()` function here as well and either stack or stagger the density of sepal lengths using discrete bars.   

Here I make a matrix of bins using the `hist()` function with a specified set of breaks and then use it to plot discrete bin counts using the `barplot()` function.  I include this example here because there are times when you want to produce a stacked bar plot (rarely) or a staggered barplot (more common) for some univariate data source and I always forget how to specifically do that.

```{r}
breaks <- seq(4,8,by=0.2)
h.se <- hist(d.setosa, breaks=breaks, plot = FALSE)
h.ve <- hist(d.versicolor, breaks=breaks, plot=FALSE)
h.vi <- hist(d.virginica, breaks=breaks, plot=FALSE)
vals <- rbind( h.se$counts, h.ve$counts, h.vi$counts )
rownames(vals) <- levels(iris$Species)
colnames(vals) <- breaks[1:20]
vals
```

The matrix of data 

```{r}
barplot(vals,xlab="Sepal Length", ylab="Frequency")
```

Stacked barplots may or may not be that informative, depending upon the complexity of your underlying data.  It is helpful though to be able to stagger them.  In the basic `barplot()` function allows you to specify the bars to be spaced beside each other as:

```{r}
barplot(vals,xlab="Sepal Length", ylab="Frequency", col=c("red","green","blue"), beside=TRUE)
legend(60, 10, c("I. setosa", "I. versicolor", "I. virginica"), fill = c("red", "green", "blue"), bty="n")
```

For plotting onto a barplot object, the x-axis number is not based upon the labels on the x-axis. It is an integer that relates to the number of bar-widths and separations. In this example, there are three bars for each category plus one separator (e.g., the area between two categories). So I had to plot the legend at 60 units on the x-coordinate, which puts it just after the 15th category (e.g., 15*4=60).


We can do the same thing with `geom_histogram()`, though again with a bit less typing involved.  Here is the raw plot (which by default stacks just like in the `barplot()` example)

```{r}
ggplot( iris, aes(x=Sepal.Length, fill=Species)) + geom_histogram()
```

and the correspondingly staggered plot with bars positioned next to eachother.

```{r}
ggplot( iris, aes(x=Sepal.Length, fill=Species)) + geom_histogram(position="dodge")
```

Notice how we add the fill color to the `aes()` function because the categories for assigning colors will be extracted from the `data.frame`, whereas when we just wanted to set them all light blue, the color is specified outside the `aes()` function.  Also notice when we specify something to the `aes()` command in this way, we do not quote the name of the column, we call it just as if it were a normal variable.


## Boxplots

Boxplots are a display of distributions of continuous data within categories, much like what we displayed in the previous staggered barplot.  It is often the case that boxplots depict differences in mean values while giving some indication of the dispersal around that mean.  This can be done in the staggered barplot as above but not quite as specifically.

The default display for boxplots in R provides the requires a factor and a numeric data type.  The factor will be used as the categories for the x-axis on which the following visual components will be displayed:

* The median of the data for each factor,  
* An optional 5% confidence interval (shown by passing notch=TRUE) around the mean.
* The inner 50th percentile of the data,  
* Outlier data points.

Here is how the sepal length can be plot as a function of the Iris species.

```{r}
boxplot(Sepal.Length ~ Species, data=iris, notch=TRUE, 
        xlab="Species", ylab="Sepal Length (cm)", 
        frame.plot=FALSE)
```

A couple of things should be noted on this plot call.  First, I used a ‘functional' form for the relationship between the continuous variable (`Sepal.Length`) and the factor (`Species`) with the response variable indicated on the left side of the tilde and the predictor variables on the right side.  It is just as reasonable to use `boxplot(Species,Sepal.Length, ...)` in the call as x- and y- variables in the first two positions.  However, it reads a bit better to do it as a function, like a regression.  Also, it should be noted that I added the optional term, `data=iris`, to the call.  This allowed me to reference the columns in the iris data.frame without the dollar sign notation.  Without that I would have had to write `boxplot( data$Sepal.Length ~ data$Species, ...)`, a much more verbose way of doing it but most programmers are relatively lazy and anything they can do to get more functionality out of less typing...  Finally, I set `frame.plot=FALSE` to remove the box around the plot since the `bty="n"` does not work on boxplots (big shout out for consistency!).  I don't know why these are different, I just hate the box.

The corresponding `ggplot` approach is the same as the `boxplot()` one, we just have to specify the `notch=TRUE` in the `geom_boxplot()` function.

```{r}
ggplot(iris, aes(x=Species, y=Sepal.Length)) + 
  geom_boxplot(notch = TRUE) + 
  xlab("Iris Species") + ylab("Sepal Width (cm)")
```

## Scatter Plots

With two sets of continuous data, we can produce scatter plots.  These consist of 2- (or 3-) coordinates where we can plot our data.  With the addition of alternative colors (`col=`) and/or plot shapes (`pch=`) passed to the generic plot() command, we can make really informative graphical output. Here I plot two characteristics of the iris data set, sepal length and width, and use the species to indicate alternative symbols.

```{r}
plot( iris$Sepal.Length, iris$Sepal.Width, 
      pch=as.numeric(iris$Species), bty="n", 
      xlab="Sepal Length (cm)", ylab="Sepal Width (cm)")
legend( 6.75, 4.3, c("I. setosa", "I. versicolor", "I. virginica"), 
        pch=1:3, bty="n")
```

The call to `pch=` that I used coerced a factor into a numeric data type.  By default, this will create a numeric sequence (starting at 1) for all levels of the factor, including ones that may not be in the data you are doing the conversion on.  The parameter passed to `pch` is an integer that determines the shape of the symbol being plot. I typically forget which numbers correspond to which symbols (there are 25 of them in total ) and have to look them up when I need them.  One of the easiest ways is just to plot them as:

```{r}
plot(1:25,1:25,pch=1:25, bty="n", xlab="pch", ylab="pch")
```

which produces the following figure where you can choose the appropriate symbols for your plotting needs.

A scatter plot in `ggplot` is created using the `geom_point()` layer.

```{r}
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width, shape=Species)) + geom_point()
```

and the shapes are specified automatically.


## Multiple Plots

The last figure was a bit confusing.  You can see a few points where the three species have the same measurements for both sepal length and width—both *I. versicolor* and *I. virginica* have examples with sepal widths of 3.0cm and lengths of 6.7cm.  For some of these it would be difficult to determine using a graphical output like this if there were more overlap.  In this case, it may be a more informative approach to make plots for each species rather than using different symbols.  

In `ggplot`, there is a facet geometry that can be added to a plot that can pull apart the individual species plots (in this case) and plot them either next to each other (if you are looking at y-axis differences), on top of each other (for x-axis comparisons), or as a grid (for both x- and y- axis comparisons).  This layer is added to the plot using `facet_grid()` and take a functional argument (as I used for the boxplot example above) on which factors will define rows and columns of plots.  Here is an example where I stack plots by species.

```{r}
ggplot( iris, aes(x=Sepal.Width, y=Sepal.Length ) ) + 
  geom_point() + facet_grid(Species~.) + 
  xlab("Sepal Width (cm)") + ylab("Sepal Length")
```

The `Species ~ .` means that the rows will be defined by the levels indicatd in `iris$Species` and all the plotting (the period part) will be done as columns.  If you have more than one factor, you can specify `RowFactor ~ ColumnFactor` and it will make the corresponding grid. 

I would argue that this is a much more intuitive display of differences in width than the previous plot.

You can achieve a similar effect using built-in plotting functions as well.  To do this, we need to mess around a bit with the plotting attributes.  These are default parameters (hence the name `par`) for plotting that are used each time we make a new graphic.  Here are all of them (consult with the previous Table for some of the meanings).

```{r}
names( par() )
```

To create multiple plots on one graphic, we need to modify either the mfrow or mfcol property (either will do).  They represent the number of figures to plot as a 2-integer vector for rows and columns.  By default, it is set to

```{r}
par()$mfrow
```

because there is only 1 row and 1 column in the plot.  To change this, we simply pass a new value to the `par()` command and then do the plotting.  In the following example, I plot the results of a linear model (`lm()`) function call. This returns four plots looking at the normality of the data, residuals, etc.  This time, instead of seeing them one at a time, I'm going to plot all four into one figure.

```{r}
par(mfrow=c(2,2))
plot( lm( Sepal.Length ~ Sepal.Width, data=iris))
```


## Color Palettes

The default plotting colors in R are black and white.  However, there is a rich set of colors available for your plotting needs.  The easiest set are named colors.  At the time of this writing, there are 

```{r}
length(colors())
```

different colors available for your use. Not all are distinct as some overlap.  However the benefit of these colors is that they have specific names, making it easier for you to remember than RGB or hex representations.  Here are a random set of 20 color names.

```{r}
colors()[ sample.int( length(colors()), size=20) ]
```

To use these colors you can call them by name in the col= option to a plot.  Here is an example where I define three named colors and then coerce the `iris$Species` variable into an integer to select the color by species and plot it in a scatter plot (another version of the `pch=` example previously).

```{r}
colors <- c("royalblue1", "orange1", "green3")
cols <- colors[ iris$Species ]
plot( Sepal.Width ~ Sepal.Length, data=iris, 
      col=cols, xlab="Sepal Length (cm)", ylab="Sepal Width (cm)", 
      bty="n", pch=16)
```


There is a lot of colors to choose from, and you are undoubtedly able to fit any color scheme on any presentation you may encounter.

In addition to named colors, there are color palettes available for you to grab the hex value for individual color along a pre-defined gradient.  These colors ramps are:

* `rainbow()`:  A palette covering the visible spectrum  
* `heat.colors()`: A palette ranging from red, through orange and yellow, to white.  
* `terrain.colors()`: A palette for plotting topography with lower values as green and increasing through yellow, brown, and finally white.   
* `topo.colors()`: A palette starting at dark blue (water) and going through green (land) and yellow to beige (mountains).   
* `cm.colors()`: A palette going from light blue through white to pink.  

and are displayed in the following figure.

The individual palette functions return the hex value for an equally separated number of colors along the palette, you only need to ask for the number of colors you are requesting.  Here is an example from the `rainbow()` function.

```{r}
rainbow(10)
```


## RColorBrewer

The ggplot2 library has a slightly more interesting color palette, but it too is often a bit lacking.  You have total control over the colors you produce and can specify them as RGB, hex (e.g. the way the web makes colors), or CMYK.  You can also define your own color palettes.  

To start, the RColorBrewer library provides several palettes for plotting either quantitative, qualitative, or divergent data.  These three groups and the names of individual color palettes can be viewed as:

```{r}
library(RColorBrewer)
display.brewer.all()
```

In ggplot, we can use these palettes as follows.

You can change the default palette in ggplot output adding the `scale_color_brewer()` (and `scale_fill_brewer()` if you are doing fill colors like in a barplot) to the plot.  Here is an example where I change the default palette to the 6th divergent palette.

```{r}
p <- ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width, color=Species)) + geom_point() 
p <- p + xlab("Sepal Length (cm)") + ylab("Sepal Width (cm)")
p + scale_color_brewer(type="div", palette=6)
```


## Saving Imagery

Creating a graphic for display on your screen is only the first step.  For it to be truly useful, you need to save it to a file, often with specific parameters such as DPI or image size, so that you can use it outside of R.  

For normal R graphics, you can copy the current graphical device to a file using `dev.copy()` or you can create a graphic file object and plot directly to it instead of the display.  In both cases, you will need to make sure that you have your graphic formatted the way you like.  Once you have the file configured the way you like, decide on the image format you will want.  Common ones are jpg, png and tiff.  Here are the function arguments for each of the formats.

```
jpeg(filename = "Rplot%03d.jpeg",
     width = 480, height = 480, units = "px", pointsize = 12,
     quality = 75,
     bg = "white", res = NA, ...,
     type = c("cairo", "Xlib", "quartz"), antialias)

png(filename = "Rplot%03d.png",
    width = 480, height = 480, units = "px", pointsize = 12,
     bg = "white",  res = NA, ...,
    type = c("cairo", "cairo-png", "Xlib", "quartz"), antialias)

tiff(filename = "Rplot%03d.tiff",
     width = 480, height = 480, units = "px", pointsize = 12,
     compression = c("none", "rle", "lzw", "jpeg", "zip", "lzw+p", "zip+p"),
     bg = "white", res = NA,  ...,
     type = c("cairo", "Xlib", "quartz"), antialias)
```
     
These are the default values.  If you need a transparent background, it is easiest to use the png file and set `bg="transparent"`.  If you are producing an image for a publication, you will most likely be given specific parameters about the image format in terms of DPI, pointsize, dimensions, etc.  They can be set here.

To create the image, call the appropriate function above, and a file will be created for your plotting.  You must then call the plotting functions to create your graphic.  Instead of showing up in a graphical display window, they will instead be plot to the file. When done you must tell R that your plotting is now finished and it can close the graphics file.  This is done by calling dev.off().  Here is an example workflow saving an image of a scatter plot with a smoothed line through it using ggplot.

```{r eval=FALSE}
png( filename = "MyCoolGraphic.png", quality=300, bg="transparent", width=1080, height=1080)
ggplot( df, aes(x=PredictorVar, y=ResponseVar)) + geom_point() + stat_smooth(method="loess") + theme_bw()
dev.off()
```

This will create a file MyCoolGraphic.png saved in the same folder as the current working directory.  

To use `dev.copy()` instead, you first make the graphic and then copy the current graphic to a file.  For this example, it would be:

```{r eval=FALSE}
ggplot( df, aes(x=PredictorVar, y=ResponseVar)) + geom_point() + stat_smooth(method="loess") + theme_bw()
dev.copy(device=png,file="MyDevCopyGraphic.png")
dev.off()
```

In general, I prefer the first method as it allows me to specify the specifics of the file in an easier fashion than the second approach.

If you are using ggplot2 graphics, there is a built-in function ggsave() that can also be used to save your currently displaying graphic (or others if you so specify) to file.  Here are the specifics on that function call.

```{r eval=FALSE}
ggsave(filename = default_name(plot), plot = last_plot(),
  device = default_device(filename), path = NULL, scale = 1,
  width = par("din")[1], height = par("din")[2], units = c("in", "cm",
  "mm"), dpi = 300, limitsize = TRUE, ...)
```

Either approach will allow you to produce publication quality graphics.


## Interactive Graphics

There are several other libraries available for creating useful plots in R.  The most dynamical ones are based on javascript and as such are restricted to web or ebook displays.  Throughout this book, I will be showing how you can use various libraries to create interactive content as well as providing some 'widgets' that will demonstrate how some feature of popualtion or landscape structure changes under a specified model.  These widgets are also written in R and are hosted online as a [shiny](http://shiny.rstudio.org) application.  I would encourage you to look into creating dynamcial reports using these techniques, they *can* facilitate a better understanding of the underlying biology than static approaches.


<!--chapter:end:graphics.rmd-->

# Reproducable Research  {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_cactus.jpg"></div>

As a scientist, the integrity of your data is your #1 priority. In fact, I argue that your ability to write advertisements of your research, which is what manuscripts really are, is secondary to the data that you collect and provide the community. As a practitioner, you get more credit and are providing more to your research community if your data are not only collected with integrity but able to be used by others. In the near future, the DOI for your data sets will be as important as the ones for your manuscripts.

For your data to have the most impact, it needs to be both available and valid. One way to help keep your data intact is to put it in one place and only pull from there to do analyses when you need to do them. You should ALWAYS keep your data and your analyses as close to each other as possible. Once they get separated, things go crazy. For example, consider the case where you send your colleague a copy of your data in a spreadsheet. Now you have two potentially different (and divergent) representations of your data. Any change that either you or your colleague make to the data is not mirrored. Which one is correct? What happens when you put a copy on the laboratory computer and do some analyses with it, then copy it to your ‘backup drive' (or even worse a thumb drive) and then put a copy of it on your laptop. Then when you get home, a copy goes on your home computer. How many different copies of your data are there in the world? Which one is correct? Somewhere down the road, you will need to recreate a figure for a manuscript, give someone a copy of your data, reanalyze your raw data with a new approach that you run across, etc. Which copy of the data that you haven't worked on for the last 6+ months is the right one? Which one is valid? Even if you use an awesome naming scheme like “Data.last.version.xls", “Data.really.last.version.xls", “I.totally.mean.last.version.xls" etc.. Not going to serve you well in the long run. All of this is terrifying from the perspective of reproducibility.

One way to keep data available with integrity is to have it located in one, and only one, location. If you use Google Drive, this is a pretty easy thing to do. You can keep your spreadsheet online, and then just access the data for your analysis when you need to.

## Keeping Data on Google Drive

Here is how you set that up. Upload your data into a Google Spreadsheet and then select File $\to$ Publish to the Web...  

You will be given a dialog (as shown here) that allows you to determine how much of the spreadsheet to share.  You can publish a particular sheet or the entire document.  For our purposes, it is probably best to keep our data on a single sheet within the file and publish them independently.  You can also determine the format you want the data presented in, at present the following are available:

* Web page - An html equivalent of the page  
* Comma-separated values (*.csv) - This is the preferred format for our purposes.  
* Tab-separated values (*.tsv) - Columns separated by tabs.  This is also an option for our needs.  
* PDF Document (*.pdf) - Portable Document Formats for general dissemination.  
* Microsoft Excel (*.xlsx) - Excel formats are great for excel but not openly available.
* OpenDocument Spreadsheet (*.ods) - Spreadsheet file format from OpenOffice.

```{r echo=FALSE}
knitr::include_graphics("./media/GoogleShare.png")
```

It is in my opinion preferable to keep your data in a format that is accessible to the broadest audience.  It is not a secret that programs change file formats and old proprietary binary formats can become corrupted and unreadable.  It would be a shame to keep the data that you work so hard on in a format that you can no longer access or that your potential future colleagues and collaborators cannot get access to...

For this example, the data from the first sheet are provided as a *.csv file and can be accessed by that long URL.

In R, we can access this spreadsheet directly through that URL, which allows us to have our data located in a single centralized location and accessible by a broad range of applications and uses, none of which have the ability to duplicate your data (thereby causing a fork) or make any changes to its internal structure.  Here is how it is done.

1. Load in the `gstudio` library
2. On GoogleSheets, publish your spreadsheet as a csv file.
3. Use the `read_population()` function to grab the data.

```{r message=FALSE}
library(gstudio)
link <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vS2fBVXzjaFTsbYX10mpTyugSLpzjyGLYE_UmLDqTyeJnzsayFgBQjnlseXgR72XE1gYc4s5oTiZLOz/pub?gid=0&single=true&output=csv"
cornus <- read_population( link, type="column", locus.columns = 5:14, sep=",")
summary(cornus)
```

The key here is that the contents of the data file are available but you cannot manipulate it. If you do need to make changes to the raw data, you can do this through the Google Drive interface.  Once you do that, all you would have to do is rerun your scripts and every one of them will point to the only and latest version of the data set.

I recommend looking into more approaches on Reproducible Research as they will help you, professionally, get the most out of your efforts.


## Integrating into Slack

If you and your group use [Slack](http://slack.com) to help in productivity, you can post text and results directly from R into your Slack channels without the copy-paste, save file, upload, etc. routine. Here is how you set it up.  This is still a bit in development and the latest version needs a little tweeking to work correctly (see subsection below).  

While this package is currently on CRAN, I recommend downloading the latest version from github.
First things first, I recommend installing the latest version from the github repository.

```{r eval=FALSE}
library(devtools)
install_github("hrbrmstr/slackr")
```

### Fixing Problem with Version 1.4.1

There is a slight problem though.  The current version of the slackr library has an error in it associated with (perhaps) a recent change in the Slack API that has not been fixed by the developer.

For me to get this to work, I had to compile the package myself after making the following change in one file.  To fix it, do the following:  

1. Download (or checkout) the repository from github at: https://github.com/hrbrmstr/slackr  
2. Open the project in RStudio  
3. Open the R file names `slackr_utils.R`  
4. In the function named `slackr_ims`  the last line (line 117) is something like `dplyr::left_join( %some stuff% )`.  Replace this line with `suppressWarnings( merge(users, ims, by.x="id", by.y='user') )`
5. The compile and install the package as:
```{r eval=FALSE}
library(devtools)
load_all()
build()
install()
```


It should work just fine from then on.  Hopefully, on the next time that this package is updated by the author, the `left_join()` problem will have been resolved.  This issue had been marked as “resolved” in the github issues a while back but apparently not pushed to the repository.


### Configuration

You can configure it from the command line, but perhaps the easiest way to get it going is to set up a config file. By default, it will look for the file `~/.slackr`  This is a plain text file (actually it is a Debian Control File - DCF) and needs the following content

<pre>
api_token: xoxp-XXXXXXXXXXX-XXXXXXXXX-XXXXXXXXXX-XXXXXXXXX
channel: #r
username: rodney
incoming_webhook_url: https://hooks.slack.com/services/XXXXXXXX/XXXXXXXX/XXXXXXXX
</pre>

You need to get the `api_token` and the `incoming_webhook_url` from slack itself.  Save that file.

### Sending Text and Graphical Ouput

Both textual output (e.g., the results of an analysis) or graphical output (such as `ggplot` objects) can be sent to your slack channels now.  Here are some examples:

```{r echo=FALSE, eval=FALSE}
library(slackr)
slackr_setup()
slackr("This is an incoming piece of text from RStudio")
```


Which results in the following in my #r slack channel:

```{r echo=FALSE}
knitr::include_graphics("media/slack_text_output.png")
```


There is also a provision for sending output graphics like `ggplot` objects.  Here is an example of a barplot of heterozygosity across all loci in the included `arapat` dataset.

```{r eval=FALSE}
library(gstudio)
library(ggplot2)
data(arapat)
he <- genetic_diversity(arapat,mode="He")
p <- ggplot( he, aes(x=Locus, y=He)) + geom_bar(stat="identity") + theme_bw()
ggslack(p)
```

Which produces the following image in my channel

```{r echo=FALSE}
knitr::include_graphics("media/slack_ggplot_output.png")
```

Very Cool!  I can imagine a lot of situations where it would be helpful to have a script send a slack notification when it has finished (or crapped out on an error) in some long-running analyses.  Now if we could just get slack to integrate with RStudio server version... 


### Markdown

This entire book is written in [RMarkdown](http://rmarkdown.rstudio.com/), a text-based approach for mixing verbiage and analyses into a single document.  There are several advantages to adopting an open appraoch such as this including:  
1. All the data and analyses are in the same place.  The only reason you see a graph above is because the markdown document has the specific code in it to make that graphical output.  It is not done in some other program and either imported or cut-and-paste into this document.  
2. It is in a text format.  If you wish, you can see the entirety of this text on my [github](https://github.com/dyerlab/applied_population_genetics) site.  

This means I can take advantage of:
  - Version control. I can make branches, develop new parts, and merge them back in.
  - No need for those *ad hoc* naming schemes (last_version.docx, really_last_version.docx, Im_totally_lost_which_version_is_this.docx), it is all kept in sync.
  - Collaborations with others is easy, just like real programmers.
3. Analyses and figures can be re-run and changes in the output can be inserted into your documents automatically.
4. Markdown documents are the source and can be converted into HTML, PDF, DOCX, presentations, etc.  All from the same source materials.  This means that you work on one file and when you need to convert it into a finalized format, you can do so with ease.

To get started working with markdown, go read over the [RMarkdown](http://rmarkdown.rstudio.com/) website and fire up RStudio. It has all the components built-in for you.  


<!--chapter:end:reproducable_research.rmd-->

# (PART) Single Population {-}

<!--chapter:end:part_single_population.rmd-->

# Hardy-Weinberg Equilibrium {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_flounder.jpg"></div>

At the heart of population genetics is the expectation that genotypes will occur at predictable frequencies given a set of assumptions about the underlying population.  This is formulated in the enigmatic Hardy-Weinberg Equation, $p^2 + 2pq + q^2 = 1$.  

In this chapter, we delve in to what that means, where it comes from, and how we can gain biological inferences from data that do not follow. 

The underlying idea of Hardy-Weinberg Equilibrium (HWE for brevity) is that genotypes should occur at frequencies as predicted by probability theory alone IF (and this is the big part), the population we are looking at is mating in particular ways.  Lets spend a little time talking about what that means and what these assumptions actually are. 

## Genetic Assumptions
 
Implicit in the description Above is a characterization of what constitutes a genotype.  For simplicity, as this is the way it was originally defined, we will assume that the species we are looking at is diploid, carrying homologous chromosomes from both parents.  Diploidy means that in the adult life stage, each individual has two copies of each allele, one from each parental individual. If diploidy is true, we can denote the genotype as $AA$ for the diploid homozygote, $AB$ for the heterozygote, and $BB$ for the other diploid homozygote.  If it is not true and the species (at least at that life stage) is haploid, then will will represent the genotype as $A$ or $B$. Higher levels of ploidy (triploid, tetraploid, hexaploid, etc.) are also possible in many taxa.  While the math that follows is essentially the same for these ploidy levels, the actual algebra is a bit more messy.  If you are working on taxa with higher ploidy levels, you can do many of the same operations we will focus on in gstudio, consult the documentation for more information.

In addition to ploidy, genetic constraints on the formulation of the classic HWE assume that the species has two separate sexes, both of which contribute to the next generation.  In general, this rules out uniparentally inherited markers such as mtDNA or cpDNA, which are transmitted in most cases by only one of the sexes during reproduction.  Examining mating events between diploid individuals is something we should all remember from basic biology and genetics through the use of the Punnet Square.  That was a tool designed to teach us about sexual mating, however, it is also a great example of the probabilities of mating we see in sexual reproduction.  These probabilities play direct roles in the derivation of HWE.

Two additional constraints arise directly from diploid sexual reproduction.  First, we assume that the likelihood of a particular allele is independent of the sex of the individual that contains it.  For example, male individuals should have as many alleles (e.g., no sex chromosomes that mix ploidy of genotypes) and alleles at that can occur with equal frequency as the other sex (no sex biased alleles probabilities as may arise from say sex-biased lethals).  

Finally, we will invoke both of Mendel's Laws of Inheritance.  His first law states that the alleles at a particular locus are statistically independent from each other.  If there is an A allele provided by one parent, the probability of the second allele is either $A$ or $B$ is entirely independent of that first alleles state.  The second law, though not necessarily relevant for many HWE applications, deals with the probability of genotypes at two loci.  Here we assume that the alleles present at one locus are statistically independent of those at the other locus.  We will return to this when we get to linkage but for now lets assume it is a valid assumption.


### Punnet Squares

A Punnet square is a simple tool used to teach transition probabilities in basic genetics and biology courses.  Here is an interactive example using the 2-allele locus (though the tool can handle other ploidy levels) that you can play with to see the kinds of offspring genotypes produced by individual matings.  

```{r echo=FALSE}
knitr::include_app("http://128.172.160.215/punnet_square/", height="500")
```

The population we are examining need to have some specific demographic parameters for HWE to be valid.  The most basal assumption that is invoked is that the population size must be very large.  In a fact, most of the work of R.A. Fisher and J.B.S. Haldane in developing the modern synthesis relied upon this assumption as well.  If population size, $N$, is too small, then the stochastic change (what we call genetic drift) can have serious influences on allele frequencies.  

Next, and perhaps this is more for our laziness than any other reason, we assume that the generations do not overlap.  This means that the during a single generation, all individuals participate in the mating events, each of which is equally likely to be selected as one participating in a mating event (e.g., random mating).  For simplicity, lets denote the frequencies of these genotypes as $f(AA) = P$, $f(AB) = Q$, and $f(BB) = R$. If these individuals are really mating randomly with respect to their genotypes, then the probability of union between any pair is simply the product of their population frequencies.  


&nbsp; | **AA** | **AB** | **BB**
:-----:|:------:|:------:|:-------:
**AA** | $P^2$  | $PQ$   | $PR$
**AB** | $PQ$   | $Q^2$  | $QR$
**BB** | $PR$   | $QR$   | $R^2$


More importantly for our simplicity in terminology, at the end of the mating episode, all of the next generations offspring are produced and all adults die.  If there were overlapping generations or multiple mating events, we would have to integrate these processes along a continuous timescale rather than treating a single mating generation as a discrete unit.  Totally possible, and even fun, but not part of the original formulation of HWE.

## Evolutionary Assumptions

Finally, we need to invoke two evolutionary assumptions to meet the requirements of HWE.  First, we must assume that there is $AB$solutely no mutation.  A non-zero mutation rate, $\mu > 0$, means that the state of a particular allele, say the $A$ allele, has a likelihood of spontaneously becoming something other than the $A$ allele (e.g., the $B$ in this formulation).  If HWE is to help us determine the genotype frequencies having one allele spontaneously mutate to another would require that we integrate the mutation rate into HWE directly.  Perhaps not that surprisingly, that wasn't integrated back in the day.

The next assumption has a very similar consequence.  Namely, we must assume that the set of individuals that are participating in the mating event in this population is static. Individuals from other populations are not immigrating into this population and individuals within the population are not emigrating out of this population.  Just like mutation, and when we get to migration we will see directly how similar these processes are, we will take the easy way out and assume that they do not occur.  

The final assumption is based upon selection and is a very easy one to deal with.  We simply assume it doesn't happen.  If selection were operating, say an extreme form such as increased lethality of the $AA$ genotype prior to reproduction, then we would have to both specify the way in which selection is operating as well as the its magnitude.  We will come back to this topic later but for the simplicity of HWE, we make the assumption that it has no effect.

## The Mechanics

In a sampled population, we estimate the frequencies of the genotypes as:

$P = \frac{N_{AA}}{N}$,

$Q = \frac{N_{AB}}{N}$, 

and

$R = \frac{N_{BB}}{N}$

where $N_{XX}$ is the number of $XX$ genotypes and $N$ is the total number of individuals in the sample.  From these genotype frequencies, we can directly estimate the frequency of each allele ($A$ and $B$) denoted as p and q as:

$f(A) = P + \frac{Q}{2} = p$

and 

$f(B) = R + \frac{Q}{2} = q$

using the lower case versions of $p$ and $q$.  Be careful about switching these up.  The population geneticist [John Nason](http://jnason.eeob.iastate.edu), makes the connection that this is what it means to “mind our p's and q's" though I suspect the etiology of this statement has more to do with liquid volume measurements than population genetics...  

Intuitively the formulation denoted above for $p$ and $q$ makes sense as the heterozygote genotype ($AB$) frequency, $Q$, is split evenly between the homozygote genotype frequencies, $P$ (for $AA$) and $R$ (for $BB$).   If these are the only two alleles in the population then there is the additional restriction that $p + q = 1$.  It is easy to expand these methods to more than two alleles, but again the original approximation was specifically set for 2-allele systems.

Here is an interactive widget that shows how the expected frequencies of the genotypes change, even under Hardy Weinberg Equilibrium, due to changes in the allele frequencies.

```{r echo=FALSE}
knitr::include_app("http://popgen.bio.vcu.edu/shiny/apg/hardy_weinberg/", height="700")
```

In R, we can approximate these formulas using the gstudio library.  After loading in the library, we can create data objects representing diploid loci as follows:

```{r message=FALSE, warning=FALSE}
library(gstudio)
hom.AA <- locus( c("A","A") )
hom.BB <- locus( c("B","B") )
het.AB <- locus( c("A","B") )
```

These objects have have properties associated with them appropriate for representing genetic loci.

```{r}
ploidy(het.AB)
is_heterozygote( het.AB )
```

And can be used in a vector or data.frame or other R container just like any other data type facilitating easy integration into existing analytical workflows.  

```{r}
genotypes <- c( hom.AA, het.AB, hom.BB)
genotypes
```

To examine genotype and allele frequencies for a set of loci, I'll make a random collection and then use those loci as an example.

```{r}
loci <- sample(genotypes, 20, replace=TRUE)
loci
```

Here the function sample() makes a random draw of the arguments given in the first position (the genotypes), and selects 20 (the second argument), and since there are fewer potential values than observed ones, you need to tell it that you can sample the same genotypes more than once (the `replace=TRUE` part).  In analytical population genetics, the use of randomization and permutation is critical and we will see this repeatedly throughout this text.  Technically, this is a Monte Carlo simulation that was done.  Fancy, no?

Next, we can take these data and estimate genotype frequencies.  Here the expectation is also given by the code using the methodologies described in the next section.

```{r}
genotype_frequencies(loci)
```

and allele frequencies from these loci are:

```{r}
frequencies( loci )
```

It should be noted that for these function, and for almost all the functions that return data in gstudio, the return object is a native data.frame.  Using these tools, it is easy to take our data and extract from it the components necessary for estimating the degree to which they conform to the expectations of HWE.

## Trans-Generational Transition Probabilities

If the assumptions in the next generation hold, then it is possible to iterate through all possible combinations of matings, noting the probability of observing each type, and estimate the frequency of offspring genotypes.  This is a bit messy and is probably better displayed in tabular format.

Parents   | $\mathbf{P(Parents)}$ | $\mathbf{P(AA)}$ | $\mathbf{P(AB)}$ | $\mathbf{P(BB)}$
------------|:----------:|:-------:|:-------:|:-------:
$AA\;x\;AA$	| $P^2$      | $P^2$	 | &nbsp;  | &nbsp;
$AA\;x\;AB$	| $PQ$       | $PQ/2$  | $PQ/2$  | &nbsp;	
$AA\;x\;BB$	| $PR$       | &nbsp;  | $PR$	   | &nbsp;
$AB\;x\;AB$	| $Q^2$      | $Q^2/4$ | $Q^2/2$ | $Q^2/4$
$AB\;x\;BB$	| $QR$       | &nbsp;  | $QR/2$	 | $QR/2$
$BB\;x\;BB$	| $R^2$      | &nbsp;  | &nbsp;  | $R^2$

During the parents generation, the frequencies of $AA$, $AB$, and $BB$ were P, Q, and R, defining allele frequencies as p and q.  If we invoke all those assumptions about genetic, demographic, and evolutionary processes being $AB$sent from mating and production of offspring genotypes (e.g., notice how we do not have any terms in there for mutation, being sex biased, etc.) then the frequency of genotypes at the next generation are defined in the table.  For example, at the next generation, say $t+1$, the frequency of the $AA$ offspring in the population is the sum of probabilities of all matins that produced $AA$ genotypes.

$$
\begin{aligned}
P_{t+1} &= P_{t}^2 + \frac{P_tQ_t}{2} + \frac{Q^2_t}{4} \\\\ 
&= \left( P_t + \frac{Q_t}{2} \right)^2 \\\\
&= p^2
\end{aligned}
$$

The other homozygote is the same producing an expectation of the frequency of the $BB$ genotype at $t+1$ of $q^2$.  Similarly for the heterozygote

$$
\begin{aligned}
Q_{t+1} &= \frac{P_tQ_t}{2} + 2P_tR_t + \frac{Q^2_t}{2} + Q_tR_t\\
&= 2\left(P_t + \frac{Q_t}{2} \right)\left(R_t + \frac{Q_t}{2}\right)\\
&= 2pq
\end{aligned}
$$

Putting these together, we have an expectation that the genotypes $AA$, $AB$, and $BB$ will occur at frequencies of $p^2$, $2pq$, and $q^2$, which is exactly what HWE is.  It takes only one generation of mating under the assumptions of HWE to return all genotype frequencies to HWE.

## Consequences of HWE

The consequences of this are important for our understanding of what processes influence population genetic structure.  If none of these forces are operating in the population, then the expectation is that the allele frequencies should predict genotype frequencies.  However, if there are some forces that are at work, genetic, demographic, or evolutionary, then the frequencies at which we see these genotypes will deviate from those expectations.  Now, biologically speaking, is there any population that conforms exactly to these expectations—definitely not!  However, in many cases populations are large enough to not be influenced by small N, mutation is rare enough to not cause problems, etc.  

Using HWE, we can gain some insights into the which sets of forces may be influencing the observed  genotype frequencies.  In large part, we examine these changes as deviations from expectations with respect to loss or gain of heterozygotes.  These expectations, as depicted above, define an idealized situation against which we measure our data.  This metric was specifically designed to ignore almost every process and feature that population geneticists would be interested in looking at.  If our data are consistent with these expectations then we have no support for the operation of any of these processes.  Boring result, no?  In a larger sense, the entire field of Population Genetics is devoted to understanding how violations of these assumptions influence population genetic structure.  If everything was in HWE and none of these processes were operating, there would be very little for us to do besides DNA fingerprinting and we would all be forensic examiners...

If some of these features are operating, they do have specific expectations on how they would influence the frequency of genotypes (specifically the homozygotes).  Specific examples include:

Decrease Heterozygosity	Increase Heterozygosity
Inbreeding	Coding Errors
Null Alleles	Gametic Gene Flow
Positive Assortative Mating	Negative Assortative Mating
Selection against Heterozygotes	Outbreeding
Wahlund Effects	Selection for Heterozygotes


In closing, it should be noted that HWE presented here is easily expandable to loci with more than two alleles as well as for loci with different ploidy levels.  The same approaches apply, though it is not as clean to estimate the expectations.  In the next section, we look at how we specifically test for HWE in our data.

At a more meta level, one could see the entirety of this topic and population genetics as a discipline as understanding how deviations from HWE manifest and testing for the strength of these deviations.  It is becoming less common for people to actually test for HWE as there are many ways that you can have a deficiency (or excess) of heterozygotes and it is probably much better to focus on the specific processes that may causing these deviations rather than the magnitude of the deviation alone.  That said, it is a nice organizing principle to use HWE as a straw man argument on which to frame the rest of this topic.

<!--chapter:end:hardy_weinberg_equilibrium.rmd-->

# Genetic Drift {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_beach_donkey.jpg"></div>


Genetic drift is the stochastic process resulting from a reduction in the size of a population.  If populations are infinite in size, both allele and genotype frequencies should remain constant through time.  However, if populations are not infinite (reality knocking), then drift may act as a diversifying force.  This chapter examines the process of drift how we can measure its consequences.

Genetic drift is the process by which random selections of gametes produced during mating events result in stochastic changes in allele frequencies through time.  This is not a terribly ground shattering idea, and our understanding of random chance provides some great insights into the underlying mechanisms operating in populations.  

As an analogy, consider the simulation scenario where we have two alleles, $A$ and $B$, at a haploid locus (for simplicity).

```{r}
Alleles <- c("A","B")
```

Collections are drawn at random using different sample sizes 

```{r}
sample_size <- c(5,10,20,50,100)
```

We are implicitly assuming that the frequencies of these two alleles are identical (e.g., $p_A = p_B = 0.5$).  We can see the effects of genetic drift by running this ‘simulation' in which populations are randomly created selected and used to estimate the frequency of an allele and how sample size influences the stability of the allele frequency.  If `N` does not influence the estimation of allele frequencies then all examples should yield an estimated allele frequency equal to $0.5$ (as they truly are).  

To do this, we set up the parameters in a data.frame

```{r}
df <- data.frame( N = rep( sample_size, each=50 ),FreqA = NA)
```

and then go through data frame, each row in order, and pull out random collections from Alleles in sizes equal to sample_size.  To do this, I'm going to use a `for()` loop.  This is simply a function that repeats a set of code a specified number of times.  In this case, I will need to take the first row, make a random sample of alleles given the value of `df\$N[1]`, find the frequency of an allele in that random sample, and then store the frequency back into the data.frame at `df$FreqA[1]`.  I then need to go to the second row and do the same thing again.  And then the third row.  Then the fourth.  etc.  

The format of a `for()` loop is pretty easy.  You define a sequence of values to use (in this case I will use the variable named row to contain the sequence of integers from `1:nrow(df)`).  Each time row is assigned a value, the code within the curly brackets ‘{}' after the `for()` statement is executed and the value of row is constant during that iteration.  

```{r}
for( row in 1:nrow(df) ) {
  # make random selection from Alleles with replacement of a particular size
  a <- sample( Alleles, size=df$N[row],replace = T)
  # find the frequency
  f <- sum( a == "A" ) / length(a)
  # assign it back to the data.frame
  df$FreqA[row] <- f
}
```

Here, row starts out at 1, then the code inside the brackets is run with `row=1`.  After it reaches the end of the loop, row is incremented by one and the code is then run with `row=2`.  This is repeated until `row = nrow(df)` and then the looping is finished.  If you type this in and do it, you have done your first stochastic (some would call it Monte Carlo) simulation!  Congratulations.  That wasn't so hard.

Lets now look at the data.

```{r}
summary(df)
```

The frequencies observed in each run are all over the place, basically covering the entire allowable range for an allele frequency.  More importantly though lets look graphically at the allele frequencies using a boxplot where frequencies are grouped by sample size, `df$N`.

```{r drift_simulation, fig.cap="Allele frequencies estimated from random selections of alleles under varying sample sizes."}
library(ggplot2)
p <- ggplot(df, aes(x=factor(N), y=FreqA)) + geom_boxplot() 
p + xlab("Population Size") + ylab("Estimated Allele Frequency")
```

The plot of these data show an asymptotic reduction in the variation around the estimated allele frequency as you increase `N`.  

This is to be expected, as we sample more individuals, we will become more confident that the parameters we estimate are closer to the real value.  In statistics, this is the law of large numbers or the Central Limit Theorem.  It is the same thing here.  If there is a small population, each mating event may have the opportunity to have a wide fluctuation in gamete allele frequencies.  All other things being equal, the frequencies will change each generation, an amount inversely proportional to the size of the population.

Small populations are not rare, some organisms exist only at small densities (e.g., top predators).  Others may have undergone population bottlenecks, exist in marginal habitats at the edge of their tolerance range, or have a life-history that goes through periodic outbreaks.

Algebraically, we can determine the probability of observing a specific number of alleles in the next generation given random mating assumptions and a small sample size.  The selection of diploid alleles follows a binomial expansion.

\[
P(N_A|N,p_A) = \frac{2N!}{N_A!(2N-N_A)!}p_A^{N_A}(1-p_A)^{2N-N_A}
\]

This expansion has two parts:

The binomial coefficient, $\frac{2N!}{N_A!(2N-N_A)!}$, that determines the number of different ways we can get two alleles by random draw and have $N_A$ samples of the $A$ allele.

The probability, $p_A^{N_A}(1-p_A)^{2N-N_A}$, of observing any each occurrence of the those samples based upon the frequency of the $A$ allele, $p_A$, in the population.

```{block, factorial-too-big, type='comment_warning'}
Your ability to use the binomial expansion to solve this kind of problem relies upon the sample size, which becomes rather large rather quickly.  In fact, R cannot estimate the value over 171!
```


As an example, consider the idealized situation where we have $N=20$ individuals, all of which are heterozygotes.  That means that for a diploid locus, $N_A = N_B = 20$.  The probability of observing exactly 20 $A$ alleles in the next generation can be calculated as:

```{r}
p <- 0.5
N <- 20
N.A <- 20
```

where the binomial coefficient is given by

```{r}
coef <- factorial(2*N)/ (factorial(N.A)*factorial(2*N - N.A))
coef
```

and the probability of each one of those is

```{r}
prob <- p^N.A * (1 - p)^(2*N-N.A)
prob
```

and the final probability associated with observing $N_A=20$ after a single round of random mating is

```{r}
Prob.20.A <- coef*prob
Prob.20.A
```

Which is not so good, if you are interested in stability of allele frequencies.  Only `r 100*print(Prob.20.A,digits=0)`% of the time would the allele frequency stay the same and `r 100-100*print(Prob.20.A,digits=0)`% of the time allele frequencies would change.  This change is not so much an evolutionary forces as it is a stochastic process associated with low sample sizes.  

So if it is not that likely that the allele frequencies will stay the same, what does the entire distribution of expected allele occurrences look like? Lets set the variable N.A equal to a sequence of potential allele counts (from 0 to 2*N) and then estimate and plot the distribution.

```{r}
N.A <- 0:(2*N)
Frequency <- factorial(2*N)/( factorial(N.A)*factorial(2*N-N.A)) * p^N.A * (1-p)^(2*N-N.A)
df <- data.frame(N.A,Frequency)
ggplot(df,aes(x=N.A,y=Frequency)) + geom_bar(stat="identity") + xlab(expression(N[a]))
```

Even for the tails of this distribution, the probabilities are not zero (here are the first 10 entries)

```{r}
df[1:10,]
```

meaning that it is possible to start with everyone as a heterozygote and after a single random mating event, the population may either be fixed for the $A$ allele or have lost it altogether and only have aa homozygotes.  The shape of this distribution is not only influenced by $N$ but also by $p$. Here is what it looks like if the frequency of the allele is $0.10$ instead of $0.50$.

```{r}
p <- 0.1
Frequency <- factorial(2*N)/( factorial(N.A)*factorial(2*N-N.A)) * p^N.A * (1-p)^(2*N-N.A)
df <- data.frame(N.A,Frequency)
ggplot(df,aes(x=N.A,y=Frequency)) + geom_bar(stat="identity") + xlab(expression(N[a])) 
```

## Consequences of Genetic Drift

The more skewed the allele frequency is away from $0.5$, the more likely the population will be to become fixed for one allele or the other.  

### Time to Fixation

In fact, there is a well known relationship between the probability of fixation and the combination of both allele frequency and population size.  Namely, the expected time to fixation, tfix (in generations), for a population of size Ne with two alleles (occurring p and q) is:

\[
t_{fix} = \frac{-4N_e(1-p)\log{(1-p})}{p}
\]

which for values of $N$ for $p=c(0.01, 0.1, 0.25, 0.5)$ are

```{r echo=FALSE, fig.cap=""}
library(ggplot2)
df <- data.frame()
N <- 10^(1:5)
for( p in c(0.01, 0.1, 0.25, 0.5)){
  t <- (-4*N*(1-p)*log(1-p))/p
  df <- rbind( df, data.frame(p,N,t))
}
df$p <- factor( df$p )
ggplot( df, aes(N,t,color=p)) + geom_line() + xlab("Population Size (N)") + ylab("Expected Time to Fixation (generations)")
```


This parameter Ne can be quite deviant from the census size $N$ depending upon several features of the organisms life history.  We return to that later and discuss it in depth, for the time being, lets just assume it is a measure of the size of a population.  That said, the stochastic selection of alleles due only to population size, can have significant effects on allele frequencies, available genetic diversity, and genotypic composition.  In this simple two-allele system (often referred to as the Wright-Fisher model), if drift is the only feature that is influencing allele and genotype frequencies, the variance in allele frequencies through time has an expectation of:

\[
\sigma^2_p = pq\left[ 1-exp\left(-\frac{t}{2N_e}\right)\right]
\]

which if examined for changes in Ne for fixed p=0.5 


```{r  fig.cap="Expected variance in allele frequencies through time for a Wright-Fisher model of genetic drift for three different effective population sizes.", echo=FALSE}
t <- 1:100
p <- q <- 0.5
ne <- function( t, Ne, p, q) { return( p*q*( 1 - exp( -t/(2*Ne)))) }
df <- data.frame( Generation=c(t,t,t),
                  sigma_p= c( ne(t,10,p,q), ne(t,100,p,q), ne(t,1000,p,q) ),
                  Ne=rep(c("10","100","1000"), each=100) ) 
ggplot( df, aes(x=Generation, y=sigma_p, color=Ne)) + geom_line() +
  xlab("Generation (t)") + ylab(expression(sigma[p]^2))
```

or changes in p for fixed Ne = 100 are

```{r fig.cap="Expected variance in allele frequencies through time for a 2-allele Wright-Fisher model of genetic drift across different starting allele frequencies."}
df <- data.frame( Generation=c(t,t,t),
                  sigma_p= c( ne(t,100,.1,.9), ne(t,100,.25,.75), ne(t,100,.5,.5) ),
                  p=rep(c("p=0.10","p=0.25","p=0.50"), each=100) ) 
ggplot( df, aes(x=Generation, y=sigma_p, color=p)) + geom_line() +
  xlab("Generation (t)") + ylab(expression(sigma[p]^2))
```

The important distinction here between these two graphs are that:  

1. For different effective population sizes, the larger the population, the more stable the allele frequencies through time.  For $N_e = 1000$, the variance in allele frequencies is relatively small, compared to the other population sizes, even after 100 generations.  
2. Allele frequencies show the opposite effect, with low allele frequencies, the variance is less than that in populations with larger allele frequencies (maximizing when $p = \frac{1}{\ell}$, where $\ell$ is the number of alleles).  There is more genetic variance with a more even distribution of allele frequencies (something we will return to when we talk about Fishers Fundamental Theorem).

### Time to Allele Loss

Time to fixation relates to the loss of all alleles, though in the data we often deal with, there is not a simple $p=q$, two allele system.  However, the models developed thus far, can give us an idea of the expected time to allele loss by rearranging the expectations a bit.  

\[
t_{loss} = -\ln(p)\frac{4N_ep}{(1-p)}
\]

## Mutation & Drift

If mutation is also a factor, then the time to allele loss, tloss, is modified a bit.  We will come back to discussions of mutation in a later chapter and get a bit more in-depth to its consequences. However, depending upon the mutation model, its effects are generally that it increases the rate of allele loss rather than decrease it.  This is because a random mutation (in a system where there are more than two alleles) is more likely to make allele A become something else than make one of the potential large number of other alleles into an A.

With a non-zero mutation rate, $\mu$, the time to loss is approximated to be

\[
t_{loss} = \frac{1}{\mu}
\]

if and only if $N_e\mu$ is very small (e.g., << 1).  However, if it is larger than that, the expectation can be approximated as:

\[
t_{loss} = \frac{\ln(N_e\mu) + \gamma}{\mu}
\]

where the $\gamma$ term is Euler's constant (Masel et al. 2007).


<!--chapter:end:genetic_drift.rmd-->

---
output: 
  html_document: 
    keep_md: yes
---
# Effective Population Size  {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_chiten.jpg"></div> 

The concept of a "population" is a bit nebulous and up for debate depending upon how you define it and what you are measuring.  This section examines the way in which the size of a population can be estimated and the consequences of how alternative mating systems may influence the distribution of genetic variation within populations.

In the previous section on genetic drift, we showed how the number of individuals in the population is critical to the stability of allele (and by extension genotype) frequencies.  In this section, we examine additional factors that may influence the number of individuals that should be considered when discussing the consequences of genetic drift with respect to conservation and management practices.

## Demographic Effective Population Sizes

The census population size is a measure of the raw number of individuals present in a population, $N$.  From a genetic perspective, the counting number of individuals in a population is not quite what we are after as there are many different factors (sex rations, life history traits, age stratification, behavioral processes) that may influence the size of the ‘breeding' population that is contributing to the set of alleles in the next generation.   

Here are several situations we need to consider.  

1. *Selfing:*  In a population where individuals produce offspring without the need to mate with other individuals, the effective population size of the population is 
\[
N = N + \frac{1}{2}
\]
This is one of the few instances where you can have an effective population size larger than the census population size. This is because each of the individuals in the population can produce offspring without any other individuals.  Consider the extreme case where a single colonist arrives in a new habitat.  Here, because of the mating system, the individual may actually produce an offspring and persist despite being a lone colonist.  
2. *Monoecious:* Here the species may have reproductive structures that can act as both male and female but they are located on the same individual.  Here the probability of selecting a gamete is $\frac{1}{N}$.  Given that there are $N$ individuals in the population, the probability of two individuals have the same allele from the same parent (IBD) is 
\[
P(IBD) = N \left( \frac{1}{N_e} \right)^2 = \frac{1}{N}
\].   
3. *Dioecious:* Here there are two sexes in the species, say male and female.  The number of males and females ($N_m$ and $N_f$ respectively) determine the effective size.  If there is random mating among the individuals, and allelic segregation is random then the probability that each allele is given to the next generation is $\frac{1}{2}$.  For any randomly selected female,  the probability that she would give the same allele to two of her offspring (IBD) is: $\frac{1}{2}\frac{1}{2} = \frac{1}{4}$.  In this population of $N_f$ female individuals, the $P(IBD|N_f) = \frac{1}{4N_f}$.  For the males, the exact same calculation is made and their probability if IBD is $P(IBD|N_m)=\frac{1}{4N_m}$.  Taken together, the total 
\[
\frac{1}{N_e} = \frac{1}{4N_m} + \frac{1}{4N_f}
\]
and solving for $N_e$ gives
\[
N_e = \frac{4N_mN_f}{N_m + N_f}
\]
Depending upon the skew in sex ratio here, the differences in $N_e$ may be striking.  Consider the following code.

```{r warning=FALSE, message=FALSE}
library(ggplot2)
N <- 100
Nm <- 1:99
Nf <- N - Nm
Ne <- (4*Nm*Nf)/(Nm + Nf)
df <- data.frame(Nm,Ne)
ggplot( df, aes(x=Nm,y=Ne)) + geom_line() + geom_point() + theme_bw(base_size=16)
```
Here the maximal size the population can be occurs when there are even sex ratios (e.g., $N_m = N_f$) and it decreases precipitously as you deviate from equality.  In extreme situations, there may be only one (or a diminishing small number) of male or female individuals that participates in mating events (e.g., harem mating in elephant seals).

Some genes are located on sex-specific chromosomes (e.g., X-Linked genes in humans).  This is an identical situation to mixed ploidy taxa such as the Hymenoptera where the female is diploid and the male is haploid.  Given the unequal distribution of alleles in the population based upon who has one chromosome and who has two, the effective population size here is given as:

\[
N_e = \frac{9N_mN_f}{4N_m + 2N_f}
\]

In this case, if $N_m = N_f$ then $N_e = \frac{3}{4}N$, which makes sense as the female individuals are giving two copies to the next generation but the males can only contribute a maximum of one copy.  As a result, the maximum population size here is always going to be less than for diploid species/markers.  

In the extreme case where there is only one female, as is found in honeybees, the equation reduces to

\[
N_e = \frac{9N_m}{4N_m+2}
\]

which in the limit of large $N$ approaches $\lim_{N_m \to \infty} = 2.25N$.

## Genetic Effective Population Size

There are several ways to estimate a ‘genetic effective' population size, two of which we will be covered here.  In essence, a genetic effective population size is an idealized population, one in which all the factors in HWE are true AND produces a set of data like the one you see.  Most measurements of effective population size are based upon two generations of data.  In each generation, a statistic is derived, and the change in that statistic is used to estimate the idealized population size that could produce a change in that parameter as large as the one observed.    

### Inbreeding Effective Population Size

This parameter estimates the size of an idealized population needed for changing heterozygosity across generations by some specified amount.  This change in heterozygosity is primarily attributed to inbreeding, a topic fully explored in \@ref(selfing) and \@(mixed-mating).  The inbreeding effective size, $N_e^i$ is based upon the genealogy of alleles at a locus and the change in inbreeding values across generations. 

```{r echo=FALSE, fig.cap="Two ways alleles at a single locus can be the same.  Either they are descended from a common ancestor (autozygosity) or are derived from different ancestors but mutation has cause them to be identical (allozygosity)."}
knitr::include_graphics("media/allele_lineages.png")
```


Each allele in a genotype has a history that can be traced backwards.  If it is a homozygote genotype, the two alleles can be the same because of two possible scenarios.

*Identity by Descent (IBD)* or *Autozygosity*: Both alleles are the same because they can be traced, in a pedigree, back to a single common ancestor that gave the same allele to two different offspring. This is the definition of ‘inbreeding' that we will come back to.  

*Identity by State (IBS)* or *Allozygosity*: Both alleles are the same but only because of a mutation event that caused them to become identical at some time in the past.

The distinction is important because we need to define the probability of IBD in a population.  Consider a population of size $N_e$. In this population, the probability of two individuals having the same allele from the same parent (IBD) is 

\[
P(IBD) = \frac{1}{2N_e^i}
\]

The value of $N_e^i$ can be estimated directly by looking at the change in heterozygosity through time, which is why we call it inbreeding effective size.  The expected heterozygosity at some point in the future, say $t$ generations, is based upon the starting heterozygosity and the population size.

\[
H_t = H_0 e^{-\frac{t}{2N_e^i}}
\]

which can be re-arranged to solve for $N_e^i$ as:

\[
N_e^i = -\frac{t}{2 \ln \left( \frac{H_t}{H_O} \right) }
\]

Thus estimating the change in heterozygosity at two points in time, provides a measure of the genetic effective size an idealized population would be to see similar changes in heterozygosity assuming all the remaining factors related to HWE are correct.  

### Variance Effective Population Size

Instead of looking at changes in heterozygosity, you can also look at changes in allele frequencies across generations.  As shown previously, the probability that allele frequencies change across generations is directly related to population size, smaller populations have a greater chance of producing offspring whose allele frequencies vary wildly from their parents generation.  As the population size increases, the variance in allele frequencies ($\sigma_{\delta p}^2$) across generations decreases. 

\[
\sigma_{\delta p}^2 = \frac{p_{t-1}q_{t-1}}{N_e^v}
\]

Rearranging this function to solve for the variance effective population size, $N_e^v$, give us:

\[
N_e^v = \frac{p_{t-1}q_{t-1}}{2\sigma_{\delta p}^2}
\]

Both $N_e^i$ and $N_e^v$ approach the problem from a different perspective and could provide vastly different numerical approximations for the underlying parameter you are interested in examining.  As a result, you must make sure that you are explicit in describing the conditions under which you are estimating these parameters and be specific about the way in which you are estimating them.

## Fluctuating Population Size

The effective population size may also change, stochastically, through time as the population grows and shrinks.  Imagine the scenario where a population has $N_t=100$ individuals.  At the next generation, some drastic change in the environment causes a large reduction in the size of the population and it is reduced to $N_{t+1}=10$ individuals.  This demographic bottleneck causes the effective number of individuals to be reduced below the original amount.  Genetically, it could only have (at most) a total of 20 alleles at any locus (though that is highly unlikely to have everyone being a heterozygote).  The effects of this reduction will be felt for many generations to come, even if the population is able to rebound to larger numbers of individuals.

To derive the expectation here, we need to consider the size of the population through time.  The probability of not being IBD during any generation is $1 - \frac{1}{2N}$.  The effective size of the population is going to be a function of the size at each generation based upon the probability of not being IBD. For fluctuating population size, $N_e$ is the idealized population size that captures the variance in $N$ across generations.  Consider the following case with two generations of data being used to estimate $N_e$.

\[
\left( 1 - \frac{1}{2N_e} \right)^2 = \left( 1 - \frac{1}{2N_0} \right)\left( 1 - \frac{1}{2N_1} \right)
\]

If we solve this for $N_e$ we get

\[
\frac{1}{N_e} = \frac{1}{2}\left( \frac{1}{N_0} + \frac{1}{N_1} \right)
\]

which is the definition of the harmonic mean.  For any arbitrary number of generations through time then, 

\[
\frac{1}{N_e} = \frac{1}{t}\left( \frac{1}{N_0} + \frac{1}{N_1} + \cdots + \frac{1}{N_{t-1}} \right)
\]

Here is an example using R to show how a single bottleneck can influence population size through time.  

```{r warning=FALSE, message=FALSE}
N <- c(100,100,10,rep(100,50))
Ne <- 100
for( i in 2:length(N)) {
  Ne[i] <- 1 / ( 1/i * sum(1/N[1:i]) )
}
df <- data.frame(Time=1:length(Ne), Ne )
ggplot(df, aes(x=Time,y=Ne)) + geom_line() + geom_point() + theme_bw(base_size=16)
```

which produces the graph above showing the effects of this bottleneck on the population lasting many generations.


<!--chapter:end:effective_population_size.rmd-->

# Mutation  {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_red_newt.jpg"></div>

Mutation is the ultimate source of genetic variation.   If mutation never existed, all our loci would be the same homozygote and all of you would look like me!


Mutation is the process whereby changes are induced into genetic encodings.  Without mutation as a process, there would be no genetic variation (and no evolution).  In basic terms, we can break mutations into two categories.

*Somatic Mutation:* A mutation that occurs outside the germline and will not be passed to subsequent generations.  Somatic mutations may influence the fitness (survival and reproductive output) of individuals but do not have an effect on the genetic information being passed to offspring.

*Germ Line Mutation:* A mutation that can be passed to a gamete, resulting in an overall change in population genetic structure.
There are several mechanisms that can result in mutations.  Mutation as a process occupies a special place in the public vernacular—just look at all the science fiction movies where mutation is the initial arbiter of conflict.  Fortunately, these scenarios do not reflect reality.  

There are many types of mutations that we can encounter in our data.  Much of what we teach about mutations at the undergraduate level has to do with DNA sequences that encode for amino acids, though much of population genetics focuses on genetic markers outside of coding regions.  It is illogical to talk about missense, nonsense, or synonymous mutations at a microsatellite locus.  For the purposes of this text, we will focus on nucleotides and other common markers.

## Mutation Models

Mutation models are conceptual characterizations of how the process of mutation results in the development of new allelic states.  At this stage, I'm going to focus only on the change of the underlying genetic material (e.g., a transition mutation or altering allele A to become allele a) and not consider the fitness consequences of these changes.

No matter how large the population is, with increasing $\mu$, the presence of newly mutated alleles are almost always observed in heterozygotic state.  

From a sampling perspective, it is difficult to get a good estimate of the allele frequencies for these rare alleles.  Depending upon your research question, this may be a good thing or a bad one.  For example, if you are looking at demographic processes, the presence of a rare allele may indicate that historical (or ongoing) gene flow may be an important factor.  Conversely, rare alleles (by definition) are not commonly observed, so if there is a ‘marker' like this, it may be difficult to actually sample without large sampling effort.

Small allele frequencies of newly mutated allele, pi, are more likely to be lost due to genetic drift as the they are more likely to be found as heterozygotes.  Only half the offspring are expected to have them and the probability of persistence is proportional to its current allele frequency.

At a two-allele locus, we can model the frequencies for a particular set of data as falling along a line defined by $p+q=1.0$.  Consider the three populations in the next figure, who have allele frequencies of f(A) = c(0.15, 0.42, 0.59).

By definition, the allele frequencies define a coordinate in 2-dimensional space for each population.  This coordinate space has as many dimensions as there are alleles at the locus, each of them being orthogonal to all the others.  For a 2-allele system, they can be represented by a x & y coordinate.  

```{r echo=FALSE,fig.cap="Allele frequencies depicted for three different populations.  All population frequency spectra must fall on the dotted line.", message=FALSE, warning=FALSE}
df <- data.frame( p=runif(3), label=c("Pop-1","Pop-2","Pop-3") )
df$q <- 1-df$p
library(ggplot2)
library(ggrepel)
ggplot(df,aes(p,q)) + geom_abline( slope=-1, intercept=1, linetype=2, color="red") + geom_point() + geom_text_repel(aes(label=label), nudge_x=0.025, nudge_y=0.025) + xlim(0,1) + ylim(0,1)
```

Three allele loci can be plot in 3-space ($x$, $y$, and $z$ coordinates).  Higher number of allele per locus are more problematic to depict on a 2-dimensional image, but the analogy holds.  

With an $\ell$-allele locus, we have $\ell-1$ statistically independent alleles.   We loose one degree of freedom because the frequencies are under the constraint of $\sum_{i=1}^\ell p_i = 1.0$---all frequencies must sum to unity.  Statistically, this approach allows us to consider our individual genotypes as well, as being describe as a matrix with $\ell-1$ statistically independent columns.  So, for example, consider the genotypes $AA$, $AB$, $AC$, $BB$, $BC$, $CC$.  These genotypes can be encoded in a matrix with 6 rows (one for each individual) and 3 columns (one for each allele).  The values inserted into each are the frequency (or alternatively the count) of the alleles in each locus.

\[
X = \begin{bmatrix}
    1 & 0 & 0 \\
    0.5 & 0.5 & 0 \\
    0.5 & 0 & 0.5 \\
    0 & 1 & 0 \\
    0 & 0.5 & 0.5 \\
    0 & 0 & 1 \\ 
\end{bmatrix}
\]

In R, we can encode loci

```{r message=FALSE, warning=FALSE}
library(gstudio)
loci <- c( locus( c("A","A")), locus( c("A","B")), locus( c("A","C")), locus( c("B","B")), locus( c("B","C")), locus( c("C","C")) )
loci
```

and translate them into multivariate encoded data in the same way.

```{r}
mv.genos <- to_mv( loci )
mv.genos
```

If you are going to use this as a data matrix for multivariate analyses, you need to drop one of the columns since each column sums to unity.  If you do not, you will not be able to invert the matrix because it is singular and essentially the entire universe will cease to exist—ok maybe I'm exaggerating a bit here but inverting singular matrices is not something you want to try.  It does not matter which of the columns you drop, some like to drop the most rare allele or the most common one, but it is irrelevant.

```{r}
to_mv( loci, drop.allele = TRUE )
```

The reason that I think this is important, in the context of mutation, is that a mutation event can move entries from one column to another (if an allele mutates to anther existing allele) or add additional columns to the matrix (if the mutation produces an allele that was not previously in the data).  The dimensionality of the data, for a locus and across loci, is simply the number of independently assorting alleles in the system.  Mutation as a process can add additional dimensions (e.g., degrees of freedom in a statistical sense) to the data itself. 
In the following mutation models, consider the consequences to how these models influence the way in which we denote genotypes and population allele frequencies in a multivariate context.

### Fixed Allele Model

The fixed allele model is one where there are a finite number of allelic states.  The frequency of the allele, say $f(A) = p$, can estimated across generations if we know the rate at which say the A is mutated to become the other state, say the a allele.  For simplicity, lets define this rate as $\mu$, on a per generation basis.  Across generations then, it is pretty easy to predict what the frequency of both alleles will be each generation by iterating each generation and estimating the frequencies based upon the previous generation.  An example of this is given in the table below, where a locus fixed for the A allele is experiencing mutation at a rate of $\mu=0.001$.

With each successive generation, a fraction of the A alleles are being mutated.  Under a fixed allele model, these mutations result in the identity of that allele being one of a fixed number of other alleles.  If this were a dual fixed allele model, the mutation rate, $\mu$, would represent the rate of change from $A \to a$.  There would also be a corresponding rate in the opposite direction, say $a \to A$ at a per generation rate of $\nu$.  The presence of these two alleles (or at least a finite set of defined alleles), who change state due to mutation is the definition of the fixed allele model.

$AA$, $AB$, $AC$, $BB$, $BC$, $CC$.

Once defined, there is a pretty easy set of algebra associated with predicting what the next and subsequent generations of allele frequencies may be based upon mutation rates.  Later when we examine migration, we will see a similar approach to estimating allele frequencies so pay attention to both the specifics of the this as well as how we are formulating the equation conceptually.

Assuming a 2-allele system, the frequency of the $A$ allele, denoted as say $p$, in the next $(t+1)$ generation is:  

1. The frequency of the allele at this generation, $p_t$, multiplied by the fraction of individuals who did not mutate from $A \to a$.  This can be written as $p_t(1-\mu)$.  To this we also add,  

3. The fraction of alleles that were not $A$ at this generation $(1-p_t)$ but who were mutated to the $A$ state during the transition from $t$ to $t+1$.  This is written as $(1-p_t)\nu$.

So the frequency at $p_{t+t}$ is:

\[
p_{t+1} = p_t(1-\mu) + (1-p_t)\nu
\]

We can plot this relationship through time to visualize how allele frequencies change.  Here I start with a relatively high mutation rate of $\mu=0.0075$ ($A \to a$) and $\nu = 0.01$ ($a \to A$) and a starting allele frequency of $p=0.75$.  We can set up a ‘simulation' on these parameters as:

```{r}
p <- 0.75
mu <- 0.75e-2
nu <- 1e-2
T <- seq(1,3000000,by=5000)
p <- rep( p, length(T) )
for( i in 2:length(T)){
  pt <- p[i-1]
  p[i] <- (1-mu)*pt + (1-pt)*nu
}
```


Where I create two vectors, one for time, $T$, and the other for the current allele frequency.  I then iterate across generations (the `for(i in 1:length(t))` ) and for each generation I grab the allele frequencies of the previous generation and use them to estimate the frequency for next generation.  After we iterate across all the generations, we can plot the expected allele frequencies through time as:

```{r message=FALSE, warning=FALSE, echo=FALSE, fig.cap="Expected allele frequencies for a single 2-allele locus with mutation rates of $\\mu=0.0075$ and $\\nu=0.01$ and a starting allele frequency of $p_0 = 0.75$."}
library(ggplot2)
df <- data.frame( T, p )
ggplot( df, aes(T,p) ) + geom_line() + xlab("Generation (t)") + ylab("Allele Frequency (p)")
```

There are two things that are of interest in this figure.

1. Given the values of $\mu$ and $\nu$ the final frequency will change through time tending towards some equilibrium frequency, $\hat{p}$.  The rate of change, $\delta p$, is dependent upon how far away the initial allele frequency is from $\hat{p}$ and the difference in the mutation rates, $|\mu - \nu|$.

2. The consequences of mutation are significantly though exceedingly slow.  If you look at the figure, notice the number of generations on the x-axis.  It takes somewhere in the vicinity of 500,000 generations for the allele frequency to go from $p=0.75$ to $p=0.60$!

If we set $p_{t+1} = p_t$ (e.g., when there is no change in allele frequencies) and solve for $p$, the previous equation be rearranged to give the equilibrium allele frequencies for the $A$ allele, $\hat{p}$, given the relative values of $\mu$ and $\nu$ as:

\[
\hat{p} = \frac{\nu}{\nu + \mu} 
\]

Independent of the starting frequencies, all populations will tend towards this equilibrium state (assuming nothing else is happening to the population during all those generations).  

The rate of change and the distance away from the equilibrium frequency, $\hat{p}$, can be used to derive an estimate of allele frequencies with mutation at any arbitrary generation.  This expectation is:

\[
p_t = \frac{\nu}{\mu + \nu} + \left(p_0 - \frac{\nu}{\mu+\nu} \right)(1 - \mu - \nu)^t
\]

If you look at the components of this relationship, it can be decomposed into the following parts.  

1. The destination allele frequency where the population will eventually stabilize at, $\frac{\nu}{\mu + \nu}$,   
2. The distance the starting generation is away from this destination, $\left(p_0 - \frac{\nu}{\mu+\nu} \right)$,   
3. The rate at which the both mutation directions change this frequency, $(1 - \mu - \nu)$,   
4. And the length of time (in generations) that has elapsed since starting at $p_0$, $t$. 

This is a pretty standardized construction and we will see a very similar approach when dealing with allele frequency changes due to migration.


### Infinite Allele Model

For many genetic components, there is more than just two different states, A and a (despite what we use in teaching population genetics).  Kimura & Crow (1964) defined the infinite alleles model.  Here, instead of having only two alleles with mutation flipping states of the alleles, the infinite alleles model adds new alleles to the locus with each mutation.

Both the fixed allele model and the infinite allele model can be easily configured in R using the multivariate encoding ‘allele space' paradigm outlined previously.

### Stepwise Mutation Model

Markers such as microsatellite require slightly different mutation models.  A microsatellite locus is one where nucleotide motifs are repeated and the length of the fragment containing the repeats (plus some flanking primer sequences) is used as the definition of the allele.  Mutation at microsatellite loci is due to the gain or loss of motifs during replication.  Mutation models for these kinds of loci incorporate the probability of gain/loss of one or more motifs.  Usually, the likelihood of single motif changes are considered more common than events resulting in changing fragment lengths by several motifs at once.   This kind of mutation model can be summarized as shown below for changes in allelic state across generations.  Later when we examine genetic distance measures, we will return to this metric.

```{r echo=FALSE, fig.cap="Schematic of stepwise mutation model across three generations, commonly used in describing changes at microsatellite loci with a repeat motif length of $\\delta$ deviating from a mean fragment size."}
knitr::include_graphics("media/mutation_cascade.png")
```


## Mutation and Inbreeding

Inbreeding, the relative decrease in heterozygosity (or increase in heterozygosity) is influenced by mutation rates because of the way that mutation influences the level of autozygosity.  

\[
F_{t+1} = \frac{1}{2N_e} + \left( 1 - \frac{1}{2N_e}\right)F_t
\]

where the next generation level of autozygosity is made up of:  

- $\frac{1}{2N_e}$ is the likelihood that two alleles at a locus are autozygous this generation, and  
- $\left( 1 - \frac{1}{2N_e} \right)F_t$ is the fraction of homozygous individuals in the previous generation that were autozygous.

From this relationship, we see that $F$ increases most rapidly towards its eventual asymptote with smaller effective population sizes (see Chapter \@ref(effective-population-size)).

The presence of mutation in this model breaks up the symmetry by changing the state of alleles delivered to the next generation.  For diploid individuals, mutation can occur as depicted in the following table.

Mutant Allele | Frequency 
:------------:|:---------:
0             |  $(1-\mu)^2$
1             |  $2\mu(1-\mu)$
2             |  $\mu^2$

For any particular locus, a mutation either not occur, occur once, or occur twice.  If an individual is $AA$ a mutation changes this genotype to be heterozygous, and by definition, incapable of being autozygous (or even allozygous).  A genotype of $AB$, while heterozygous and neither auto- or allozygous, cannot mutate to an autozygous state as they alleles cannot be descended from the same individual.  As a consequence, only genotypes that have 0 mutation events (at a rate of $(1-\mu)^2$) contribute to increases in $F$.

This leads to a reformulation of the expected inbreeding function as:

\[
F_{t+1} = \left[ \frac{1}{2N_e} + \left( 1 - \frac{1}{2N_e}\right)F_t \right](1-\mu)^2
\]

We can see the effects of having mutation with a small simulation.  

```{r}
T <- 1:100
F <- 1/16
mu <- 1/1000
Ne <- 10
F0 <- rep(0,100)
F1 <- rep(0,100)
inc <- 1/(2*Ne)
for( t in 2:max(T)){
  F0[t] <- inc + (1-inc)*F0[t-1]
  F1[t] <- (inc + (1-inc)*F1[t-1]) * (1-mu)^2
}
```

which if we plot both of these trajectories, produces the plot below.  

```{r echo=FALSE, fig.cap="Expected values for inbreeding F through time for a locus without mutation and one with mutation at a rate of $\\mu=0.001$. "}
df <- data.frame(Generation=c(T,T), F=c(F0,F1), Mutation=rep( c("Yes","No"), each=100))
ggplot(df,aes(x=Generation,y=F,color=Mutation)) + geom_line()
```

In these data, the overall effect on inbreeding is to modulate autozygosity through time by removing alleles that are identical by descent in the previous generation.  The functional consequence here are that while mutation can influence inbreeding to a large degree, it does so with a relatively small impact.  If you look back at the changes in allele frequency due to drift, the potential for changing allele frequencies was much greater than we see here.  After 100 generations in the code above, even with a $N_e = 10$, there is only a net difference in allele frequencies of $\delta p = 0.0372$!  Overall, mutation is the ultimate source of diversity, though has a small (though perceptible) per-generation influence on inbreeding.

As was the case in other situations, there is an expectation for equilibrium with respect to inbreeding based upon both effective population size ($N_e$) and mutation rate ($\mu$).  If we set $F_{t+1} = F_t$ and solve, we find the equilibrium inbreeding level to be:

\[
\begin{aligned}
\hat{F} &= \frac{1-2\mu}{4N_e\mu + 1 - 2\mu} \\
&\approx \frac{1}{4N_e\mu+1}
\end{aligned}
\]

with the approximation for small values of $\mu$ (small, as in what we usually see in our data).  


## Estimating Mutation

Mutation rates can be estimated from several kinds of data.  Here are some simple examples using phenotype and genotype.  The end result here is to use observation to uncover an estimate of the rate of mutation, $\mu$, from either or samples or our genetic data.

### Phenotypes

Outside some biochemical changes observed in model systems, there are few traits that we can examine directly that are the result of a single mutation event at a single location.  While we provide a lot of verbiage to single-gene traits in undergraduate curriculum when introducing Mendelian genetics, there are very few traits that actually respond in that fashion.

- *The mutation must have a distinctive phenotype.*  To identify the presence of a mutation, you must have the ability to clearly, and without error, cleanly identify when a mutation arises.  This trait must be distinctive in that it has to have a categorical condition.  Quantitative traits cannot be used since the contribution of an individual mutation may be minuscule in the trait value.  Moreover, several different mutation events, potentially spread throughout the genome, may contribute to the same incremental change in the observed phenotype. 

- *The trait must be fully expressed.*  Epistatic effects that mask mutants prevent accurate estimation of mutation rates.  For example, in the labrador retriever, the locus that controls coat color (brown is recessive to black) can be masked by a second locus that determines if the coat color pigment is put onto the hair shaft (if not the dog is yellow).  The presence of this second locus that influences the observed phenotype can prevent the estimation of mutation rates for the first.

- *The observed phenotype must only be the result of the mutation.*  If the phenotype is subjected to variation due to environmental effects, this can also cause problems with estimating the mutation rate.  

Given these caveats, it is possible to gain mutation estimates from phenotype, it is just difficult in most systems.

As an example, consider the case where a phenotype is governed by the normal dominant/recessive alleles as characterized by Mendel.  Mutation rates for the dominant allele can be estimated by examining the number of individuals in the next generation ($x$) that were produced by parents who did not carry the dominant trait (e.g., they were recessive homozygotes).  Here the mutation rate is estimated by:

\[
\mu = \frac{x}{2N}
\]

Estimation of mutation in recessive alleles can be accomplished in the same way.  Here though, offspring are only considered from mating events between individuals known to be homozygous dominant those whose genotype produces the recessive trait.  The offspring whose phenotype is  recessive ($x$) are those that the dominant allele given by one parent had mutated.  As such, since only one parent is contributing, the mutation rate here is estimated as:

\[
\mu = \frac{x}{N}
\]


### Genotypes

Looking at the genotypes directly can also provide an estimate of mutation rate.  I'll give two quick examples below, as both are relatively straight forward in their implementation.  First, if one can maintain lineages of individuals across generations, it becomes relatively easy to monitor the rate at which novel mutations occur.  Perhaps the most elegant examples of this approach is that from Lynch *et al.* (2008) who propagated parallel yeast cell lines across roughly 4,800 divisions.  Subsequent sequencing of original and progenitor samples provided an estimated rate of $\mu=0.00000000033$ per generation (much smaller than we have been playing with in this section).  They also examined differences in transition and transversion mutations as well as those in the mitochondria.  This paper is a tour-de-force and should be read by everyone interested in mutations.

Not everyones study organism fits into a 4000 generation propagation program! That said, there have been some pretty good estimates of mutation rate based upon reconstruction of relatedness among historically separated lineages.  If you have a good reconstruction of genetic divergence (as in a phylogeny) and events that have been verified to have specific dates ascribed to them (such as fossil records), you can also estimate mutation rates from these.  By far, mtDNA has been used as it has is assumed to be ‘clock-like' (though see Galtier *et al.* 2009), though other targets have been used as well.  At present, we have several lines of evidence suggesting rates of mutation in various taxonomic groups, each of which may differ substantially.  Since rates differ across groups and across regions in the genome, it is important that you consider the consequences that particular values of $\mu$ have on your biological inferences.  It is much more common to need an estimate of mutation as an input to another analysis, such as a coalescent model, than to need to estimate it directly from your data.  In these common situations, it is good practice for you examine the consequences variation in $\mu$ have on your downstream biological inferences—in some situations a magnitude or more change in mutation may have no appreciable effect whereas in others it may.

<!--chapter:end:mutation.rmd-->

# (PART) Nonrandom Rating {-}

<!--chapter:end:part_nonrandom_mating.rmd-->

# Selfing  {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_dogwood.jpg"></div>

Inbreeding is one of the most common deviations from random mating (and hence Hardy-Weinberg Equilibrium) that we encounter in natural populations.  Inbreeding is defined, sensu stricto, as mating between two related individuals.  This ranges from complete selfing, where one parent alone produces offspring, to consanguineous mating, where individuals with some degree of relatedness produce offspring.  The genetic consequences of inbreeding are entirely in how alleles are arranged into genotypes, not in changing allele frequencies.

The primary consequence of inbreeding is a reduction in the frequency of the heterozygous genotype.  Consider the following Punnet square where a heterozygote is producing a selfed offspring.

&nbsp; | $A$  | $B$
-----:|:----:|:----:
$A$   | $AA$ | $AB$
$B$   | $AB$ | $BB$

The offspring in the next generation are only 50% heterozygotes.  Each generation of selfing,  homozygotes produce homozygotes but only half of the offspring from heterozygotes stay as such. This process increases the relative frequency of homozygous genotypes in the population, though if you look at the offspring, the frequency of alleles do not change—there are as many A alleles as B alleles in the next generation of selfing. 

Conceptually, we can define an inbreeding parameter, $F$, depicting the extent to which genotype frequencies have deviated from HWE due to inbreeding.  But to do so, we need to differentiate between homozygote genotypes that have the same alleles because they are inbred (e.g., both alleles can be traced to a single allele in common ancestor) from those that are identical because they just happen to have the same allele (not of common ancestry). For this, we will define terms for these genotypes as:

**Autozygous** - Two alleles that are identical within a genotype because they came from the same individual in the previous generation. These alleles are Identical by Descent and will be found in the population at a rate of $pF$.

**Allozygous** - Two alleles that are identical within a genotype but they came from alternate individuals in the parental generation. These alleles have Identity by State and are expected to occur at a frequency of $p^2(1-F)$.

Together, the expected frequency of the AA genotype, $E[AA] = p^2(1-F) + pF$.

At the extremes, if there is no inbreeding—all homozygotes are allozygous—the inbreeding statistic, $F=0$ and the expectation reduces to $E[AA] = p^2$.  Conversely, if all offspring are the result of selfing, $F=1$ and $E[AA] = p$. 

Often the parameter $F$ is the subject of our analyses and the item that is to be estimated from genetic data.  Give the definition above, $F$ is defined as the proportional loss of heterozygosity and is estimated as:

\[
\begin{aligned}
F &= \frac{H_e - H_o}{H_e} \\ 
  &= 1 - \frac{H_o}{H_e}
\end{aligned}
\]

The key point here is that inbreeding (autozygosity) is estimated ‘relative' to the expected frequencies of non-inbred heterozygotes (allozygosity).

Selfing is the most extreme form of inbreeding is that of selfing—one parent donates both gametes to the offspring.  In selfing systems, the frequency of genotypes $P = freq(AA)$, $Q = freq(AB)$, and $R = freq(BB)$, change in through time in a predictable fashion providing us a quantitative approach to characterizing the duration of inbreeding from allele and genotype frequencies.  

From first principles, selfing for these genotypes produces the following expected offspring genotype frequencies:

Parent |	Offspring | Frequency
------:|:----------:|:----------
$AA$	 |    $AA$    |   	$P$
$AB$   |  $1/4\;AA$	|    $Q/4$
&nbsp; |	$1/2\;AB$ |    $Q/2$
&nbsp; |  $1/4\;BB$	|    $Q/4$
$BB$   |  $BB$      |	   $R$

Such that the genotype frequencies in the next generation are:

\[
\begin{aligned}
P_{t+1} &= P_t + \frac{1}{4}Q_t \\
Q_{t+1} &= \frac{1}{2}Q_t \\
R_{t+1} &= R_t + \frac{1}{4}R_t
\end{aligned}
\]

The interesting thing here is that the allele frequencies at the next generation are derived from the genotype frequencies as:

\[
p_{t+1} = P_{t+1} + \frac{1}{2}Q_{t+1}
\]

But with a little re-arrangement of terms, we see that 

\[
\begin{aligned}
p_{t+1} &= P_{t+1} + \frac{1}{2}Q_{t+1} \\
 &= \left(P_t + \frac{1}{4}Q_t \right) + \frac{1}{2}\left( \frac{1}{2}Q_t \right) \\
 &= P_t + \frac{1}{2}Q_t \\
 &= p_t
\end{aligned}
\]

showing that while genotype frequencies change with each generation of inbreeding, the underlying allele frequencies remain constant!  Inbreeding only changes how alleles are packed into genotypes and in not changing frequencies, does not result in evolutionary change, *sensu stricto*.


## Changes in F

For every generation with selfing, the average level of inbreeding in the population will increase.  The amount it increases depends upon how inbred the population already is, outbred populations will have larger $\delta F$ than similar populations with higher initial $F$.

### Theoretical Expectations

As expected, selfing changes the estimate of $F$ as well.  It changes across generations with continued selfing as:

\[
F_1 = 1 - \frac{H_O}{H_E} = 1 - \frac{Q_1}{2pq}
\]

for the first generation, 

\[
F_2 = 1 - \frac{ \frac{Q_1}{2} }{2pq}
\]

for the second, 

\[
F_3 = \frac{\frac{Q_2}{2}}{2pq}  = \frac{\frac{\frac{Q_1}{2}}{2}}{2pq}
\]

and the next

\[
F_4 = \frac{\frac{Q_3}{2}}{2pq}  = \frac{\frac{\frac{Q_2}{2}}{2}}{2pq} =
\frac{\frac{\frac{\frac{Q_1}{2}}{2}}{2}}{2pq}
\]

and so on.  Numerically, if $F_1=0$, then $F2 = 0.50, F3 = 0.75, and F4 = 0.875.  Each generation, $F$ approaches 1.0 by half way.  From this pattern, the change in $F$ each generation can be estimated as:

\[
F_{t+1} = \frac{1}{2} + \frac{1}{2}F_t
\]

for each generation or 

\[
F_t = 1 - \left(\frac{1}{2}\right)^t(1-F_O)
\]

for any arbitrary time, $t$, in the future given some starting level of inbreeding.  This expectation looks like:

```{r warning=FALSE, message=FALSE}
library(ggplot2)
T <- 0:15
F <- 1 - 0.5^T
df <- data.frame( T, F)
ggplot(df,aes(T,F)) + geom_line() + xlab("Generation (t)") + ylab("Inbreeding (F)")
```

## Simulation Example

These smooth expectations are nice but real data is a bit more complex.  Real data is not so clean but it is pretty easy to simulate this process and measure the change in inbreeding through time.  Here is how that can be done.

First, I'm going to start with a population in HWE with allele frequencies of $p=q$.  To do this, we start by importing the gstudio library and making raw genotypes.  We can replicate them to make a starting population with $N=100$ individuals.

```{r message=FALSE, warning=FALSE}
library(gstudio)
AA <- locus( c("A","A") )
AB <- locus( c("A","B") )
BB <- locus( c("B","B") )
pop <- data.frame(ID=1:100, Locus1=c( rep(AA,25), rep(AB,50), rep(BB,25) ) )
```


Just to check, we can see the starting allele frequencies as:

```{r}
frequencies(pop)
```

The inbreeding parameter, $F$, is estimated on a data.frame object that has at least one column of data of type locus using the wrapper function `genetic_diversity()` and passing it the `mode="Fis"` option—the 'is' subscript here is because there are additional $F$ statistics that may be calculated for different groups of individuals which will be covered in the section on genetic structure.  The parameter $F$ is estimated for each locus and the results are given in a data.frame.

```{r}
genetic_diversity(pop,mode="Fis")
```

I am going to simulate 10 replicate runs of selfing, each starting with this very population.  During each replicate, each individual will produce one and only one offspring through selfing using the function `mate()` and iterated across generations as in the expectation.  The parameter F will be recorded after all replications are finished plot against time.

```{r}
df$Category = "Expectations"
for( rep in 0:9 ) {
  data <- pop
  F <- rep(0, length(T) )
  for(t in T){
    # estimate F
    F[(t+1)] <- genetic_diversity(data,mode="Fis")$Fis[1]
    # self all adults to make offspring
    data <- mate( data, data, N=1 )
  }
  df.rep <- data.frame( T, F, Category=paste("Replicate",rep))
  df <- rbind( df, df.rep )
}
```

The variance around expectation is moderate shows that even if all expectations are met, real data is a bit messy—perhaps expectations are more of what you'd call ‘guidelines' than strict rules.

```{r echo=FALSE, fig.cap="Replicate simulations of inbreeding in purely selfing populations (colored lines) with theoretical expectation (black line)."}
ggplot() + geom_line(aes(T,F,color=Category),data=df[df$Category!="Expectations",]) + geom_point(aes(T,F,color=Category),data=df[df$Category!="Expectations",]) + geom_line(aes(T,F),data=df[df$Category=="Expectations",]) + geom_point(aes(T,F),data=df[df$Category=="Expectations",]) + xlab("Generation (t)") + ylab("Inbreeding (F)")
```


<!--chapter:end:selfing.rmd-->

# Mixed Mating Systems  {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_solanum.jpg"></div>

Not all species that can produce selfed offspring produce only selfed offspring.  In many systems, organisms may produce offspring that are a mix of inbred and outcross matings, a condition mixed mating systems.  Using the same approach, we can define expectations for allele frequencies (a cinch since they do not change), genotype frequencies (by mixing selfing and regular HWE expectations), and the evolution of the inbreeding parameter $F$.  

To quantify the effects of mixed mating systems, we need to define a new parameter, s, as the selfing rate.  This parameter determines the fraction of offspring due to selfing, with $s=0$ meaning all offspring are outcross, $s=1$ meaning they are all selfed, and intermediate values indicating a mix of both selfed and outcrossed offspring.

Given the selfing parameter, $s$, we can easily define the expectations for genotype frequencies at the next generation by adding the expectations for the purely selfed individuals ($s$) and the non-selfed individuals ($1-s$) as shown in the table below.

In the next generation, the expected frequency of $AA$ offspring would be:

\[
E[AA]_{t+1} = sP_t + s\frac{Q_t}{4} + (1-s)P_t^2 + (1-s)P_tQ_t + (1-s)\frac{Q_t^2}{4} 
\]

which simplifies down to:

\[
E[AA]_{t+1} = s\left[ P_t + \frac{Q_t}{4} \right] + (1-s)p_t^2
\]

In a similar fashion, the heterozygote is:

\[
E[AB]_{t+1} = s\frac{Q_t}{2} + (1-s)2p_tq_t
\]

and the other homozygote reduces to:
 
\[
E[BB]_{t+1} = s\left[ R_t + \frac{Q_t}{4} \right] + (1-s)q_t^2
\]

With these expectations, we can estimate the change in genotype frequencies through time for both allele and genotype frequencies. 

```{r}
p <- fAA <- fAB <- fBB <- rep(NA,10)
p[1] <- 0.25
fAA[1] <- p[1]^2
fAB[1] <- 2*p[1]*(1-p[1])
fBB[1] <- (1-p[1])^2
```

This produces the output of these frequency vectors as:


```{r}
cbind(p,fAA,fAB,fBB)
```

Then iterate through generations 2-10 (or as long as we need to go so that we can see the pattern) as:

```{r}
s <- 0.5
for(i in 2:10){
  fAA[i] <- s*(fAA[i-1] + fAB[i-1]/4) + (1-s)*p[i-1]^2
  fAB[i] <- s*fAB[i-1]/2 + (1-s)*2*p[i-1]*(1-p[i-1])
  fBB[i] <- s*(fBB[i-1] + fAB[i-1]/4) + (1-s)*(1-p[i-1])^2
  p[i] <- fAA[i] + fAB[i]/2
}
```

```{r echo=FALSE, fig.cap="Figure 4.3 Expectation for genotype frequencies for a 2-allele locus where $p=0.25$ and the selfing rate $s=0.5$.", message=FALSE, warning=FALSE}
library(ggplot2)
df <- data.frame( F=c(fAA,fAB,fBB), T=rep( 1:10, times=3), Genotype=rep(c("AA","AB","BB"), each=length(fAA)))
ggplot(df,aes(x=T,y=F,color=Genotype)) + geom_line() + geom_point() + xlab("Generation (t)") + ylab("Genotype Frequency") + ylim(0,1)
```


The resulting frequencies look to come to equilibrium at a reasonable speed.  The frequency of heterozygotes decreases from what would be expected under HWE yet does not go to zero.  The fraction of offspring that are the result of outcross matings ($1-s$) replenish heterozygotes preventing their entire loss.  After a few generations in this simple model, we see that the genotype frequencies can come into equilibrium.  


Here is an interactive widget you can play with to see how initial allele frequency and selfing rate interact in shaping genotype frequencies.

```{r echo=FALSE}
knitr::include_app("http://popgen.bio.vcu.edu/shiny/apg/mixed_mating/", height="750")
```

It would be possible to ‘cheat' a little on those estimations as $fBB[i] = 1-(fAA[i]+fAB[i])$ in the code above but I think that is one way to cover up potentially problematic bugs in your code.  All code has bugs—do not delude yourself into thinking otherwise.  One way we can have some confidence in the code we write that we know the frequencies of the three genotypes should sum to unity.

```{r}
fAA + fAB + fBB
```

Another way is to give the code parameters where we know what the outcome should be.  For example, if $s=0$, we should see any distribution of genotype frequencies tend towards Hardy-Weinberg Equilibrium.  In fact, it should tend very quickly towards HWE, right? 

In a fashion similar to selfing, we can use real data as well to examine genotype frequency change.  The gstudio library has a function `mixed_mating()` that allows you to specify the selfing rate s. I'll let you play with that one on your own time.

## Mixed Mating Genotype Frequency Equilibriums

In mixed mating systems, there is an intermediate equilibrium frequency for each genotype frequency created by the opposing forces of HWE in the ($1-s$) fraction of the offspring that tend towards p2, 2pq, and q2 and the tendency of pure selfing ($s$) to remove half the heterozygotes each generation.  

There are two equilibriums states that are possible:  

1. When we have fixed allele frequencies (e.g., $p=1.0$).  OK, this is kind of a boring example, but it is an equilibrium (as long as there is no mutation).  
2. The one we are interested in where $Q_t = Q_{t+1}$.  We can solve for this one by setting them equal in the expectation equation and solving for $\hat{Q}$. The part after the fourth line is just making it more simplified.
 
\[
\begin{aligned}
\hat{Q} &= s\frac{\hat{Q}}{2} + (1-s)2pq \\
 \hat{Q}-s\frac{\hat{Q}}{2} &= (1-s)2pq \\
\hat{Q}\left( 1 - \frac{s}{2} \right) &= (1-s)2pq \\
\hat{Q} &= \frac{(1-s)2pq}{1-\frac{s}{2}} \\
\hat{Q} &= \frac{2}{2}\frac{(1-s)2pq}{1-\frac{s}{2}} \\
\hat{Q} &= \frac{(1-s)4pq}{2-s}
\end{aligned}
\]

If this is a calculation that is necessary to use several times, it is easy to wrap it into a function as:

```{r}
equilbrium.Q <- function(p,s){
  q <- 1-p
  return(((1 - s) * 4 * p * q)/(2 - s))
}
```

and then you will be able to use it as necessary.

```{r}
equilbrium.Q(0.5,1)
equilbrium.Q(0.5,0)
equilbrium.Q(0.25,0.25)
```
The same approach can be used to estimate expected genotype frequencies for the homozygotes, setting for example $P_{t+1} = P_t$ 

\[
\begin{aligned}
\hat{P} & =  s(\hat{P}+\frac{Q}{4}) + (1-s)p^2 \\
\hat{P} & =  s\hat{P} + \frac{1}{4}sQ + (1-s)p^2\\
\hat{P} - s\hat{P} & =  \frac{1}{4}sQ + (1-s)p^2 \\
\hat{P}(1-s) & =  \frac{1}{4}sQ + (1-s)p^2 
\end{aligned}
\]

then solving for $\hat{P}$ (or $\hat{R}$ as necessary) gives the equilibrium genotype frequency

\[
\begin{aligned}
\hat{P} & =  \frac{s\frac{Q}{4}}{(1-s)} + p^2\\
\hat{P} & =  \frac{\frac{1}{4}s\frac{(1-s)4pq}{2-s}}{(1-s)} + p^2\\
\hat{P} & =  \frac{spq}{2-s} + p^2
\end{aligned}
\]

Since the genotype frequencies tend towards an equilibrium, then the inbreeding parameter, $F$, will also tend towards its corresponding equilibrium, $\hat{f}$.  The value of this equilibrium can be found by substituting the expected heterozygote frequency  

\[
\hat{Q}=\frac{(1-s)4pq}{2-s}
\]

into the inbreeding equation and putting it in place of $H_O$.

\[
F = 1 - \frac{H_O}{H_E}
\]

and simplifying in a few steps we get

\[
\hat{F} = \frac{s}{2-s}
\]

The rage at which $F$ goes from 0 to $\hat{F}$ can be estimated much like the derivation provided for changes in allele frequencies due to mutation. 

It is 

\[
F_t = \frac{s}{2-s}\left[ 1-\frac{s^t}{2} \right] + \left[\frac{s}{2}\right]^tF_0
\]

depending upon only the selfing rate, the starting inbreeding level, and the current generation.

Here is a little gadget to play with that allows you to manipulate $F_0$, $s$ and the number of generations ($T$).

```{r echo=FALSE}
knitr::include_app("http://popgen.bio.vcu.edu/shiny/apg/selfing_F/", height="850")
```


## Estimating Selfing Rates

As with most things in population genetics, these expectations provide nice descriptions of what we expect to see in the wild, but at the end of the day we are stuck with a bunch of raw genotypes and the need to find an estimation of $s$. In fact, it is usually the value of $s$ that we are most interested in seeking, not the value of $F$.   It is possible though, to rearrange the equilibrium expectation of $\hat{F}$, solving for s and thereby providing a first approximation for population-level selfing rates (or at least the equivalent selfing rate given a bunch of other expectations). 

\[
\hat{s} = \frac{2F}{1+F}
\]

In some cases, this approximation would make sense—cases such as a study of plants where selfing is a potential outcome of mating (e.g., Schemske & Lande 1985)—but may be somewhat meaningless (in a direct sense) to species whose mating system does not allow selfing (e.g., most non-plant species).  That said, the estimate $\hat{s}$ is an idealized parameter and because it is based solely upon inbreeding, $F$, it is an integrative parameter incorporating mating patterns not just during a single round of mating but as an accumulation across many mating events as well as other forms of non-random mating such as consanguinity.

```{r echo=FALSE, fig.cap="Expected equilibrium for inbreeding parameter, F, across potential values for the selfing rate, s."}
df <- data.frame( s=seq(0,1,by=0.02))
df$F <- df$s/(2-df$s)
ggplot( df, aes(x=s,y=F) ) + geom_line() + geom_point()
```


<!--chapter:end:mixed_mating.rmd-->

# Wahlund Effects  {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_snake.jpg"></div>

Inbreeding may be indicated in the parameter $F$ due to selfing (the most direct type \@ref(selfing)), mating amongst relatives (the most common type, \@ref(mixed-mating-systems)), or due to problems that we as the investigator cause due to our ignorance of the underlying demography.  Yes, if done incorrectly, the way in which you sample individuals may may lead to incorrect inferences.  The Wahlund effect is one such way that problems in your study design may incorrectly result in inferences of inbreeding.  Here is how it works.

Consider two populations, both of which are at Hardy-Weinberg Equilibrium.  The first population has allele frequencies at a two allele locus of $p_A= 0.7;\;q_A=0.3$ and the second population has allele frequencies of $p_B=0.4;\;q_B=0.6$.  For simplicity, assume that both populations are completely isolated and have $N=100$ individuals each.  As such, we find the following genotypes in each population, each of which is exactly what is expected under HWE.

```{r echo=FALSE}
knitr::include_graphics("media/wahlundTwoPop.png")
```

If you did not know, *a priori*, these were isolated populations and artificially combined them, you would have a single composite populations created from two individual populations, both of which are in HWE.  It would look like this.

```{r echo=FALSE}
knitr::include_graphics("media/wahlundOnePop.png")
```


The question here is, given the artificial coalescence of two HWE populations, is the composite population also in HWE?  Lets see.

The observed allele frequencies from the 200 individuals are:

\[
\begin{aligned}
p_A & = \frac{2*65 + 90}{400} \\
& = 0.55 \\
p_B & = \frac{2*45 + 90}{400} \\ 
& = 0.45
\end{aligned}
\]

which would result in expectations under HWE for genotypes to be:

\[
\begin{aligned}
E[N_{AA}] &= p^2 * N \\
& = 0.55^2 * 200 \\
& = 60.5
\end{aligned}
\]

\[ 
\begin{aligned}
E[N_{Aa}] &= 2*p*q * N \\
& = 2*0.55*0.45 * 200 \\
& = 99
\end{aligned}
\]

and 

\[  
\begin{aligned}
E[N_{aa}] &= q^2 * N \\
& = 0.45^2 * 200 \\
& = 40.5
\end{aligned}
\]

which is not quite what we got in the population.  We observed fewer heterozygotes (and correspondingly more homozygotes) than expected.  A reduction in heterozygosity like this, would produce a F-statistic as:

\[
\begin{aligned}
F & = 1 - \frac{H_O}{H_E} \\
& = 1 - \frac{ \frac{N_{Aa}}{N} } { 2pq } \\
& = 1 - \frac{ 0.450 } { 0.495} \\
& = 0.091
\end{aligned}
\]

which is a level of inbreeding of almost 10%!  How can that be?  This messes everything up, how can you trust any inbreeding statistic?  In the most direct sense, you cannot know if what you are seeing is true inbreeding or an artifact of your crappy sampling scheme.  This is why we do base all of our inferences on a single parameter from a single analysis.  What you do consider is the totality of the data in light of what you know about the biology of the organism you are studying.  Is it reasonable that the spatial (or ecological) scale of your sample scheme could in fact encompass two completely isolated groups?  This is why biologists must know their study systems and we cannot just go out and blindly estimate parameters.  

The magnitude of the estimated $F$ is estimable and it is important to see how it works as we will return to this later when we look at population subdivision.  In our example, both population had equal numbers of individuals but deviated from each other in their allele frequencies. The rate of heterozygosity in our populations were:

\[
\bar{Q} = \frac{2p_Aq_A + 2p_Bq_B}{2}
\]

which simplifies to:
 
\[
\begin{aligned} 
\bar{Q} &= p_Aq_A + p_Bq_B \\
&= p_A(1-p_A) + p_B(1-p_B)
\end{aligned}
\]

Now this will always be smaller than 

\[
2\bar{p}\bar{q} = 2\left[\frac{1}{K}\sum_{i=1}^K p_i\right]\left[\frac{1}{K}\sum_{i=1}^K q_i\right]
\]

unless $p_A = p_B$.

In Edwards (2000) text, Foundations in Mathematical Genetics, he uses a ternary plot framework to describe this effect. A ternary plot is a way to visualize data corresponding to three different axes but displayed in two space, under the constraint that the frequencies in the three groups sum to unity.  In our case, we can use the genotype frequencies as the axes, one each for $E[AA]$, $E[Aa]$, and $E[aa]$.  Populations in Hardy-Weinberg equilibrium will fall on the line representing $p^2$, $2pq$, and $q^2$.  Here is how to visualize this in R.

```{r message=FALSE, warning=FALSE, fig.cap="A ternary plot, or de Finetti diagram after Edwards (2000), representing expected genotype frequencies under Hardy Weinberg equilibrium (line) along with the points for the two example populations (in red)."}
library(ggtern)
p <- seq(0,1,by=0.01)
q <- 1-p
df <- data.frame( AA=p^2,
                  Aa=2*p*q,
                  aa=q^2)
df.pop <- data.frame( AA=c(.49,.16),
                      Aa=c(0.42,0.48), 
                      aa=c(0.09,0.36),
                      pop=c("Pop-A","Pop-B"))
ggtern(df,aes(x=aa,y=Aa,z=AA)) + geom_line() + 
  geom_point(data=df.pop,color="red",size=3) + 
  geom_text(aes(x=aa-0.02,y=Aa+(0.1),z=AA,label=pop),data=df.pop,size=4) 
```

When your sampling design combines both of these populations into a synthetic population, you change the allele frequencies to $\bar{p}$ and $\bar{q}$.  However, since the change in expected genotype frequencies is not a linear change (notice how the expectation line in the figure curves) then the average allele frequencies produce an expectation of $2\bar{p}\bar{q}$ greater than the average of the observed heterozygosities.  This is demonstrated in the following image, the observed level of heterozygosity in the synthetic population, being the average of the two (where the dashed line intersects the expectation given for $\bar{q}$), is always going to be less than then the expected heterozygosity because the product of an average is always going to be less than the average of a product.

```{r echo=FALSE}
knitr::include_graphics("media/wahlundTernary.png")
```


The take home message here that for even the most basic of parameters, we need to be careful in our interpretations.  It is the totality of evidence, from many sources, that helps us understand the dynamics of the species we examine. 


<!--chapter:end:wahlund_effects.rmd-->

# (PART) Parent Offspring Data {-}

<!--chapter:end:part_parent_offspring.rmd-->

# Pedigrees  {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_trees.jpg"></div>

Pedigrees are visual representations of genetic relations.  They are extremely important for estimating inbreeding and following traits or diseases in lineages.  In this lecture, we will learn:

1. Symbology used in pedigrees
2. Estimation of inbreeding *F* from a given pedigree
3. Approaches for plotting pedigrees

In the analysis of pedigrees, we often use the following terminology:

- **Consanguineous mating** (*lit.* "Of the same blood").  The condition where individuals who are related produce offspring.
- **Biparental Inbreeding** (see consanguineous mating).

A pedigree is simply a graphical representation characterizing the relationship amongst ancestral individuals.  This tool is very helpful for understanding the process of inbreeding when it occurs during a handful of generations.

Perhaps the most studied pedigree is that of the Hapsburg Dynasty, dating from 1516-1700, which controlled most of what we call modern Europe.  Some of the European Royalty were interbred to maintain familial relationships and consolidate power.  Examples include:

King                   | Queen                        |  Consanguineous Marriage
-----------------------|------------------------------|----------------------------
Philip I (1478-1506)   | Joanna I of Castile & Aragon | Third cousins 
Charles I (1500-1558)  | Isabella of Portugal          | First cousins
Philip II (1527-1598)  | Mary of Portugal             | Double first cousins
$\;$                   | Mary I of England            | First cousins once removed
$\;$                   | Anna of Austria              | Uncle-Niece
Philip III (1578-1621) | Margaret of Austria          | First cousins once removed
Philip IV (1605-1665)  | Elizabeth of Bourbon         | Third cousins
$\;$                   | Mariana of Austria           | Uncle-Niece
Charles II (1661-1700) | Marie Louise d'Orleans       | Second cousins


![Hapsburg Dynasty | Piliip I ($F=0.025$) $\to$ Charles II ($F=0.254$)](media/HapsburgPedigree.png)
```{r echo=FALSE,fig.cap="A pedigree of the Hapsburg Dynasty.  This pattern of mating resulted in individual inbreeding going from $F=0.025$ for Philip I to $F=0.254$ for Charles II."}
knitr::include_graphics("./media/HapsburgPedigree.png")
```

In this lineage, there were two genetic disorders that became paramount:  

1. Pituitary hormone deficiency  
2. Distal renal tubular acidosis


Pedigree Symbology

```{r echo=FALSE}
knitr::include_graphics("media/pedigree1.png")
```


1. Each row is a generation.
2. Lines within a generation represent mating events that result in offspring.
3. Lines between generations represent descent.
4. Individuals are labeled uniquely.
5. Sex is indicated by shape (square=male, circle=female, diamond=unknown).
6. Traits can be mapped onto the pedigree using additional colors & symbols


```{r echo=FALSE}
knitr::include_graphics("./media/pedigree0.png")
```


Extraneous individuals may be removed from the depiction.  Why is it that **C** and **F** are not shown on the pedigree on the right side?


Estimating Inbreeding From Pedigrees

The inbreeding coefficient, *F*, of an **individual** is determined by the probability that a pair of alleles carried by gametes are IBD from a recent common ancestor.

```{r echo=FALSE}
knitr::include_graphics("./media/pedigree2.png")
```


- **F** has same grandfather on both sides.   
- **D** and **E** are half-sibs sharing father **B**.   
- **B** passed one of his alleles, say $A_1$, to both **D** and **B**.
- **D** passed allele $A_1$ to **F** and **E** passed allele $A_1$ to **F**, which means
- **F** has some non-zero probability of being autozygous


Expectations for $F$ in a pedigree.


```{r echo=FALSE}
knitr::include_graphics("./media/pedigree2.png")
```


- Label alleles in **B** as $\alpha$ and $\beta$.   
- The alleles **B** gives to **D** & **E** are $\{\alpha, \alpha\}$, $\{\alpha, \beta\}$, $\{\beta, \alpha\}$, or $\{\beta, \beta\}$   
- Each potential allele pair occurs at: $\left(\frac{1}{2}\right)^2 = \frac{1}{4}$   
- If **B** gives $\{\alpha,\alpha\}$ or $\{\beta, \beta\}$ then the two alleles in the children of **B** are autozygous (by definition).   
- If **B** gives $\{\alpha,\beta\}$ or $\{\beta, \alpha\}$ then the two alleles in the children of **B** are autozygous *only if* the alleles $\alpha$ and $\beta$ are autozygous.   
- The probability of $\alpha$ and $\beta$ being autozygous in **B** is given by the inbreeding coefficient $F$ for **B** (denoted $F_B$).


## Expectations for F in a pedigree.

```{r echo=FALSE}
knitr::include_graphics("./media/pedigree2.png")
```


Transition probabilities for all cases of alleles within the children of **B** are:    

- $P({\alpha,\alpha}|\mathbf{B}) = \frac{1}{4}$   
- $P({\alpha,\beta}|\mathbf{B}) = \frac{1}{4}F_A$   
- $P({\beta,\alpha}|\mathbf{B}) = \frac{1}{4}F_A$   
- $P({\alpha,\alpha}|\mathbf{B}) = \frac{1}{4}$ 


```{r echo=FALSE}
knitr::include_graphics("./media/pedigree2.png")
```


![pedigree2](media/pedigree2.png)


Total probability for **B** donating autozygous alleles to his offspring are then:   
\[
 = \frac{1}{4} + \frac{1}{4}F_A + \frac{1}{4}F_A + \frac{1}{4} \\
 = \frac{1}{2} + \frac{1}{2}F_A \\
 = \frac{1}{2}(1.0 + F_A) \\
\]


Expectations for $F$ in a pedigree.

```{r echo=FALSE}
knitr::include_graphics("./media/pedigree2.png")
```


- The transition probabilities for the alleles that **D** & **E** received from **B** and passes on to **F** are $\frac{1}{2}$ for each.   
- The total probability that the two alleles in **F** are autozygous (Identical By Decent; IBD) is:   
\[
F = \frac{1}{2}(1.0 + F_A) \frac{1}{2} \frac{1}{2} \\
   = \left( \frac{1}{2} \right)^3 (1.0 + F_A)
\]


Generalized Formula for Inbreeding | Chain Counting Method

```{r echo=FALSE}
knitr::include_graphics("./media/pedigree2.png")
```

\[
F  = \left( \frac{1}{2} \right)^i (1.0 + F_A)
\]
where the exponent $i$ is the length of the *ancestral loop* (e.g., the number of steps starting at **F** and going trough the common ancestor and back to **F**) which in this case is **D<u>B</u>E** (the common ancestor is underlined).


Several Ancestral Loops

```{r echo=FALSE}
knitr::include_graphics("./media/pedigree4.png")
```


If there is more than one ancestral loop, then the final inbreeding coefficient, *F*, is the sum of the *F* estimated across each loop (assuming there are $K$ different loops in the pedigree).  In this example you would use both **GD<u>A</u>EH** and **GD<u>B</u>EH**.   
\[
F  = \sum_{k=1}^{K} \left[ \left( \frac{1}{2} \right)^{i_k} (1.0 + F_k) \right]
\]

N.B. Each ancestor may have different $F_k$ values.  


Strategies for Pedigrees
 
Here is a basic strategy for solving pedigree questions. 
 
1. Draw the pedigree    
2. Identify common ancestor(s)    
3. Trace ancestral loop(s)    
4. Annotate all ancestral loops indicating the common ancestor    
5. The length of each ancestral loop is $i$    
6. Plug into the equation for inbreeding statistic *F* for each loop.    
7. Final *F* is sum of all estimated *F* from each loop.


Easy Example, F=?

```{r chunk-pediree, message=FALSE}
library(kinship2)
id    <- LETTERS[1:6]
dadid <- c(NA,NA,NA,"B","B","D")
momid <- c(NA,NA,NA,"A","C","E")
sex   <- c("female","male","female","male","female","female")
ped <- pedigree( id, dadid, momid, sex)
ped
plot(ped)
```

- Ancestral loop: **D<u>B</u>E**, 
- $i = 3$, 
- $F_F = \left( \frac{1}{2} \right)^3(1+F_A)$.


Easy Example, <font class="alert">F=?</font>

```{r, message=FALSE}
library(kinship2)
id    <- LETTERS[1:9]
momid <- c(NA,NA,NA,"B","B",NA,"C","E","H")
dadid <- c(NA,NA,NA,"A","A",NA,"D","F","G")
sex   <- c("male","female","female","male","female","male","male","female","female")
ped <- pedigree( id, dadid, momid, sex)
plot(ped)
```

- Ancestral loop: **GD<u>A</u>EH**, **GD<u>B</u>EH**, $F_I = \left[ \left( \frac{1}{2} \right)^5(1+F_A) \right] + \left[ \left( \frac{1}{2} \right)^5(1+F_B) \right]$.


Medium Example, <font class="alert">F=?</font>

```{r, warning=FALSE}
id    <- LETTERS[1:8]
dadid <- c(NA,NA,"A","A",NA,"D","D","F")
momid <- c(NA,NA,"B","B",NA,"C","E","G")
sex   <- c("male","female","female","male","female","male","female","unknown")
ped <- pedigree( id, dadid, momid, sex)
suppressWarnings(plot(ped))
```

- Ancestral loop: **GD<u>A</u>EH**, **GD<u>B</u>EH**, $F_I = \left[ \left( \frac{1}{2} \right)^5(1+F_A) \right] + \left[ \left( \frac{1}{2} \right)^5(1+F_B) \right]$.


## Drawing Pedigrees


Example Pedigree

```{r,message=FALSE}
library(kinship2)
id    <- LETTERS[1:5]
dadid <- c(NA,NA,"A","A","A")
momid <- c(NA,NA,"B","B","B")
sex   <- c("male","female","male","female","female")
data <- data.frame( id, dadid,momid,sex)
data
```


Example Pedigree 

```{r}
ped <- pedigree(data$id, data$dadid, data$momid, data$sex)
ped
summary(ped)
```


Example Pedigree

```{r}
plot.pedigree(ped)
```


![Pedigree3](media/pedigree3.png)


Decay of $F$, The effects of size in the ancestral loop

```{r message=FALSE}
library(ggplot2)
df <- data.frame( i=seq(4,20,by=2) )
df$F <- 0.5^df$i
ggplot( df, aes(x=i,y=F) ) + geom_line(color="red") 
```


## Skills
  
This lecture covered the creation and analysis of pedigree data.  At the end of this lecture you should be able to:

1. first skill
2. second skill


```{r}
id    <- LETTERS[1:9]
momid <- c(NA,NA,NA,"B","B",NA,"C","E","H")
dadid <- c(NA,NA,NA,"A","A",NA,"D","F","G")
sex   <- c("male","female","female","male","female","male","male","female","female")
brown_eyes <- c(1,0,0,1,1,NA,1,1,0)
college <- c(0,1,0,0,0,0,1,1,1)
likes_asparagus <- c(0,1,0,0,1,1,0,1,1)
traits <- cbind( brown_eyes, college, likes_asparagus)

ped <- pedigree( id, dadid, momid, sex, affected = traits)
plot(ped)
pedigree.legend( ped, location="bottomleft",radius=.2)
```


```{r}
library(gstudio)
p1 <- c("Ai","Aj")
p2 <- c("Ak","Al")
offs <- c(  locus( c(p1[1],p2[1]) ), 
            locus( c(p1[2],p2[1]) ), 
            locus( c(p1[1],p2[2]) ),
            locus( c(p1[2],p2[2]) ) )
offs
```


Relationship     | $k_0$ | $k_1$ | $k_2$ | $r$ 
-----------------|-------|-------|-------|-----
Identical Twins  | 0     | 0     | 1     | 1   
Full Sibs        | 0.25  | 0.5   | 0.25  | 0.5 
Parent Offspring | 0     | 1     | 0     | 0.5 
Half Sib         | 0.5   | 0.5   | 0     | 0.25 
Aunt-nephew      | 0.5   | 0.5   | 0     | 0.25 
First Cousin     | 0.75  | 0.25  | 0     | 0.125 
Unrelated        | 1     | 0     | 0     | 0 

sdf


```{r}
ped <- pedigree( id, dadid, momid, sex, affected = traits)
plot(ped)
pedigree.legend( ped, location="bottomleft",radius=.2)
kinship(ped)
```


<!--chapter:end:pedigrees.rmd-->

# Parentage Analyses   {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_butterfly.jpg"></div>


```{r echo=FALSE, message=FALSE, warning=FALSE}
library(knitr)
library(ggplot2)
library(gstudio)
```


The analysis of parental and offspring data 

1. Parentage-type analyses.  
2. Analyses based upon multiple paternity


Parentage analyses are used in a broad range of studies:

* Identical vs fraternal twins  
* Agricultural crop line differentiation  
* Differentiate between livestock/dog/cat breeds  
* Pathogenic strain identification (e.g., Hep-C strains *a-e*)  
* Assigning parentage to individuals  
* Assigning individuals to populations  
* Identifying the source of unknown tissues (e.g., gettin' perps)

```{r echo=FALSE}
knitr::include_graphics("media/paternity.png")
```


A statistical approach for identifying the parent(s) of a particular individual.  This requires:

1. A set of genetic markers that are bi-parentally inherited
2. Variation in these markers
3. Some assumptions about the *prior* probability of the union of parents.

## Paternity vs. Maternity

For single parent parentage analysis, it is either paternity or maternity that is being established.  Here we assume that the other parent is definitely the biological parent of the individual (e.g., $P(prior)=1$).  This can be because:

1. The offspring was collected from the identified parent
2. There is other evidence that points to the identified parent being the biological one.

The goal then is to determine who the unidentified parent is with some level of statistical probability.


Terms Used in Parentage 

The following terms are commonly used in parentage analyses:  

**Extra-Pair Paternity** - Fertilization resulting from copulation outside a recognized pair bond.  

**Multiple Paternity** - Offspring produced from mating events with different sets of individuals.  

**Paternity/Maternity Exclusion** - Excluding an individual based upon an in-congruence in observed genetic data.


## Probability of Exclusion

Mother              | Offspring                 | Excluded Dads       | Probability
--------------------|---------------------------|---------------------|--------------
$A_1A_1\;(p^2_1)$   | $A_1A_1\;(p_1)$            | $A_2A_2\;(p_2^2)$ 	| $p_1^3p_2^2$
$A_1A_1\;(p^2_1)$   | $A_1A_2\;(p_2)$           | $A_1A_1\;(p_1^2)$ 	| $p_1^4p_2$
$A_1A_1\;(p^2_1)$   | $A_2A_2\;(0)$             | -                   |    -
$A_1A_2\;(2p_1p_2)$ | $A_1A_1\;(\frac{p_1}{2})$ | $A_2A_2\;(p_2^2)$ 	| $p_1^2p_2^3$
$A_1A_2\;(2p_1p_2)$ | $A_1A_2\;(\frac{1}{2})$  	| -                 	| -        
$A_1A_2\;(2p_1p_2)$ | $A_2A_2\;(\frac{p_2}{2})$	| $A_1A_1\;(p_1^2)$ 	| $p_1^3p_2^2$
$A_2A_2\;(p^2_2)$ 	| $A_1A_1\;(0)$             | -                   | -
$A_2A_2\;(p^2_2)$   | $A_1A_2\;(p_1)$           | $A_2A_2\;(p_2^2)$   | $p_1p_2^4$
$A_2A_2\;(p^2_2)$   | $A_2A_2\;(p_2)$           | $A_1A_1\;(p_1^2)$   | $p_1^2p_2^3$


Probability of Exclusion

\[
P_{exc} = p_1^3p_2^2 + p_1^4p_2 +p_1^2p_2^3 +p_1^3p_2^2 +p_1p_2^4 +p_1^2p_2^3 \\
\]

which when simplified down a bit becomes

\[
P_{exc} = p_1p_2(1-p_1p_2)
\]


Single Locus Paternity Exclusion

```{r chunk-parentage}
p <- seq(0,1,by=0.02)
q <- 1-p
Pexcl <- p*q*(1-p*q)
plot(Pexcl ~ p, xlab="Allele frequency, p", ylab="Paternity Exclusion Probability")
```


Multilocus Exclusion

Exclusion probabilities are multiplicative properties.

\[
P_{excl} = 1 - \prod_{i=1}^\ell(1-P_{excl,\;i})
\]


Example

```{r}
p <- rep( 0.5*0.5*(1-0.5*0.5), 5 )
p
ptot <- 1
for( i in 1:length(p))
  ptot <- ptot * (1-p[i])
1-ptot
```


Mother              | Offspring                     | Excluded Father $A_xA_y$        | Probability of Exclusion  
--------------------|-------------------------------|---------------------------------|------------------------------------
$A_iA_i\;(p_i^2)$   | $A_iA_i\;(p_i)$               | $x,y\ne i\;\; (1-p_i)^2$        | $p_i^3(1-p_i)^2$ 
$A_iA_i\;(p_i^2)$   | $A_iA_j\;(p_j)$               | $x,y\ne j\;\; (1-p_j)^2$        | $p_i^2p_j(1-p_i)^2$  
$A_iA_j\;(2p_ip_j)$ | $A_iA_i\;(\frac{p_i}{2})$     | $x,y \ne i\;\; (1-p_i)^2$       | $p_i^2p_j(1-p_i)^2$ 
$A_iA_j\;(2p_ip_j)$ | $A_iA_j\;(\frac{p_i+p_j}{2})$ | $x,y \ne i,j\;\; (1-p_i-p_j)^2$ | $p_ip_j(p_i+p_j)(1-p_i-p_j)^2$  
$A_iA_j\;(2p_ip_j)$ | $A_iA_k\;(\frac{p_k}{2})$     | $x,y \ne k\;\; (1-p_k)^2$       | $p_ip_jp_k(1-p_k)^2$ 
$A_iA_j\;(2p_ip_j)$ | $A_jA_k\;(\frac{p_k}{2})$     | $x,y \ne k\;\; (1-p_k)^2$       | $p_ip_jp_k(1-p_k)^2$  
$A_iA_j\;(2p_ip_j)$ | $A_jA_j\;(\frac{p_j}{2})$     | $x,y \ne j\;\; (1-p_j)^2$       | $p_ip_j^2(1-p_j)^2$  

## Paternity Exclusion


Likelihood Ratios

A likelihood ratio is given by:

\[
LR = \frac{H_P}{H_D}
\]

where the $H_X$ values are the hypotheses probabilities.

Nomenclature For Parentage

Individual           |  Identifier  | Genotype 
---------------------|--------------|------------
Female Parent 			 | $FP_i$       | $\alpha_i$ 
Putative Male Parent | $MP_j$       | $\beta_j$ 
Offspring 				   | $O_k$        | $\gamma_k$ 

$\;$

**Paternal Probability** The posterior odds of paternity versus non-paternity given the totality of genetic information.


Likelihood Ratios | Genetic Equivalences

The likelihood of one hypothesis, $H_1$ relative to another $H_2$ is:

\[
L(H_1,H_2|D) = \frac{P(D|H_1)}{P(D|H_2)}
\]

where

\[
P(D|H) = T(\gamma | \alpha, \beta)P(\alpha)P(\beta)
\]


Assuming $H_1:$ states that $\beta$ is the real father of $\gamma$ on $\alpha$ and $H_2:$ states that he is just a random individual in the population is:

\[
L(H_1,H_2|\alpha,\beta,\gamma)  = \frac{P(D|\alpha,\beta,\gamma)}{P(D|\alpha,\gamma)}  
\]

which can be simplified to:

\[
\lambda_j = \frac{P(\alpha_i,\beta_j,\gamma_k|\mathrm{paternity})}{P(\alpha_i,\beta_j,\gamma_k|\mathrm{non-paternity})} \\ 
  		    = \frac{P(\beta_j)P(\alpha_i)T(\gamma_k|\alpha_i,\beta_j)}{P(\beta_j)P(\alpha_i)T(\gamma_k|\alpha_i)} \\ 
				  = \frac{T(\gamma_k|\alpha_i,\beta_j)}{T(\gamma_k|\alpha_i)}
\]

where $T(X|Y)$ is the Mendelian transition probability of offspring $X$ given parent $Y$.


Assumptions in Model of Paternity Likelihood

The basic paternity exclusion model assumes:

1. Completely random mating  (can be modified by changin *priors*)
2. Independent assortment of alleles


Likelihood Example

Consider the maternal individual whose genotypes are:
\[
FP = \{AA,\;Bb,\;CC,\;Dd\}
\]

Whose $i^{th}$ offspring has the genotypes:

\[
O_i = \{AA,\;BB,\;Cc,\;dd\}
\]	


Likelihood Example | $T(O|FP)$

The transition probability, $T(O|FP)$, is then:
	
Individual | Locus1 | Locus2 | Locus3 | Locus4
-----------|--------|--------|--------|--------
$FP$  	   | $AA$   | $Bb$   | $CC$   | $Dd$ 
$O_i$ 	   | $AA$   | $BB$   | $Cc$   | $dd$ 


$T(O|FP) = 1*0.5*1*0.5 = 0.25$


Likelihood Example | Putative Male Parents

  
Individual | Locus1 | Locus2 | Locus3 | Locus4
-----------|--------|--------|--------|--------
$MP_1$     | $Aa$   | $BB$   | $cc$   | $Dd$ 
$MP_2$ 	   | $AA$   | $BB$   | $Cc$   | $dd$ 

$\;$

Which one of the potential fathers is the <font class="alert">most likely</font> parent?


Likelihood Example | First Putative Father


Individual   | Locus1 | Locus2 | Locus3 | Locus4
-------------|--------|--------|--------|--------
$FP$  	     | $AA$   | $Bb$   | $CC$   | $Dd$ 
$MP_1$       | $Aa$   | $BB$   | $cc$   | $Dd$ 
$O_i$ 	     | $AA$   | $BB$   | $Cc$   | $dd$ 
$T(O|FP,MP)$ | 0.5    | 0.5    | 1.0    |  0.25

\[
T(O_1|FP,MP_1) = 0.5 * 0.5 * 1.0 * 0.25 = 0.0625
\]

And

\[
\lambda_1 = \frac{T(O_i|FP,MP_1)}{T(O_i|FP)} = \frac{0.0625}{0.25} = 0.25
\]


 Likelihood Example | Second Putative Father


Individual   | Locus1 | Locus2 | Locus3 | Locus4
-------------|--------|--------|--------|--------
$FP$         | $AA$   | $Bb$   | $CC$   | $Dd$ 
$MP_2$       | $AA$   | $BB$   | $Cc$   | $dd$ 
$O_i$ 	     | $AA$   | $BB$   | $Cc$   | $dd$ 
$T(O|FP,MP)$ | 1.0    | 0.5    | 0.5    |  0.5

\[
T(O_1|FP,MP_2) = 1.0 * 0.5 * 0.5 * 0.5 = 0.125
\]

And

\[
\lambda_2 = \frac{T(O_i|FP,MP_1)}{T(O_i|FP)} = \frac{0.125}{0.25} = 0.5
\]

 Likelihood Example | Interpretation of Results {.build}

Most likely parent is $MP_2$ because $\lambda_2 = 0.5 > \lambda_1 = 0.25$.

$\;$

Does this mean that $MP_2$ is the <font class="alert">real</font> parent?


Likelihood Example | In Class Exercise - Whose the daddies?

Individual  | Locus 1  | Locus 2  | Locus 3 
------------|----------|----------|---------  
Mother      | $A_1A_1$ | $B_1B_3$ | $C_1C_1$    
Offspring 1 | $A_1A_2$ | $B_1B_3$ | $C_1C_2$  
Offspring 2 | $A_1A_1$ | $B_3B_3$ | $C_1C_1$  
Offspring 3 | $A_1A_1$ | $B_1B_1$ | $C_1C_1$  
Dad 1       | $A_1A_2$ | $B_2B_3$ | $C_1C_1$  
Dad 2       | $A_2A_2$ | $B_1B_1$ | $C_1C_2$  
Dad 3       | $A_1A_1$ | $B_2B_3$ | $C_1C_2$  
Dad 4       | $A_1A_1$ | $B_1B_1$ | $C_2C_2$  

    
```{r}
library(gstudio)
loci <- c("Locus-A","Locus-B","Locus-C","Locus-D")
freqs <- data.frame(Locus = rep(loci, each = 4), 
                    Allele = rep(LETTERS[1:4], times = 4), 
                    Frequency = 0.25)
freqs
```


```{r}
adults <- make_population( freqs, N=100 )
adults$OffID <- 0
adults <- adults[ , c(1,6,2:5)]
adults[1:5,]
```


```{r}
offs <- data.frame()
mom <- adults[1,]
for( i in 1:20){
  dad_id <- runif( 1, min=2, max=100)
  dad <- adults[dad_id,]
  off <- mate( mom, dad, N=1 )
  offs <- rbind( offs, off )
}
offs$OffID <- 1:20
offs[1:5,]
```


```{r}
data <- rbind( adults, offs )
data <- data[ order(data$ID,data$OffID),]
rownames(data) <- 1:nrow(data)
data[1:10,]
```


```{r}
f <- frequencies( data[ data$OffID==0,] )
excl <- exclusion_probability( f )
excl
p <- excl$Pexcl
excl_multilocus <- 1 - prod( 1-p )
excl_multilocus
```


```{r}
family <- data[ data$ID==1, ]
minus_mom( family )
```


```{r}
dads <- adults[2:100,]
mom <- adults[1,]
off <- offs[1,]
for( i in 1:nrow(dads)){
  dad <- dads[i,]
  T <- transition_probability(off,mom,dad)
  if( T > 0 )
    cat("Father",i,"may be the real father (T =",T,")\n")
}
```


## Fractional Paternity

In cases where we have more than one putative father, we may want to get an idea of the relative strength of our inferences by comparing the likelihood ratios for all dads.  

1. We may use arbitrary cut-offs, or  
2. We may use all non-excluded dads, but weighted by their fractional contributions


Conditional Probability 

**Problem:** We have several putative fathers ($MP_i, MP_j, MP_k, ... , MP_m$) have been found to have non-zero likelihoods of paternity.

$\;$

**Question:** What is the relative likelihood of paternity given these putative fathers?


Conditional Probability 

Conditional probability determines the likelihood of an event (paternal likelihood) <font class="alert">given</font> that some other event has already happened (not excluded as a potential father).

\[
P(MP=j^*|FP=i,O=k) = \frac{P(O=k|FP=i,MP=j^*)P(MP=j^*|FP=i)}{\sum_{\forall j}P(O=k|FP=i,MP=j^*)P(MP=j^*|FP=i)}
\]

If we can assume that $P(MP=j|FP=i) = c$ (e.g., the frequencies of the female and male parents are constant with respect to the individual offspring being considered) then,

\[
P(MP=j^*|FP=i,O=k) = \frac{T(\gamma_k|\alpha_i,\beta_j^*)}{\sum_{\forall k}T(\gamma_k|\alpha_i,\beta_k^*)}
\]


Fractional Paternity 

Some things to consider when using fractional analyses for paternity.

1. Not usually used in human studies.  
2. Can be considered a *prior probability of paternity*.  
3. Can include ecological, spatial, evolutionary components such as differential attractiveness, pollen fertility, output, etc.
4. Possible tautology

Every potential father is assigned paternity, the fraction of $X_{ik}$ on a particular $MP_j$ is proportional to the likelihood ratio.


```{r}
frac <- paternity( offs, mom, dads )
summary(frac)
frac[1:10,]
```


```{r}
t <- table(frac$OffID)
t
N_pexcl <- nrow(dads)*(1-excl_multilocus)
fit <- t.test(as.numeric(t),mu = N_pexcl)
fit
```


Maternity Analysis

Putative father identified by mother unknown.

\[
\lambda_i = \frac{P(\gamma|\beta)P(\beta)}{P(\gamma)P{\beta})} \\
  		    = \frac{P(\gamma|\beta)}{P(\gamma)} 
\]

where $P(\gamma)$ is the frequency of the offspring genotype in the population.  All other things are the same.


Cryptic Gene Flow

Consider the case where:

1. You have identified a set of offspring collected from mothers.
2. Identified a set of fathers that are probabilistically sires of the offspring.


## Dispersal Kernels

Estimating the Disperal Distribution

Once a collection of paternity estimates have been determined, you can use them to estimate a *dispersal kernel*, describing the probability of paternity as a function of distance from the maternal individual.

```{r message=FALSE,echo=FALSE,fig.align='center'}
dist <- data.frame( Distance=rchisq(10000,df = 3))
ggplot(dist, aes(x=Distance)) + geom_density(fill="#99d8c9") + xlab("Physical Distance") + ylab("Probability of Paternity") 
```


Dispersal Kernels | Distributions

The <font class="alert">form</font> of the distribution is critical for estimation.  It determines:

> 1. The shape of the distribution  
> 2. The variance of the distribution
> 3. Quantitative estimates and hypotheses you get from the data


Example Kernel Distribution Families

*Normal Family*

\[
p(a|x,y) = \frac{1}{\pi a^2}exp\left[-\left( \frac{r}{a} \right)^2\right]
\]

where $r = \sqrt{ x^2 + y^2}$ and $a = \sigma \sqrt{2}$.

$\;$

This produces a *thin tailed* distribution.


Example Kernel Distribution Families

*Exponential Family*

\[
p(a,b|x,y) = \frac{b}{2\pi a^2 \Gamma(2/b)}exp\left[ -\left( \frac{r}{a} \right)^b \right]
\]

where $\Gamma(a,b)$ is the gamma function and $b$ is a 'shape' parameter.  

1. When $b=1$ This is the exponential distribution.  
2. When $b=2$ this is the normal function.  
3. When $b<1$ this is a fat-tailed distribution.


Example Kernel Distribution Families

Other distributions you may run across include:

1. The Geometric distribution,  
2. The Weibull family of distributions,   
3. The 2Dt family of distributions.


```{r}
r1 <- abs(rnorm(10000))
r2 <- rexp(10000)
df <- data.frame( Distribution=c(rep(c("Normal","Exponential"),each=10000)), Value=c(r1,r2))
library(ggplot2)
ggplot(df,aes(x=Value,fill=Distribution)) + geom_density(alpha=0.75) + theme_bw() + ylab("Frequency")
```


Concerns with kernel estimation

The following are some assumptions that are inherent in the use of dispersal kernels for estimating connectivity.

1. All functions are continuous, 
2. All functions assume isotropy in dispersal,   
3. All functions explicitly assume homogeneity of the dispersal matrix.


Skills
  
In this lecture we covered some rather simple parent/offspring relationships and how we can analyze them.  Specifically, you should be comfortable with:

1. Understanding the qualities of loci that make for more powerful parentage analyses.  
2. Be able to estimate single and multilocus exclusion probabilities and understand what they mean.  
3. Estimate likelihood ratios for paternity given Mother, Offspring, and Putative Male Parent.  
4. Use fractional paternity and understand conditional probability and how it applies to parentage.
5. Understand dispersal kernel estimation.

<!--chapter:end:parentage.rmd-->

# (PART) Subdivided Populations {-}

<!--chapter:end:part_subdivided_populations.rmd-->

# Mapping Populations {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_lake.jpg"></div>

Almost every data set we will work on has a spatial context.  Until relatively recently, this was largely ignored and at most given as a 2-dimensional projection of sampling locations on a black and white map in our manuscripts.  Here we explore methodologies to provide more intuitive approaches for plotting the spatial location of our sampling sites.

## Procedure

First, we need to load in the proper libraries. Each of the libraries are needed for:

 - **gstudio** - Contains the *A. attenuatus* data set and there are built-in routines that allow you to plot populations in various formats or export the data in a format for plotting in other systems (e.g., GoogleEarth, ArcGIS, etc).
 - **ggmap** - Allows the plotting of tiles from a googleMap.
 - **leaflet** - Loads in the javascript leaflet library and allows the interactive visualization.


## Plotting Population Locations

We will start by loading in the default data set for `gstudio` and using methods in that package for producing spatial plots.

```{r,message=FALSE,warning=FALSE}
# load in the libraries and grab the data
library(gstudio)
library(ggmap)
data(arapat)

# pull out the coordinates - Population is the default stratums
coords <- strata_coordinates(arapat)
coords$Stratum <- as.character( coords$Stratum )
coords[1:5,]

# make a map (grabs from google) and plot it
map <- population_map(coords)
ggmap( map ) + geom_point( aes(x=Longitude,y=Latitude),data=coords)
```

We can extend this basic plotting by adding text to the plots indicating the population names (here I offset the coordinates by 0.1 & 0.12 degrees—a cheap trick but it works pretty well here).

```{r,warning=FALSE}
p <- ggmap( map ) + geom_point( aes(x=Longitude,y=Latitude),data=coords) 
p + geom_text( aes(x=Longitude+0.1,y=Latitude+0.12,label=Stratum), data=coords)
```


## Population Allele Frequencies

It is also possible to plot the data in a spatial context.  Here is an example of how to mix `ggplot()` and `ggmap()` data and I'll plot the locations as proportional in size to the allele frequency.

```{r}
# Break out allele frequencies for LTRS by population 
f <- frequencies(arapat,loci="LTRS",stratum="Population")

# lets only use the allele 01 
f <- f[ f$Allele=="01",]

# here are what the data look like
f[1:10,]
```

and then use those frequencies as image size.

```{r warning=FALSE, message=FALSE}
# merget the frequencies into the coordinate data.frame
coords <- merge( coords, f[,c(1,4)])
ggmap( map ) + geom_point( aes(x=Longitude,y=Latitude,size=Frequency),data=coords)
```

or for observed effective allele size

```{r warning=FALSE, message=FALSE}
# grab effect for LTRS locus
ae <- genetic_diversity(arapat, stratum = "Population", mode="Ae")
ae <- ae[ ae$Locus=="LTRS",]

# merge in the data
coords <- merge( coords, ae[,c(1,3)])
ggmap( map ) + geom_point( aes(x=Longitude,y=Latitude,size=Ae),data=coords)
```

There is also the option to make use of some pie charts.  I know, pie charts suck and any statistician will tell you that they should probably not be used because they can be misleading, but here they are.  For exploratory data analysis, they can be very insightful at times.  Here is the frequency of alleles at the *enolase* locus in *Araptus*.  Any spatial structuring catch your eye?


```{r,warning=FALSE,message=FALSE,fig.width=7,fig.height=7, eval=FALSE}
pies_on_map( arapat, stratum="Population", locus="EN")
```

Which will open a new browser window and produce a graph like the one below.

<iframe width="640" height="480" src="media/pies_on_map.html" frameborder="0" allowfullscreen></iframe>

Note the messages about the approximation.  This is because the google maps API has an integer for zoom factor and at times it is not able to get all the points into the field of view using an integer zoom.  If this happens to you, you can manually specify the *zoom* as an optional argument to either function `pies_on_map()` or `population_map()`.  You also need to be careful with the `pies_on_map()` function because the way it works is that the background tile is plotted and then I plot the pies ontop of it.  If you reshape your plot window outside equal x- and y- coordinates (e.g., make it a non-square figure), the spatial location of the pie charts will move!  This is a very frustrating thing but it has to do with the way viewports are overlain onto graphical objects in R and I have absolutely **no** control over it.  So, the best option is to make the plot square, export it as PNG or TIFF or whaterver, then crop as necessary.


## Dynamic maps

We can also produce dynamic maps for viewing in html output (like ebooks or webpages).  Here I produce a leaflet map and then add markers on to the file using coordinates and put the clicked popup as the population name.  This is an *interactive* map, you can zoom in, etc.

```{r,message=FALSE,warning=FALSE}
library(leaflet)
map <- addTiles( leaflet( data=coords) )
map <- addMarkers( ~Longitude, ~Latitude, popup=~Stratum , map=map)
map
```

You notice the interesting `%>%` operator?  This is short-hand that may be very helpful for you.  It is part of the `magrittr` library, and is often imported through `dplyr` and not directly.  However, what it does is allow us to be a bit more succinct  that does essentially the same thing as 

You can change the underlying base map by passing on other Tiles to the construction.  A list of available tile providers can be found [here](http://leaflet-extras.github.io/leaflet-providers/preview/index.html).

```{r}
map <- leaflet( data=coords )
map <- addProviderTiles("MapQuestOpen.Aerial", map=map)
addMarkers( ~Longitude, ~Latitude, popup=~Stratum, map=map)
```

Or you can add custom images for the map relative to some meaningful information.

```{r}
coords$Size <- log(coords$Ae) * 10 + 1
addCircleMarkers( ~Longitude, ~Latitude, popup=~Stratum, radius=~Size, color="red", fillOpacity=0.75, data = coords, map = map) 
```


In the information that pops out of the marker, you can put a bit more information that may be helpful. Here I estimate some diversity measures (H<sub>O</sub>, H<sub>E</sub>, and A<sub>E</sub>).

```{r}
x <- genetic_diversity(arapat,stratum = "Population",mode="Ho")
x <- x[ x$Locus == "LTRS",]
coords <- merge( coords, x[,c(1,3)])
x <- genetic_diversity(arapat,stratum = "Population",mode="He")
x <- x[ x$Locus == "LTRS",]
coords <- merge( coords, x[,c(1,3)])
x <- genetic_diversity(arapat,stratum = "Population",mode="Ae")
x <- x[ x$Locus == "LTRS",]
coords <- merge( coords, x[,c(1,3)])
```

Make these data into a table format in html (OK it is a bit ugly but it is reasonable.) 

```{r}
coords$label <- paste("<h2>",coords$Stratum,"</h2><table><tr><th>Parameter</th><th>Value</th></tr><tr><td>H<sub>O</sub></td><td>", format(coords$Ho,digits = 3),"</td></tr><tr><td>H<sub>E</sub></td><td>",format(coords$He, digits=3), "</td></tr><tr><td>A<sub>E</sub></td><td>",format(coords$Ae,digits=3),"</td></tr></table>",sep="")

# plot it
map <- addTiles( leaflet( data=coords) )
addCircleMarkers( ~Longitude, ~Latitude, popup=~label, radius=~Size, color="red", fillOpacity=0.75, map = map, data = coords) 
```


### Sankey Networks

A Sankey Diagram is a kind of flow diagram used to indicate allocation of one set of object into two other sets.  This is a complicated set of interactions to display.  In the following example, I'll use the populations from the *A. attenuata* data set and then map each individual onto its STRUCTURE cluster and mtDNA clade.  This plot needs a `data.frame` for nodes and one for edges created in a particular way.


```{r,warning=FALSE, message=FALSE}
 library(networkD3)
pops <- as.character(unique(arapat$Population))
species <- as.character(unique(arapat$Species))
cluster <- as.character(unique(arapat$Cluster))

sankey.nodes <- data.frame( name = c( pops, species, cluster), 
                            stringsAsFactors = FALSE)


# takes vector of values and finds out the number of times
# each combination is found in the pair.
get_sets <- function( source, target ) {
  t <- table(source, target)
  m <- which( t > 0, arr.ind = TRUE)
  suppressWarnings( df <- data.frame( ind=m ) )
  df$source <- rownames(t)[df$ind.source]
  df$target <- colnames(t)[df$ind.target]
  df$value <- sapply(seq(1,dim(df)[1]), 
                     function(x) {
                         return(t[df$ind.source[x],df$ind.target[x]])
                         })
  return( df[,3:5] )
}

# For finding the number of the name instead of the name
nameIndex <- function(vals,names){
  ret <- sapply( vals, function(x) {which(x==names)-1}, USE.NAMES = FALSE)
  return(ret)
}


sankey.edges <- rbind( get_sets(arapat$Species,arapat$Population),
                       get_sets(arapat$Population, arapat$Cluster) )
sankey.edges$targetNum <- nameIndex(sankey.edges$target, sankey.nodes$name)
sankey.edges$sourceNum <- nameIndex(sankey.edges$source, sankey.nodes$name)
```


What it does produce is a very cool and dynamic plot.  The colored boxes represeent the groups--Species & STRUCTURE Cluster on the outside, population on the inside.  Connecting lines indicate individuals who are in each of these partially overlapping groups. For example, if you grab the population *Mat*, you will see it has samples in it who are identified as either mtDNA type *Peninsual* or *Cape* as well as belonging to STRUCTURE groups *CBP-C* and *SCBP-A*.  If you have difficult data such as these, this kind of plot may be very helpful.

```{r}
sankeyNetwork( Links=sankey.edges, Nodes=sankey.nodes,
               Source = "sourceNum", Target = "targetNum",
               Value = "value", NodeID = "name",
               width = 700, fontSize = 12, nodeWidth = 30)
```


<!--chapter:end:mapping_populations.rmd-->

# Population Models {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_baja_south_of_loreto.jpg"></div>

Subdivision occurs when individuals are spatially, ecologically, or temporally separated causing the relative probability of mating among individuals to be non-uniformly spread across all individuals.  This is the natural state for most species—they are not completely randomly mating—and as such this causes changes in how genetic variation is partitioned.  In this Chapter we examine subdivision and how we quantify its strength.

Before we descend into discussions on models describing population subdivision, we should probably clarify some jargon that is used commonly in describing these systems.  

*Migration:* Migration is the movement of individuals among spatial locations.  It does not, strictly speaking, denote any relation to mating.  Bird, mammals, and even insects migrate in response to annual seasonality, though the process of movement towards the equator or more reasonable habitat for the particular season does not necessarily influence population genetic structure.  

*Gene Flow:* Gene flow is a process of genetic material moving between locales or populations that results in modification of standing genetic variation.  Gene flow is commonly denoted as a rate (e.g., a fraction bound between 0 and 1) and has direct influence on the allelic and/or genotypic composition of the population.

That is not to say that migration may result in subsequent mating, contributing to the population structure, it just does not necessarily require it.  

## Models of Partition

If individuals are partitioned into different groups, the expectations for allele and genotype frequencies depend upon the way in which populations are connected through gene flow.  There are some classic models associated with population connectivity that we will cover in this section, though in reality the ways in which populations are actually connected most likely fall outside the strict conventions for these simple models.  What these models do, however, is provide us a with a framework on which to examine observed distributions of genetic variation and to make predictions on future structure given a few assumptions.

A network analogy will be used here to describe the connectivity, with populations acting as nodes in the network and gene flow being indicated by edges connecting these nodes.  Strictly speaking, this is a weighed graph as the edges have associated with them particular numerical values representing the migration rate between the connected edges.  Each population can have its own allele frequency spectra and through time, allele frequencies can change in response to immigration into the population but is not expected to change due to emigration from it, as we are assuming migration likelihood of any individual is statistically independent of allele frequencies.

## Island Mainland Model

An island-mainland model is the simplest formulation of a population model.  Here, we have a large mainland population and a small island population.  For simplicity, define the allele frequencies of these two as $p_x$ and $p_y$.  During each discrete generation, some fraction $m$ of the individuals from the mainland arrive into the mating population of the individuals on the island.  The composition of the island is composed of the $m$ fraction of individuals that are immigrants and the $(1-m)$ fraction that were already on the island.  

```{r echo=FALSE, fig.cap="Basic mainland-island model of population partitions."}
knitr::include_graphics("media/Island_Mainland.png")
```

At the next generation, the allele frequencies of the mainland population remain the same (we are assuming it is large enough such that a loss of migrants does not influence allelic or genotypic frequencies), whereas the island population is comprised of $m$ percent of immigrants, whose frequencies are $p_x$, and $(1-m)$ fraction of residents whose frequency was in the last generation $p_y$.  Taken together, their frequencies are:

\[
p_{y,t+1} = (1-m)p_{y,t} + mp_{x,t}
\] 

From this formulation, it is easy to deduce that after a sufficient number of generations, the allele frequencies of both mainland and island will be the same.  The island will eventually have the same frequencies as the mainland, though the amount of time it takes depends upon the difference in allele frequencies, $\delta p = p_x - p_y$, and the migration rate, $m$, to the island population.  Since there is no migration from the island to the mainland, the equilibrium frequency will be $p_x$.  Here is a bit of code that shows the effects that different migration rates may have.

```{r}
migration_rates <- c(.01,.05,.10,.15)
results <- data.frame(m=rep(migration_rates,each=100), 
                      Generation=rep(1:100,times=4),
                      p=NA)
for( m in migration_rates) {
  px <- 0
  py <- 1
  results$p[ results$m==m ] <- py
  for( t in 2:100){
    p.0 <- results$p[ results$m==m & results$Generation == (t-1) ]
    p.1 <- (1-m)*p.0 + px*m
    results$p[ results$m==m & results$Generation == t ] <- p.1
  }
}
results$m <- factor(results$m)
```

```{r echo=FALSE, fig.cap="Allele frequencies for and island population fixed for the an allele who is getting migrants from a mainland population at various rates ($m$) that does not have the allele present in the island.", warning=FALSE, message=FALSE}
library(ggplot2)
ggplot(results, aes(x=Generation, y=p, color=m)) + geom_line() + xlab("Generation") + ylab("Island allele frequency (p)")
```

So for even low migration, say $m=0.01$, allele frequencies may change rather quickly due to immigration.

## The Island Model

The next most complex model is one where every population is exchanging migrants.  This n-island model was first introduced by Sewell Wright (1931).  In this one, all populations are connected via a constant migration rate.  An allele that arrises in one population through mutation can potentially be dispersed to any other population in a single generation, the likelihood of which is determined by the migration rate. 


```{r nislandmodel, echo=FALSE, fig.cap="An n-island model with three populations, each with its own allele frequencies, with a constant and symmetric migration rate ($m$)."}
knitr::include_graphics("media/Island_Model.png")
```


The stability of this system is quite high.  All populations are sharing migrants and as a consequence, all populations will thus converge on a unified allele frequency, one defined by the global average allele frequency, $\bar{p}$.  The length of time it takes to get to that equilibrium point is determined by how far away from the global mean the population is and the rate at which migrants are distributed.

There are two ways to model this kind of system, one more simple than the other.  The simplest way is to consider that the migrants are a giant migrant pool and from that they are distributed to the populations.  For example, if $\bar{p}$ is the global average allele frequency, the migrant pool could be considered to also have this allele frequency.  If you believe this is a reasonable approximation, then the allele frequencies at the next generation, say for population $X$ in Figure \@ref(fig:nislandmodel), are:

\[
p_{x,t+1} = (1-m)p_{x,t} + m\bar{p}
\]

In a general context, we can estimate what the allele frequencies will be for an arbitrary time in the future if we know:  
- The starting allele frequencies, $p_0$
- The migration rate, $m_{i,j}$
- How many generations migration has been happening, $t$.

This is estimated, following the a similar format that we found in both mutation and inbreeding, as:

\[
p_t = \bar{p} + (p_0 - \bar{p})(1-m)^t
\]

We can examine the change in allele frequencies through time numerically.  Consider a population that starts out at $p_X = 0.1$ but is receiving migrants at a rate of $m=0.05$ each generation.  With these parameters, we can set up an expectation of allele frequencies for, say, $100$ generations, as:

```{r}
T <- 100
pX <- rep(NA,T)
pX[1] <- 0.1
pbar <- 0.5
m <- 0.05
for( t in 2:T)
  pX[t] <- pbar + (pX[1]-pbar)*(1-m)^t
df <- data.frame( Generation = 1:T, Frequency = pX)
```

```{r echo=FALSE, fig.cap="Change in allele frequency for a population starting out at $p_X=0.1$ and receiving migrants from a global migrant pool of $p_m=0.5$ at a rate of $m=0.05$."}
ggplot(df, aes(x=Generation, y=Frequency)) + geom_line() + ylab("Allele Frequency")
```


Through time, the allele frequencies change systematically, tending towards the global allele frequency defined by all populations.  
Some salient assumptions for this model include:  

- Generations do not overlap so that we can use a difference equation approach for understanding connectivity.
- Populations are discrete in that there are breaks between populations.
- Migration rates are constant through both space and time.
- Migration is symmetric in both directions.

This approach may not be the most realistic one, but it does outline a general way in which we can predict allele frequencies through time.  

## Stepping Stone Models

A slightly more realistic model was introduced by Kimura & Weiss (1964) who coined the term ‘stepping-stone' model to indicate one that takes into consideration the spatial arrangement of populations as it describes connectivity.  

```{r echo=FALSE, fig.cap="A 1-dimensional stepping stone model.  Each population is connected to only two others and exchange migrants at a rate of $\\frac{m}{2}$ in each direction each generation."}
knitr::include_graphics("media/Stepping_Stone_Model.png")
```


Their model consisted of an infinite length of populations, all connected with a unified migration rate, $m$.  For any population along this continuum, the frequency of alleles at the next generation was dependent upon the following:  

- The frequency at the current generation for that population, $p_i$.
- The rate at which migrants were arriving at the population from each side.  The assumed that the total migration rate was $m$ and as such they were receiving migrants at a rate of $\frac{m}{2}$ each generation, from each neighbor.
- Each neighbor may have its own initial allele frequency and would change through time with continued exchange of migrants.
- There is some ‘background' level of migration that is going on, consisting of some rate of migrants, $m_{\infty}$, whose allele frequencies are denoted as $p_{\infty}$.  This part of the equation may be a bit difficult to parameterize (or said another way, it may be a nice ‘buffer' to have in your model to help explain things you can't directly explain).
- Each population has some stochastic change in allele frequencies each generation, denoted as $\eta_i$, which may be due to genetic drift and other local processes.  The expectation for this ‘error' term is $E[\eta_i] = 0$ and its variance is $E[\eta_i^2] = \frac{p_i(1-p_i)}{2N_e}$, which is consistent with what we've seen for drift already.

Taken as a whole, their formulation is:

\[
p_{i,t+1} = (1-m_1-m_\infty)*p_i + \frac{m}{2}(p_{i-1,t} + p_{i+1,t}) + m_\infty\bar{p} + \eta_i
\]

Extending on this basic model, they also derived the expectations for a 2-dimensional connectivity model where populations were arrayed on a grid with rows and columns.  They also briefly entertained a 3-dimensional model as well.

For each of these models, the estimated the expectation of the decrease in genetic correlation between pairs of populations with increases in distance between populations.   Perhaps not surprisingly, the correlation is higher in 1-dimensional than in 2-dimensional arrangements.  It is also higher in 2-dimensional than in 3-dimensional ones.  Their paper is an excellent introduction to the mathematics underlying connectivity in discrete space and should be examined in detail by those who are interested in modeling this in a more complete fashion.  However, for completeness here, we will look to a more general approach to understanding connectivity and changes in allele frequencies based upon a specified ‘network' of connectivity.

## The General Model

A more broad approach that can be applied to all types of connectivity models is one that allows one to specify the underlying connectivity model and then estimate allele frequency changes on that model.  Instead of having a single migration rate for all populations or a rigid arrangement of populations, if you can specify the topology of a connectivity network, you can use the following approach to estimate allele frequency changes.  We will start with the simple case of three populations, each with their own frequencies and connected by individual migration rates.  In this system, we can consider the frequencies for population X as being derived by the frequencies of all populations with which it exchanges migrants as well as their individual rates.  Here is an example from the diagram below.

\[
p_{x,t+1} = m_{x \leftrightarrow y}p_{y,t} + m_{x \leftrightarrow z}p_{z,t} + [1 - (m_{x \leftrightarrow y} + m_{x \leftrightarrow z})]p_x
\]

```{r echo=FALSE, fig.cap="Generic multi-population model for gene flow with individual migration rates."}
knitr::include_graphics("media/General_Model.png")
```

The current frequency is determined by the frequencies of all connected populations (both $Y$ and $Z$ in this case) at the previous generation and their individual migration rates (these are the immigrants) and the residents ($1 - m_{all\;migration\;to\;this\;population}$) and its previous allele frequency. In a similar fashion, the other populations are:

\[
p_{y,t+1} = m_{x \leftrightarrow y}p_{x,t} + m_{y \leftrightarrow z}p_{z,t} + [1 - (m_{x \leftrightarrow y} + m_{y \leftrightarrow z})]p_y
\]

and 

\[
p_{z,t+1} = m_{x \leftrightarrow z}p_{x,t} + m_{y \leftrightarrow z}p_{y,t} + [1 - (m_{x \leftrightarrow z} + m_{y \leftrightarrow z})]p_z
\]

In R, we can iterate through this and see these behaviors.  Here we look at three populations, each starting with different allele frequencies and estimate allele frequencies for a period of T = 75 generations.  

```{r}
T <- 75
pX <- rep(NA,T)
pY <- rep(NA,T)
pZ <- rep(NA,T)
pX[1] <- 0.1
pY[1] <- 0.5
pZ[1] <- 0.9
```

These populations exchange migrants with different basal rates.  For this example, we will assume the exchange of migrants is symmetric, though it is not necessary to do so, and you can see how the following code is extended to account for asymmetry.

```{r}
mXY <- 0.04
mXZ <- 0.02
mYZ <- 0.08
```

Then, the simulation is run across the T generations and at each iteration, the frequencies of each population is updated based upon these migration rates and the frequencies of the populations from which the immigrants come.

```{r}
for( gen in 2:T){
  pX[gen] <- mXY*pY[gen-1] + mXZ*pZ[gen-1] + ( 1 - (mXY+mXZ))*pX[gen-1]
  pY[gen] <- mXY*pX[gen-1] + mYZ*pZ[gen-1] + ( 1 - (mXY+mYZ))*pY[gen-1]
  pZ[gen] <- mXZ*pX[gen-1] + mYZ*pY[gen-1] + ( 1 - (mYZ+mXZ))*pZ[gen-1]
}
df <- data.frame( Generation=rep(1:T,times=3))
df$Frequency=c(pX,pY,pZ)
df$Population <- rep(c("X","Y","Z"), each=T)
```

Through time, we see that the allele frequencies all converge on the global allele frequency

```{r}
mean( c(pX,pY,pZ))
```


```{r, echo=FALSE, fig.cap="Change in allele frequencies through time for three populations connected by differential rates of gene flow ($m_{XY} = 0.04$, $m_{XZ}=0.02$, $m_{YZ}=0.08$)."}
ggplot(df, aes(x=Generation, y=Frequency, color=Population)) + geom_line() + ylab("Population Allele Frequency")
```


There are a couple of points here to be made.  

- The rate at which allele frequencies change depends upon the $\delta p = p_i - \bar{p}$ and the migration rate.  The approach to $\bar{p}$ was faster for population $Z$ than population $X$ as the migration rates to and from $X$ are overall lower than from $Z$.  
- Some situations, depending upon the configuration of the connectivity ‘network,' can result in reversals of direction in allele frequency change. Look at the population $Y$. Its allele frequencies started out at $\bar{p}$ and initially increased towards those of population $Z$.  This is because the relative rate of migration between $Y$ & $Z$ is greater than between $Y$ & $X$.

These observations suggest that when you are doing simulations, you need to wait a bit to allow the dynamics of the system to ‘burn in' a bit.  The equilibrium point we are interested in seeing is only attainable after the entire system has gone through many generations and the idiosyncrasies have been thoroughly iterated through.    

## General Strategies

In dealing with these kinds of dynamical systems, there are some important things that you need to consider when attempting to understand the underlying dynamics.  These are general rules, though there are always exceptions.  However, if you follow them, you will probably have a much easier time with the calculations.  

1. Draw out the network if you can.  If it is too complicated, then at least try to capture the main flows through the network, graphically.  Visualization of network structure is very helpful for understanding where you think things should be going.
2. In most systems, allele frequencies will tend towards $\bar{p}$ for all populations given enough time.  In cases where populations are isolated (e.g., $m=0$), or migration is unidirectional (as in the mainland/island model), you may have components of the overall topology that will never stabilize on $\bar{p}$.  
3. Once specified, iterate through each generation slowly, checking to see if you are parameterizing the system correctly.  There are an infinite number of ways for you to not get it correct via a typo or other programatic error.  Always ‘sanity check' your results.

With these strategies in mind, you should be able to attack any population arrangement.

<!--chapter:end:population_models.rmd-->

# Genetic Diversity {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_milkweedbug.jpg"></div>

At a base level, genetic diversity is the fundamental components upon which evolution operates. Without diversity, there is no evolution and as such species cannot respond to selective pressure. Genetic diversity is a property of sampling locales.  It is created and maintained by demographic and evolutionary processes and the history of the organisms being examined.  It is also used as a surrogate measure for the consequences of several microevolutionary processes. In this section, we will examine how to estimate genetic diversity within a sample of individuals.

Estimates of within genetic diversity depend solely upon what you consider 'within a group.'  Often we use terms like Population, Deme, etc., but these have specific evolutionary and/or demographic meanings.  We are, however, largely ignorant if the samples we have collected are technically a part of a 'Population' in an evolutionary or at least practical random mating context.  As such, I will use the term population loosely, indicating that it is a collection of individuals sample from a geographic locale.  I am implicitly assuming a functional definition here (as I do in my research) that individuals sampled from the same 'Population' have a much higher probability of mating together than individuals sampled from different 'Populations.  This is 'Population Genetics' after all...

Here we will use the *Araptus attenuatus* co-dominant locus dataset that is included with the gstudio library.

```{r message=FALSE, warning=FALSE}
library(gstudio)
data(arapat)
```

Now the data is loaded into our session and we can extract the names of the loci (using `column_class()`, a convenience function returning the names of columns of a particular type). 

```{r}
locus_names <- column_class( arapat, "locus")
locus_names
```

Genetic diversity is estimated in R using the function `genetic_diversity()` contained within the gstudio library.  Here is the documentation for this function.  We will walk through the various parameters and illustrate their use with this dataset.

<div style="height:250px;white-space:pre-wrap;overflow:scroll;padding:8px;background:#eee;outline: 1px #eee;font-size:small"><pre>
genetic_diversity {gstudio} 						R Documentation

Estimate genetic diversity

Description

This function is the main one used for estimating genetic diversity among strata. Given the large number of genetic diversity metrics, not all potential types are included.

Usage
genetic_diversity(x, stratum = NULL, mode = c("A", "Ae", "A95", "He", "Ho",
  "Fis", "Pe")[2])

Arguments

x - A data.frame object with locus columns.
stratum - The strata by which the genetic distances are estimated. This can be an optional parameter when estimating distance measures calculated among individuals (default='Population').
mode - The particular genetic diversity metric that you are going to use. The gstudio package currently includes the following individual distance measures:
	A   	Number of alleles
	Ae	Effective number of alleles (default)
	A95	Number of alleles with frequency at least five percent
	He	Expected heterozygosity
	Ho	Observed heterozygosity
	Fis	Wright's Inbreeding coefficient (size corrected).
	Pe	Locus polymorphic index.

Value

A data.frame with columns for strata, diversity (mode), and potentially P(mode=0).

Author(s)

Rodney J. Dyer rjdyer@vcu.edu

Examples

AA <- locus( c("A","A") )
AB <- locus( c("A","B") )
BB <- locus( c("B","B") )
locus <- c(AA,AA,AA,AA,BB,BB,BB,AB,AB,AA)
locus2 <- c(AB,BB,AA,BB,BB,AB,AB,AA,AA,BB)
Population <- c(rep("Pop-A",5),rep("Pop-B",5))
df <- data.frame( Population, TPI=locus, PGM=locus2 )
genetic_diversity( df, mode="Ae")
</pre></div>


## Allelic Diversity

At the base level, a collection of alleles in a dataset provides an estimate of the breadth of genetic variation present. There are many measures commonly used in population genetics, though $A$, $A_{95}$, and $A_e$ are quite commonly encountered.  These terms are defined as:

*Frequency Independent Diversity* - The parameter $A$ measures the raw count of the number of alleles present in the sample.  This is a frequency independent estimate of diversity as it does not matter if the allele is only seen once or is seen hundreds of times in the sample.  

Technically, we define this parameter as  

\[
A = \ell
\]

where $\ell$ is the number of observed alleles.  We can estimate allelic diversity by passing the function either a vector of locus objects

```{r}
genetic_diversity( arapat$MP20, mode="A" )
```

or a `data.frame` that has one or more columns of data that are locus objects as determined by the `column_class()` function.  

```{r}
genetic_diversity( arapat, mode="A" )
```

The concern associated with using a parameter such as $A$ is that it throws away any information about relative abundance.  For example, a dataset with alleles found at an abundance of $N_A = 999$ and $N_B = 1$ would have the exact same frequency independent diversity as another sample with $N_A = N_B = 500$, namely $A = 2$.  

*Frequency Dependent Diversity* - A slightly more stringent approach may be to estimate the diversity of alleles in a frequency dependent fashion, specifying a particular cutoff.  The parameter $A_{95}$ is a common parameter that provides a count of the number of alleles present in a sample whose frequencies exceed a threshold of 5%.  

\[
A_{95} = \left| A_{freq(\ell)>0.05} \right|
\]

This threshold is entirely arbitrary in the same way that we use $\alpha=0.05$ as a cutoff in the teaching of formalized hypothesis testing.  It could just as easily be 3%, 1%, or 10% but has a historical trajectory  in its current form.  The $A_{95}$ parameter does remove some of the problems associated with rare alleles in our data sets as is shown by comparing the diversity of the most allele rich locus MP20 using $A$

```{r}
mp20.A <- genetic_diversity( arapat$MP20, mode="A")
mp20.A
```

and $A_{95}$

```{r}
mp20.A95 <- genetic_diversity( arapat$MP20, mode="A95")
mp20.A95
```

Here 15 of the alleles we observed in the full dataset occurred at a frequency less than 5%!

*Effective Number of Alleles* - Given the arbitrary nature of the cutoff for $A_{95}$, a more parsimonious approach may by one that measures the contribution of each allele to diversity in proportion to its frequency.  This is $A_e$ and it is defined as:

\[
A_e = \frac{1}{\sum_{i=1}^\ell p_i^2}
\]

and is the default mode of diversity provided by the `genetic_diversity()` function (e.g., the one you get if you do not specify `mode=`).

```{r}
genetic_diversity( arapat, mode="Ae")
```

Notice that $A \ge A_e \ge A_{95}$, as $A_e$ integrates the contributions of those low frequency alleles.  None of these parameters are perfect, though all provide some measurement of diversity at the level of the allele.

## Genotypic Diversity

In addition to diversity at the level of alleles, we also can estimate diversity at the level of the genotype.  There are several parameters that can be derived for genotypic diversity but the most common are observed and expected heterozygosity.  Observed heterozygosity is simply the fraction of all genotypes in the sample that are not homozygotes.

\[
H_O = \frac{N_{ij}}{N}
\]

where $N_{ij}$ is the count of all heterozygous genotypes.  The `mode="Ho"` option in `genetic_diversity()` returns these values. 

```{r}
genetic_diversity(arapat, mode="Ho")
```

The second parameter, expected heterozygosity ($H_e$), is slightly different in that it is a parameter that needs to be estimated from data.  It represents the fraction of genotypes in the dataset that are expected to be heterozygous, if the sample is in equilibrium.  The value for this function depends upon the data that you are estimating and how it is configured.  If it is estimated from a single locale, then the parameter can be estimated directly as the expected fraction that are not homozygous.

\[
H_E = 1 - \sum_{i=1}^\ell p_i^2
\]

(where $\ell$ is the number of alleles at the locus).  Here is an example of expected heterozygosity for the LTRS and MP20 loci.

```{r}
genetic_diversity( arapat$LTRS , mode="he")
genetic_diversity( arapat$MP20, mode="he" ) 
```

In an ideal world, we would be able to have enough resources to sample for these parameters with so many individuals, they would be well behaved.  Unfortunately, we do not live in the real world and in many cases our ability to sample lots of individuals is limited.  This problem is not a new thing, we estimate the sums of square and divide by $N-1$ instead of $N$ because if we we use the later, we are producing a biased estimate.  Unless you are working with *all* the samples, every time we make population genetic inferences we will be doing so on a subsample of the total data available.  The `arapat` data set is no where near a major component of the species, it is only a sample.  From this sample, we *estimate* allele frequencies on which we derive our *estimates* of parameters such as $H_E$.  As such, if we used the formulas above to estimate everything in every study, we would have biased estimators of... pretty much everything.  However, the fix is not one that is too difficult to put in, and it has a pretty well established set of behaviors.  As such, if we are estimating these parameters in a single locale (for more than one location, we have further corrections to make), we should probably apply a small sample size correction.  In general, this correction is:

\[
\hat{H}_e = \frac{2N}{2N-1}\left[ 1-\sum_{i=1}^\ell p_i^2 \right]
\]

The front part is a small sample size correction factor.  Its importance in your estimation diminishes as $N \to \infty$ as shown in Figure \@ref(fig:biasCorrection).  Once you get above $N=9$ individuals, there is an inflation of the estimated heterozygosity at a magnitude of less than 5%. 

```{r biasCorrection, echo=FALSE, fig.cap="Magnitude of the correction factor for small sample size estimations as a function of N.", message=FALSE}
 library(ggplot2)
N <- 1:100
CF <- (2*N) / ( 2*N-1)
df <- data.frame( N, CF)
ggplot( df, aes(N,CF) ) + geom_line()
```


## Samples from Several Locales 

In addition to problems associated with estimating allele freqeuncies incorrectly (and requiring a sample size correction), when we estimate data from several locations, we also have a problem assocaited with the subset of locales relative to the total popualtion size, resulting in a furhter correction to account of the several biased samples you are taking.

\[
H_S = \frac{\tilde{N}}{\tilde{N}-1}\left[ 1 - \sum_{i=1}^\ell \frac{p_i^2}{K} - \frac{H_O}{2\tilde{N}} \right]
\]

where $\tilde{N}$ is the harmonic mean number of genotypes sampled across each of the $K$ strata.  Notice here I use the term $H_S$ instead of $H_E$ so that there isn't any doubt about the differences when we write and talk about these parameters.  This formulation (after Nei 1987) corrects for the sampling across separate  locations.  To indicate that your estimate is being made using subdivided groups of samples, pass the `stratum=` parameter to `genetic_diversity()` and set `mode="Hes".  We can see the magnitude of the correction by looking at a single population and comparing the estimates of $H_S$ for corrected and non-corrected parameters.

```{r}
pops <- arapat[ arapat$Population %in% c("32","101","102"),]
he <- genetic_diversity( pops, mode="He")
hes <- genetic_diversity(pops, stratum="Population", mode="Hes")[1:8,]
df <- data.frame(Locus=he[,1], He=he[,2], Hes=hes[,2])
df
```

```{r echo=FALSE}
df <- df[ !is.na(df$Hes),]
p <- format(round(cor(df$He,df$Hes),2),nsmall=3)
```


They are pretty close (Pearson's $\rho =$ `r p`), even when there are only `r nrow(pops)` individuals in the sample.^[The `NA` in the $H_{ES}$ parameter is because there are no smaples in one of the populations and a harmonic mean with a zero in it results in a divide-by-zero error.]  But they are off and this is a vitally important distinction because if you do not account for these differences you will percolate these errors up through your subsequent analyses (and this is a bad thing).

## Multilocus Diversity

There are several measures of individual locus diversity but few for multilocus diversity.  One potential measure for diversity across loci is to based upon the fraction of population that has unique multilocus genotypes.  This is defined as:

\[
D_m = \frac{N_{unique}}{N}
\]

and can be estimated using the function `mulitlocus_diversity()`.  Looking across the putative species indicated in the data set, we can see that in general, the Cape populations are much less diverse than those individuals samples throughout Baja California.

```{r}
multilocus_diversity( arapat[ arapat$Species=="Cape",])
multilocus_diversity( arapat[ arapat$Species=="Mainland",] )
multilocus_diversity( arapat[ arapat$Species=="Peninsula",] )
```

This is a pretty crude measurement but later when we examine models based upon conditional multilocus genetic distances, we need to make sure that the samples are both allelic rich and multilocus diverse and this approach is a nice way to do that.


<!--chapter:end:genetic_diversity.rmd-->

# Rarefaction {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_pepsis.jpg"></div>

The primary reason for looking at diversity is to perform some comparison, which provides some insights into the biological and/or demographic processes influencing your data.  Without a basis for comparison, diversity estimates are just numbers. However, deriving an estimate of diversity is a statistical sampling process and as such we must be aware of the consequences our sampling regime has on the interpretation of the data.  This is where rarefaction comes in, a technique commonly used in ecology when comparing species richness among groups.

Here is an example of the problem sampling may interject into your analyses.   Consider a single locus with four alleles.

```{r message=FALSE, warning=FALSE}
library(gstudio)
data(arapat)
f <- data.frame( Allele=LETTERS[1:4], Frequency=0.25)
f
```

Selected as a random sample from an infinite population. The first sample has 5 individuals.

```{r}
pop1 <- make_population( f,N=5)
ae1 <- genetic_diversity( pop1, mode="Ae" )
ae1
```

And the second one has 100 individuals.

```{r}
pop2 <- make_population( f, N=100 )
ae2 <- genetic_diversity( pop2 )
ae2
```

The difference in estimated diversity among these groups are `abs( ae1 - ae2 ) = 0.15`.  Is this statistically different or are they the same?  Is it just because we sampled more individuals in the second set that we get higher values of $A_e$? Consider the `MP20` locus in the beetle data set, it has a total of 19 alleles present.  If we subsample this data and estimate the number of observed alleles, we see that there is an asymptotic relationship between sampling effort and estimates of allelic diversity.  Here is the code for estimating frequency independent diversity, $A$, using these data.

```{r}
loci <- arapat$MP20
sz <- c(2,5,10,15,20,50,100)
sample_sizes <- rep( sz, each=20 )
Ae <- rep(NA,length(sample_sizes))
for( i in 1:length(Ae)){
  loci <- sample( loci, size=length(loci), replace=FALSE)
  Ae[i] <- genetic_diversity( loci[1:sample_sizes[i]], mode="A" )
}
```

The ‘curvy' nature of this relationship shows a few things.  

It takes a moderate sample size to capture the main set of alleles in the data set.  If we are looking at allocating sampling using only 5 individuals per locale, then we are not going to get the majority of the alleles present.

For the rare alleles, you really need to grab large at-site samples if estimates of diversity are the main component of what you are doing.  Do rare alleles aid in uncovering the biological processes you are interested in studying?  They may or may not.

For most purposes, we will use all the samples we have collected.  In many cases though, some locales may not have as many samples as other ones.  So, even with these data, if I have one locale with 10 samples and another with 50, how can I determine if the differences observed are due to true differences in the underlying diversity and which are from my sampling?  Just as in testing for HWE, we can use our new friend permutation to address the differences.

Rarefaction is the process of subsampling a larger dataset in smaller chunks such that we can estimate diversity among groups using the same number of individuals.  Here is an example in the beetle dataset where I am going to look at differences in diversity among samples ($N = 75$) collected in the cape regions of Baja California 

```{r}
ae.cape <- genetic_diversity( arapat[ arapat$Species=="Cape", "WNT"] )
ae.cape
```

and compare those to the genetic diversity observed from a smaller collection of individuals sampled from mainland Mexico ($N=36$).

```{r}
ae.mainland <- genetic_diversity( arapat[ arapat$Species=="Mainland", "WNT"] )
ae.mainland
```

The observed difference in effective allelic diversity, $A_{e,mainland} == A_{e,cape}$, could be because the Cape region of Baja California is more diverse or it could be because there are twice as many individuals in that sample.   

To perform a rarefaction on these data, we do the following:

Use the size of the smallest population ($N$) as the sample size for all estimates.
Randomly sample individuals, without replacement, from the larger dataset in allocations of size $N$.

Estimate diversity parameters on these subsamples and repeat to create a 'null distribution' of estimated diversity values.

Compare your observed value in the smallest population to that distribution created by subsampling the larger population.

From the data set, this is done by

```{r}
cape.pop <- arapat[ arapat$Species=="Cape","WNT"]
null.ae <- rarefaction( cape.pop, mode="Ae",size=36)
mean(null.ae)
```

So even if the samples sizes are the same, the mean level of diversity remains relatively constant.  The range in diversity

```{r}
range(null.ae)
```

is quite large.  Since this estimate is frequency based, random samples of alleles change the underlying estimate of Ae during each permutation.  

The observed estimate of diversity in the Mainland populations does fall within this range.  However, the null hypothesis states that $A_{e,mainland} = A_{e,cape}$ and if this is true, once we standardize sample size, we can take the distribution of permuted Ae values as a statement about what we should see if the null hypothesis were true.  As such, we can treat it probabilistically and estimate the probability that $A_e$, Mainland is drawn from this distribution.

```{r}
null.ae <- c( null.ae, ae.mainland[1,1])
P <- sum( ae.mainland <= null.ae ) / ( length(null.ae) )
P
```

Or graphically, it can be depicted as below.

## Mapping Diversity

Estimating diversity is great and being able to compare two or more groups for their levels of diversity is even better.  But often we are looking for spatial patterns in our data.  Both R and gstudio provide easy interfaces for plotting data and later in the text we will see how to integrate raster and vector data into our workflows for more sublime approaches to characterizing population genetic processes.  In the mean time, it is amazingly easy to use basic plotting commands to get pretty informative output.  In this example, I extract the mean coordinate of each stratum in the arapat dataset and then estimate diversity at the level of these partitions and merge the diversity estimates for the AML locus into the coordinate data.frame.  

```{r}
library(gstudio)
data(arapat)
diversity <- genetic_diversity(arapat, stratum="Population", mode="Ae")
coords <- strata_coordinates(arapat)
coords <- merge( coords, diversity[ diversity$Locus == "AML", ] )
```

Then, I grab a map from the Google server and map my populations with diversity depicted as differences in the size of the points.  Note that the `ggmap()` function provides the base map that is retrieved but when we use `geom_point()` we need to specify the `aes()` and the `data=` part as these data are from the `data.frame` we made, not from the map we grabbed from Google.

```{r}
library(ggmap)
map <- population_map(coords)
ggmap(map) + geom_point( aes(x=Longitude,y=Latitude,size=Ae), data=coords)
```


<!--chapter:end:rarefaction.rmd-->

# Genetic Distances {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_northbaja.jpg"></div>

One can measure the relative distance between items based upon some basic assumptions.  In Euclidean geometry, the underlying distance measures are based upon the triangle inequality.  The same kinds of approaches are available in characterizing genetic separation, either among individuals or among locales.  This chapter introduces some of the methodologies used in estimating genetic distances, how we analyze them, and how we portray our results in a graphical fashion amenable for interpretation.

Underlying most estimates of genetic distance are sets of geometric and/or evolutionary assumptions.  There are many different metrics that could be estimated from our genetic data, many of which will tell us the same general story.  However, there are specific features of our data that may make one kind of metric more appropriate than another type.

There are two main characteristics that control our ability to map collections of genotypes onto a real number line; Self identity and Symmetry.  Self identity implies that the distance, $\delta$, between an object (a single genotype or a collection of genotypes in a population) and itself must be zero: $\delta_{ii}=0$.  You are absolutely identical to yourself.  Symmetry implies that the distance between objects is independent of the order in which we measure it; $\delta_{ij} = \delta_{ji}$.  This can be a more problematic assumption in some cases and when we examine estimation of ecological distances later, we will see this may not be the case.  

If these two conditions are satisfied, the distance may be considered *metric* if (and only if) it satisfies the triangle inequality, $\delta_{ij} \le \delta_{ik} + \delta_{kj}$.  We've seen this relationship in Euclidean Geometry, the squared length of the hypotenuse is equal to the sum of the squared length of both legs in a triangle; $z^2 = x^2 + y^2$.

Genetic distance itself is not the final product that allows us to gain inferences about underlying population genetic processes.  Rather, it is the input into subsequent analyses.  This section focuses first on how we estimate distances, depending upon the level of interest, and then highlights some approaches for visualizing and gaining inferences from the distance matrix itself.  In later chapters, we return to the use of distance matrices when we examine network models and analyses based upon the idea of isolation models commonly used in modern population genetic analyses.

## Individual Genetic Distance

Genetic distance can be estimated based upon differences at many levels, the most basal of which is among individuals.  In the discussion of the individual genetic distances, the following genotypes will be used for illustrative purposes.

```{r message=FALSE}
 library(gstudio)
AA <- locus(c("A", "A"))
AB <- locus(c("A", "B"))
BB <- locus(c("B", "B"))
BC <- locus(c("B", "C"))
AC <- locus(c("A", "C"))
CC <- locus(c("C", "C"))
Locus <- c(AA, AB, AC, BB, BC, CC)
Locus
```

### Euclidean Distance

The most basic distance metric we can use would be Euclidean.  Intuitively, we have a good idea about how this metric works as we use it every day.  Euclidean distance is estimated as:

\[
\delta_{eucl} = \sqrt{\sum_{i=1}^\ell (p_{xi}-p_{yi})^2}
\]

where $p_{x•}$ is a vector of allele frequencies measured on the $x^{th}$ individual. We can create these by translating the raw genotypes into multivariate data as we in the previous section on individual and population assignment.

```{r}
mv.genos <- to_mv(Locus)
mv.genos
```

and then using the `dist()` function that R has built-in to estimate $\delta_{eucl}$

```{r}
D.euc <- dist( mv.genos )
D.euc
```

The output variable, `D.euc`, is of type

```{r}
class(D.euc)
```

which is definitely not a matrix object. It is a specific data type that is used for dealing with distance matrices.  There are some analyses (variants of AMOVA and the Mantel test come to mind) that require the input matrix to be of type dist rather than of type matrix and don't do the conversion for you directly.

Because most distance matrices have a zero diagonal ($1^{st}$ requirement outlined previously) and symmetry (the upper diagonal is equal to the lower diagonal, the $2^{nd}$ requirement), it is possible to save computer memory by only dealing with the data below the diagonal.  Some analyses, such as an implementation of the Mantel test, require the use of dist objects instead of matrices.  

It is easy to convert between the two object types as necessary.

```{r}
D.euc <- as.matrix( D.euc )
D.euc
```

if you need to perform operations that are not available to dist objects (like matrix algebra).

### AMOVA Distance

Another geometric interpretation of inter-individual distance is that of Excoffier *et al.* (2004), dubbed AMOVA distance.  Smouse & Peakall (1999) provided the geometry of this relationship for diploid loci when discussing spatial autocorrelation.  AMOVA distance can be visualized as the distance between vertices and mid-points on an equilateral triangle.  Each segment is defined as having unit length and the distance is taken as the square of the distance between genotypes. 

```{r echo=FALSE}
knitr::include_graphics("media/AMOVA_Distance.png")
```

In R, we calculate it as:

```{r}
D.amova <- genetic_distance( Locus, mode="AMOVA" )
rownames(D.amova) <- colnames(D.amova) <- as.character( Locus )
D.amova
```

For completeness, the previous figure should be drawn as a tetrahedron with the possibility of four alleles at a locus (allowing for distances between two heterozygotes that do not share any alleles).  However, that is not an easy thing to draw for me...  The only interesting distance you'll need to think about (and perhaps grab a piece of paper and draw some triangles) is the distance between two two heterozygotes that share no alleles.  I'll leave it up to you to figure out that one (or use `genetic_distance()` for it).  

### Bray-Curtis Individual Distance

A distance metric that can be used to test differences among individuals or populations is that of Bray-Curtis.  This distance is a transformation of Jaccard's distance (see below) and is derived from an ecological ordination paper by Bray & Curtis (1957).

\[
D_C = 1 - 2\frac{\sum_{i=1}^\ell min(p_{X,i},p_{Y,i})}{\sum_{i=1}^\ell p_{X,i} + \sum_{i=1}^\ell p_{Y,i}}
\]

This parameter is bound by 0 on the lower end, indicating that the two items being compared are identical and 1 at the upper end indicating complete dissimilarity.  In R, this distance is defined in the vegan package for normal vegetation analysis and in gstudio for genetic data.  There is some confusion in the literature as to how this distance metric should be calculated and it is implied by Yoshioka (2008) that at least some of the implementations are actually Czekanowski distance.  At the time of this writing, there are some discrepancies between the distances that are calculated in different programs, whose names will be withdrawn to protect the innocent, so be aware.  In the end, it is not fealty to a particular distance metric that is important in our analyses, it is the ability of some metric to describe the variation we see.  

The implementation in gstudio uses allele counts across all loci.

### Ladder Distance

For microsatellite loci, which have a specific step-wise mutation model, a genetic distance metric can be created based upon the number of repeat motif differences at genotypes rather than just binary same/different alleles (as in the AMOVA distance before).  If the stepwise mutation process at microsatellite loci has played a significant role in shaping the structure on the landscape then we expect the kinds of spatial signal that is present in a ladder distance approach should differ from an approach that does not take into consideration the unique characteristics of this kind of locus.  If they say provide very similar kinds of inferences then it is more likely that mutation, as a process, has not shaped the history to a large extent.


## Population-Level Genetic Distances

If we are collecting data in such a way as to have pre-defined (or determined ex post facto) groups (populations or locales), several additional distance metrics can be derived.  Below are a few of the more common ones that you will run into.  It should be said that there is no dearth of genetic distance metrics available.

### Czekanowski (Manhattan) Distance

Perhaps the simplest distance metric is that of Czekanowski—more commonly referred to as Manhattan distance.  It is called Manhattan distance because it is a distance metric that is analagous to how you measure city-block distances.  It is relatively difficult in a city to walk from one place to another in a straight line, rather we are forced to walk down this street, turn walk down that one, etc.  

```{r echo=FALSE}
knitr::include_graphics("media/Chezkanowski_Distance.png")
```


Population allele frequencies can be plot on the linear axis $\sum_{i=1}^\ell p_i = 1$ and the distance between populations is the sum of the lengths separating each population on each of the $\ell$ axis.  This is calculated as:

\[
D_{Cz} = \frac{1}{2}\left| p_{i,x} - p_{i,y} \right|
\]

where the pipes, $|x|$, indicate taking the absolute value.  

### Rogers (Euclidean) Distance

Stepping up, slightly, in complexity, we get to Roger's distance.  This distance metric is roughly equivalent to Euclidean distance (and indeed in gstudio it is called euclidean).  

```{r echo=FALSE}
knitr::include_graphics("media/Rodgers_Distance.png")
```


Here we consider the allele frequencies as before but instead of taking the absolute values of the distances along each axis, we take the straight-line distance.   This is estimated as:

\[
D_R = \sqrt{ \frac{1}{2} \sum_{i=1}^\ell\left( p_{i,x} - p_{i,y}\right)^2 }
\]

This is a more common distance metric than that of Czkanowski and has been used in classification and taxonomy.  There are a few drawbacks to this distance metric that stem from not being based upon population genetic theory.  These may, or may not, be relevant to what you are doing with your data and where your data is from.  

1. This distance is not proportional to time of separation of populations.  There is no drift component to it.  
2. This distance is not proportional to the number of base-pair substitutions, which may be important if you are looking at haplotype divergence.  
3. This distance may loose sensitivity with increased allelic diversity.

If you are using this metric at short time intervals, have loci of moderately diversity, or are not concerned about homology, this is a quick metric.

### Nei's Genetic Distance

One of the most common distance metrics used is that of Nei, and is estimated as:

\[
I = \frac{\sum_{i=1}^L\sum_{j=1}^{\ell_{i}} p_{ij,x}p_{ij,y}}{\sqrt{\sum_{i=1}^L\left(\sum_{j=1}^{\ell_i} p_{ij,x}^2\right)\sum_{i=1}^L\left(\sum_{j=1}^{\ell_i} p_{ij,y}^2\right)}} 
\]

but more commonly referred to using the transform:

\[
D_{Nei} = -\ln(I)
\]

This metric is roughly linear in time, assuming a drift-mutation equilibrium.  Populations with longer periods of separation will have larger values of $D_{Nei}$.  If you think that drift and mutation are significant features that have shaped the divergence of your populations, then this is an appropriate metric to use.

### Conditional Genetic Distance

There is a final genetic distance that should be discussed here, Conditional Genetic Distance, derived from a network abstraction of genetic covariance by Dyer & Nason (2004).  However, both the rationale and the approach that produce these measures of distance rely on some approaches that need further development and will be put off until a more complete treatment can be done in the chapter on population graphs.

<!--chapter:end:genetic_distance.rmd-->

# Genetic Structure {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_xian_kan.jpg"></div>

The term 'structure' is used in many different ways in population genetics.  Essentially, data has genetic structure if there is a non-random association of genotypes across your sampling locations.  The interaction of population genetic processes create the genetic structure we see, though it is not easy to infer historical process from the presence of structure alone.  For this we need to rely upon more cleaver experimental design.  In this chapter we examine parameters designed to describe the amount of structure present in a dataset and how we can derive some level of inference from their magnitudes.

Structure in populations influences the distribution (spatial, ecological, and temporal) distribution of of alleles.  Until this point, we have focused on how alleles coalesce into genotypes at the individual level, local allele frequencies and the Hardy Weinberg expansion provide an expectation for the probability of seeing a heterozygote, for example.  As we scale up from the individual itself to the deme (or sub-population) in which it resides, we can also estimate heterozygosity.  As we coalesce these demes into populations, regions, continents, etc. we can also estimate expectations for heterozygosity at these levels as well. The deviance from these expectations at all levels can be captured in a statistic—the $F$-statistic we've seen already is on of them—useful for comparisons among studies, sites, etc. The $F$-statistics are perhaps the most misused parameters in population genetics but they have both a long history of use and pretty robust expectations with respect to how they will respond to various evolutionary processes.

In this section, we will explore genetic structure, sensu lato, providing examples of single and multilocus statistics and how we currently think they should be used.  

## F-Statistics and Thier Ilk

The most commonly used parameter for in population genetics are derived from Wright's $F$-Statistics.  We have already encountered one of them, the inbreeding parameter F, which we calculated using the `genetic_diversity()` function using `mode="Fis"`.  This parameter compares the level of observed heterozygosity to that expected under HWE as:

\[
F_{IS} = 1 - \frac{H_O}{H_E}
\]

If the locale in question is producing heterozygotes at a rate consistent with HWE then there is no 'inbreeding' (as we defined it previously).  This being said, the expected frequency of heterozygotes can be estimated at several different demographic levels if you have data that is subdivided.  Assuming you have individuals, sampled from K different locales, you can estimate heterozygosity at the following levels:  

- Individual heterozygosity ($H_I$):  This is what we've been calling HO, observed heterozygosity and is estimated as the fraction of heterozygotes in your sample for a particular locale, $H_I = \frac{N_{ij}}{N}$.  This value can be interpreted as the “average heterozygosity of all genes in an individual" or as “the probability of observing a heterozygote at a particular locus."  
- Subpopulation heterozygosity ($H_S$):  This is the rate of heterozygosity expected at a particular sampling sub-population (or what I refer to in this text as a locale).  This is estimated as the average in expected heterozygosity ($H_E$) across all $K$ sampling locales; $H_S = \frac{1}{K}\sum_{i=1}^K 2p_iq_i$.  It assumes that each of your sampling locales is in HWE, each conforming to the host of assumptions necessary to satisfy that condition.  
- Total heterozygosity ($H_T$):  This is the total heterozygosity across all the data.  This is the expected heterozygosity if all the data were merged and mating as a single panmictic population.  This parameter is estimated as $H_T = 2\bar{p}\bar{q}$, where the frequencies for each allele are determined as the average across all sampling locales.

If the underlying mating patterns among subdivided populations restrict *gene flow*, then these estimates of heterozygosity will change. At one extreme, if all populations are mixing freely and it is essentially a single panmictic population then heterozygosity at the sub-population and total population levels will be equal, $H_T=H_S$, independent of the amount of inbreeding (or consanguineous mating that is actually occurring).  At the other extreme, if all locales are genetically isolated from each other, then each will be diverging its own evolutionary trajectory both sub-population and total estimates of heterozygosity will diverge.  In fact, it is exactly these relationships between estimates of heterozygosity that Sewell Wright used to derive the oft-misused $F$-Statistics.  

To date, we've already used one of these parameters, when examining inbreeding.  Our $F$-statistic is the first demographic level.  Using this new terminology for heterozygosity, the inbreeding statistic is defined as: 

\begin{aligned}
F_{IS} &= \frac{H_S - H_I}{H_S} \\
& = 1 - \frac{H_I}{H_S}
\end{aligned}

where the subscripts on the parameters reveal the sampling level being used.  In this case, the subscripts on $F_{IS}$ stand for “inbreeding of _I_ndividuals relative to the _S_ubpopulation they are sampled."  The values of this parameter are positive when we see less heterozygotes than expected, and can be negative when we see more heterozygotes than expected under HWE.  This parameter makes particular sense in light of unknown subdivision in populations, the Wahlund Effect discussed earlier.  

At the next level, we can examine inbreeding of the Subpopulation relative to the Total data set, $F_{ST}$.  This is defined as:

\begin{aligned}
F_{ST} &= \frac{H_T - H_S}{H_T} \\
&= 1 - \frac{H_S}{H_T}
\end{aligned}

This parameter has been so misused in the literature, it is almost a caricature of itself.  As we see from the formula, it is the reduction in heterozygosity of the subpopulation relative to the entire data set.  This parameter can be either positive or negative.  Wright's original formulation of the parameter $F_{ST}$ was as:

\begin{aligned}
F_{ST} & = \frac{\sum_{i=1}^\ell \sigma_{q_{S(i)}}^2}{\sum_{i=1}^\ell \left[ q_{T(i)}\left( 1 - q_{T(i)}\right) \right]} \\
& = 1 - \frac{y_{ST}}{y_T}
\end{aligned}

Where $\sum_{i=1}^\ell \sigma_{q_{S(i)}}^2$ is the variance in allele frequencies across subpopulations (the $S(i)$ subscript) using all $\ell$ alleles at the locus, and $q_{T(i)}$ is the average allele frequency across all subpopulations.  The parameters $y_{ST}$ and $y_T$ are equivalent to $H_S$ (average expected heterozygosity across subpopulations) and $H_T$ (expected heterozygosity of average allele frequencies) respectively. 

Lets take a simple example and work though this for clarity as the estimators $H_S$ and $H_T$ can be a bit confusing at times.  Consider the case where we have three populations assayed for a single bi-allelic locus.  Allele frequencies at each of the populations are:

```{r}
p <- c(0.2, 0.3, 0.4)
q <- 1-p
```

which give an expected heterozygosity of:

```{r}
hs <- 2*p*q
hs
```

Whose average is:

```{r}
mean(hs)
```

For $H_T$, we estimate the average of the allele frequencies ($\bar{p}$ and $\bar{q}$) and then the expectation for heterozygosity ($2\bar{p}\bar{q}$) as:

```{r}
pbar <- mean(p)
qbar <- mean(q)
ht <- 2*pbar*qbar
ht
```

From these values, we can estimate $F_{ST}$ as:

\begin{aligned}
F_{ST}  & = 1 - \frac{H_S}{H_T} \\
&= 1 - \frac{0.4067}{0.42} \\
& \approx 0.317
\end{aligned}

Values of $F_{ST}$ that are close to zero indicate low levels of differentiation, whereas those close to unity represent complete differentiation (with very important caveats that follow).

### Issues with Fixation Indices

One could interpret the original configuration of $F_{ST}$ as being a ratio of variances.  The numerator is the variance in allele frequencies across sampled locations and the denominator is essentially the variance of binomial (or multinomial if $\ell > 2$).

The largest issues with this parameter is that it is NOT a measure of genetic differentiation in the sense that it tells us how different populations are (e.g., how we would use this term in the common vernacular).  In fact, Sewell Wright (1984) specifically states that

$F_{ST}$ can be interpreted as a measure of the amount of differentiation among subpopulations, relative to the limiting amount under complete fixation...

This can be seen in the following examples and the values we get for the parameter $F_{ST}$ in each.  

*Scenario 1:* Two populations, `A` and `B`, each fixed for a different allele. In this case, the heterozygosity at population `A` would be $2p_Aq_A = 2*0*1 = 0$, likewise for population `B` at $2p_Bq_B = 2*1*0 = 0$, and the estimate of subpopulation heterozygosity is $H_S = \frac{0+0}{2} = 0$.  Total heterozygosity, $H_T$, is defined using allele frequencies averaged across populations and would be $H_T = 2\bar{p}\bar{q} = 2*0.5*0.5 = 0.5$ making $F_{ST} = 1 - \frac{0}{0.5} = 1.0$.  These two populations are diverged completely from each other and this makes sense.  Subpopulation heterozygosity, $H_S$, is *always* zero when any number of populations are fixed for a single allele.

*Scenario 2:* Two populations with two alleles each, though not the same alleles.  In the most simple case lest assume population `A` is in HWE with alleles `A` and `B` occurring at equal frequencies and the other population has alleles `C` and `D` also at equal frequencies.  In this case, heterozygosity at each population would be $H_{S,A} = H_{S,B} = 0.5$ and $H_S = 0.5$ as it is the average of the expected population-level heterozygosity.  The total expected heterozygosity is the heterozygosity of allele frequencies averaged across populations, which in this simple example we have $\bar{p}_A = \bar{p}_B = \bar{p}_C = \bar{p}_D$  and $H_T = 1 - \sum_{i=1}^\ell p_i^2 = 0.75$^[When we have more than 2 alleles at a locus, it is often easier to estimate the part that is not homozygote rather than adding up for example $2pq + 2pr + 2ps + 2qr + 2qs + 2rs$ to estimate heterozygotes.]  This makes $F_{ST} = \frac{H_T - H_S}{H_T} = \frac{0.75 - 0.25}{0.75} = 0.33$.  Intuitively, this does not make much sense, why would it be a third if *both* populations are in HWE, just for different alleles?

*Scenario 3:* Three populations, the first of which is fixed for allele `A` and the rest that are fixed for allele `B`.  In this scenario, $H_S = \frac{1}{3}\left[ 2p_1q_1 + 2p_2q_2 + 2p_3q_3 \right] = 0$ but $H_T$, being defined as the expected heterozygosity of averaged allele frequencies, $\bar{p} = \frac{1+0+0}{3} = 0.33$ and $\bar{q} = \frac{0+1+1}{3} = 0.66$ would be $H_T = 2\bar{p}\bar{q} = 2*0.33*0.66 = 0.4356$ and $F_{ST} = \frac{0.4356 - 0}{0.4356} = 1$.  Again, when populations are fixed for different alleles, $F_{ST} = 1.0$.  However, in this case, two of our populations are entirely identical!  How is it that the two populations with identical allele frequencies (and whose own $F_{ST} = 0$ by the way) can cause $F_{ST}$ to go to unity?  In fact, we could have 100 populations fixed for one allele and 1 population fixed for the other and still have $F_{ST} = 1.0$!

This is because, as Wright (1984, pg 82) pointed out:

>> The fixation index is thus not a measure of degree of differentiation in the sense implied by the extreme case by absence of any common allele.  It measures differentiation within the total array in the sense of the extent to which the process of fixation has gone towards completion.

This is not how it is used in the literature, where it is often used to describe the differences among populations in an absolute sense.  This is why these are called fixation indices (the $F$ is for fixation).

For completeness, we can also estimate the inbreeding of an individual relative to the total population, 

\[
F_{IT} = \frac{H_T - H_I}{H_T}
\]

though it is not often used because individuals are inbred relative to the populations within which they are found, not within which the entire dataset is composed.  Moreover, the three parameters have the following relationship, 

\[
(1-F_{IS})(1-F_{ST}) = (1-F_{IT})
\]

which means that once $F_{IS}$ and $F_{ST}$ are estimated, $F_{IT}$ is completely defined.  

Before we jump into some data, we need to address one more issue that we are confronted with.

*Scenario 4:* Estimation of the population-level heterozygosity is not done without error.  In fact, we are estimating these parameters for each of the populations and every time we need to determine estimates of allele frequencies.  As such, if we are going to take this into consideration, we need to correct estimates of heterozygosity accordingly.

Here is an example of the differences we will see if we do not account for these problems (as well as the issue of samples sizes and sampling locations discussed in \@ref(samples-from-several-locales)).  In the arapat data set, I'm going to use the two of the mainland populations and estimate $F_{ST}$ from them for a single locus as an example to demonstrate this last issue and show the magnitude of the bias that may be introduced by not considering that each of the stratum have estimated heterozygosity with a bit of error.

```{r message=FALSE, warning=FALSE}
 library(gstudio)
data(arapat)
df <- arapat[ arapat$Population %in% c("101","102"), c(3,13)]
df <- droplevels( df )
```

Here is what that looks like (the `droplevels()` bit is to remove the non-observed populations from the stratum column in the derived `data.frame`).

```{r}
df
```

If you look at the alleles present by locus, you can see quite that the allele frequencies (above) are not that close to each other—in fact, population 101 has the 04 allele that is not present in population 102.  Numerically, they are:

```{r}
frequencies(df, stratum = "Population")
```

If we estimate heterozygosity directly for each population we see that the individual, population-level, heterozygosity, is estimated as:

```{r}
x <- c(genetic_diversity(df[df$Population=="101",],mode="He")$He, genetic_diversity(df[df$Population=="102",],mode="He")$He)
x
```

which results in an estimate of [~] of:

```{r}
hs <- mean(x)
hs
```

Similarly, the parameter [~] is:

```{r}
ht <- genetic_diversity(df,mode="He")$He
ht
```

We can estimate [~] as:

```{r}
Fst_biased <- 1 - hs/ht
Fst_biased
```

If we do take into consideration the error associated with estimating these parameters, we find a much smaller value:

```{r}
#Fst(df)
```

Notice here that [~] is estimated from averaging (the first way) 10% lower than done when considering sampling allocations.  The bias associated with [~] is much smaller but exists all the same.  Overall, the problem here is an overestimation of [~] by a factor of almost 3!  These results are exaggerated a bit because of the small size—it would be foolish to estimate [~] from only N=17 individuals from two populations.  However, it does show the importance of considering sampling allocations.  As you sample more individuals, these estimates of heterozygosity will converge.

### Additional $X_{ST}$-like Parameters

The use (and perhaps misuse) of [~] has supported the development of almost a cottage industry in other parameters, each trying to fit into a specific perceived problem in the original parameter.  Here are some other extensions of this basic parameter that you may come across:  

- $R_{ST}$ for Microsatellite loci.  This uses the ladder genetic distance metric discussed previously.  
- [~] for Nucleotides variation.  
- [~] for population subdivision (and in some cases mutation).  
- [~] for loci with high allelic diversity (along with its nemesis [~]).

Each of these parameters attempts to solve an additional problem that people have posed for [~].  The magnitude of bias associated with each may be differential, depending upon the extent to which your data are violating underlying assumptions.  For example, at highly diverse loci, the expectation for [~] is limited in that it cannot reach its theoretical maximum.  As such, Hedrick (ref) has derived the [~] parameter that corrects for this.  In general, the degree to which [~] and [~] diverge depend upon both the diversity and the evenness of allele frequencies in your populations.  This begs the question, “Is there a cutoff when I should not longer use [~] and prefer [~] instead?"  I think Hedrick would suggest to always use it the secondary one, just in case, though it may be in your best interest to examine the differences in the parameters in your own data first.

Here is an example of how these parameters may give alternative estimates.  I'll use the MP20 locus as it is a microsatellite locus of high diversity and compare the [~], [~], and [~] estimates.

```{r}
df <- arapat[, c(3,14)]
genetic_structure(df,stratum="Population",mode="Gst")
genetic_structure(df,stratum="Population",mode="Gst_prime")
genetic_structure(df,stratum="Population",mode="Dest")
```

Is the diversity at this locus influencing structure estimation?  Yes, you can clearly see that [~] alone produces a much smaller value than the other parameters.  Which of the other ones are more accurate?  That is an interpretive question you have to answer knowing your study system.

## Statistical Structure

Thus far, we've examined parameters that describe genetic structure, in relation to fixation.  These paremeters are all based upon underlying population genetic assumptions.  However, that need not be the case.  In 1984, Wier & Cockerham derived a parameter [~], which was an estimator for [~] but based upon an analysis of variance approach rather than expectations relying upon heterozygosity. This statistic is essentially a variance ratio, derived from a random-effects analysis of variance model, just like we would use in normal parametric statistics.  Later, Excoffier et al. (1992) expanded upon this approach to provide a multilocus estimate of this parameter, which they called [~] using an analysis they termed AMOVA (for Analysis of MOlecular VAriance).  In 2004, Dyer et al. showed how both of these approaches are just special cases of a more general family of Q-mode linear regression models, amenable to broad ranges of sampling designs and configurations.  In what follows, I'm going to focus on the AMOVA configuration, mostly for completeness in relation to the individual AMVOA distance described previously (reproduced on the right).  The geometry of this encoding provides an easy heuristic for quantifying genetic distances between pairs of individuals.

Taken as a whole, the distance among all N individuals can be represented as a pairwise genetic distance matrix, [~].  This matrix has the form:

[~]

which is symmetric ([~]) and has zero squared distance down the diagonal ([~]).  This can be evaluated as an Analysis of Variance by decomposing the squared distance values into the Total Sums of Squared Distances

[~], 

the Sums of Squared Distances Within strata

[~]

and the Sums of Squared Distances Among populations.

[~]

The notation is a bit odd here, but it is essentially all the squared distances between individuals found in different populations.

You can visualize these parameters in matrix form as depicted below. The pairwise distance matrix has individuals in it sorted in order of appearance within populations.  As such, [~] is the sum of all squared distances and is decomposed to the additive components of [~] (summing distances within populations), and [~] (summing values representing among stratum distances). For each of these parameters is standardized by the number of entries in that particular group.

These sums of squares are easily put into a traditional AMOVA table.  From here, it is identical to a normal ANOVA analysis.  We have degrees of freedom, mean squares, and variance components.  The AMOVA itself, due to the way we sample the individuals and populations is a random-effects model.  Our sampling typically consists of us going out and sampling a portion of the potential populations that exist rather than sampling all specified populations.  As a statistical consequence, this means that the variance within (the error term) and among (the treatment variance) have to be corrected because we are taking only a sample from all potential populations.  For a 1-level analysis, we can perform this in R by first setting up the data as a distance matrix and a vector of population assignments (it appears that the values need to be either numeric or as a factor for populations, though at the time of this writing, I could not get a factor representation to work properly).

```{r}
D <- genetic_distance(arapat, stratum="Population", mode="AMOVA")
D <- as.dist( D )
pop <- as.numeric( arapat$Population )
```

and then using the `amova()` function from the pegas library (there are many different implementations of this kind of analysis, decompose the variance into and fill out the AMOVA table as:

```{r}
 library(pegas)
fit <- amova( D ~ pop, nperm=1000 )
fit
```

This gives us the standard anova table and an estimate of the probability associated with the variance component representing among-population variance, [~].  
The estimated variance components of this model, we can estimate a structure statistic, [~], which for a 1-level analysis is

[~]

and from the output of the amova() function can be estimated as (don't know why this isn't done automagically)

```{r}
PhiST <- 1 - fit$varcoef/(fit$varcoef+fit$varcomp[2,1])
PhiST
```

Which means that roughly `r print(PhiST*100,digits=3)`% of the genetic variation observed in these data can be attributed to individuals being assigned to different populations.

## Final Thoughts on Structure

The estimation of genetic structure is a fundamental endeavor in population genetics that at times can be given a bit more weight than it may warrant.  These parameters are simply estimates of a magnitude of structure in relation to either the degree to which populations have gone to fixation or as a statistical decomposition of raw variation.  The important component here is that there are a lot of ways that a set of population histories may result in the same amount of genetic variation.  The important point here is that simply looking at the magnitude of variation among strata does not allow us to differentiate among alternative demographic histories.  


<!--chapter:end:genetic_structure.rmd-->

# Population Assignment  {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_salamander.jpg"></div>

With population subdivision, individual population allele spectra may diverge if there is not panmictic connectivity.  Along with this differentiation, it is possible to begin estimating the origin of individual samples.  We can also use models of admixture to understand how both individuals and populations are connected

The easiest way to assign individuals to particular populations is to estimate the probability of each multi locus genotype coming from each population and then compare the relative likelihoods of each, assigning the individual to either the most likely population or at least assigning it, with probability equal to its likelihood, to two or more populations.  The probability of a particular genotype coming from a specific population is dependent upon the allele frequencies at that population as well as any deviations that population may have from Hardy-Weinberg Equilibrium.  Specifically, these expectations from the chapter on inbreeding are:

$$E[AA] = p2(1-F)+pF$$

$$E[AB] = 2pq(1-F)$$

and

$$E[BB] = q2(1-F)+qF$$

Across loci, the probability of a multi-locus genotype ($X$) is multiplicative across all $L$ loci.

\[
P(X|F) = \prod_{i=1}^L P_{x_i}
\]

As a simple example, consider the following code, where we make three populations differing in allele frequencies at a single locus.  

```{r}
freqs <- data.frame( Stratum=rep(c("Population A","Population B","Population C"),each=2), Locus="Locus1")
freqs$Allele <- rep( c("A","B"), times=3)
freqs$Frequency <- c( 0.1, 0.9, 0.4,0.6, 0.7, 0.3)
freqs
```

The structure of this data.frame is exactly like what you would get if you were passing a real dataset to the `frequencies()` function.  Next, the three possible genotypes are created on which we will estimate the assignment probabilities.

```{r message=FALSE, warning=FALSE}
library(gstudio)
loci <- c( locus(c("A","A")), locus(c("A","B")), locus(c("B","B")) )
individuals <- data.frame( ID=1:3, Locus1=loci)
individuals
```

Before we jump in, it should be noted that we are going to estimate the posterior log likelihood of assignment for each genotype.  What this means is that we are going to estimate the probability the genotype came from each population.  These probabilities may be large or small depending upon the allele frequency spectra of each population.  Some can even be zero, for example, if the population does not have a copy of a particular allele.  However, we are interested in understanding the relative likelihood of assignment for each genotype, not the raw probability itself.  In forensic genetics, the multi locus probability of anyone genotype is exceedingly small (by design, that is how they 'get the perp'), but it is the relative likelihood of each multi locus probability that is most important.  

As an example, consider the case where we have 99 populations that have alleles $A$ and $B$ and 1 population with alleles $A$, $B$, and $C$.  In this last population, the frequency of the $C$ allele is $p_C = 0.001$.  A $CC$ homozygote individual would have zero probability of occurring in the 99 populations (they just don't have that allele) and probability of $(0.001)^2 = 1e-6$ in the last population.  Even thought his is a small probability, it is infinitely more probable than 0!  So our likelihood estimate is based upon the probability of assignment to a particular population, $P_i$, scaled by the sum of all potential populations it could be assigned to.  This is estimated as:

\[
\lambda_i = \frac{P_i}{\sum_{j=1}^K P_j}
\]

where $P_x$ is the multi locus probability of assigning to each population.  As such, our $CC$ genotype would have a $\lambda_i = 0\; (i \in 1, 2, \ldots, 99)$ but a $\lambda_{100} = 1.0$ for the last population.  This is the best-case scenario---the assignment process has eliminated all populations except for one.

In R, multi locus probabilities and posterior assignments are made using the `multilocus_assignment()` function from gstudio.  It needs to have an individual that has at least one locus objects in it, typically as a row from a `data.frame` holding all your data, and a frequency matrix as outline above.  Here are examples for the three potential genotypes and the three example populations.


```{r}
multilocus_assignment(individuals[1,],freqs)
```

The $AA$ genotype has a probability of $p^2 = 0.7^2 = 0.49$ of being from stratum $C$ (as expected under Hardy-Equilibrium).  However, in comparison to all populations, the posterior likelihood of the assignment is $\lambda_C = 0.74$.  Biologically, this helps to provide confidence in the assignment by comparing this result to the likelihood of being assigned to $A$ or $B$.  The heterozygote genotype is more likely to be from the stratum with intermediate allele frequencies, $B$, but is not too terribly more likely than being in the $C$ population as well.


```{r}
multilocus_assignment(individuals[2,],freqs)
```

Here is the other homozygote for completeness, having the greatest (by over double) likelihood of being from $A$ than $B$.

```{r}
multilocus_assignment(individuals[3,],freqs)
```

Increasing the likelihood of assignment can be achieved by adding more loci to the analysis, using loci with more alleles, and/or using loci whose allele frequencies are closer to each other (e.g., for a locus with $\ell$ alleles, the most power is achieved if the frequency of each is roughly $\ell^{-1}$).

If you have information on the inbreeding status, $F$, of the dataset, you can include that as an optional parameter in the `mutlilocus_assignment()` function.

<!--chapter:end:population_assignment.rmd-->

# Admixture {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_palm.jpg"></div>

Admixture is the mixing of two or more populations whose frequency spectra may differ.  If they differ considerably, then the admixed population may be easily examined to determine relative contributions of each population.  In a previous chapter, we covered a phenomenon called the Wahlund Effect (Chapter \@ref(wahlund-effects)).  In that situation, we inadvertently mixed two populations, both of which were in Hardy-Weinberg Equilibrium, creating a new synthetic population consisting of individuals from two functionally separate populations. If this synthetic group were to mix, their offspring would contain mixtures of chromosomes from both ancestral populations, as depicted in Figure \@ref(fig:admixture).

```{r admixture, echo=FALSE, fig.cap="Hypothetical pedigree of an individual whose parents come from two separate populations.  Stretches of chromosome are depicted in differing colors to highlight parentage as well as demonstrate admixture in the offspring."}
knitr::include_graphics("media/Admixture.png")
```


By far, the most work on admixture has been done in human populations.  A recent paper by Hellenthall *et al.* (2014) introduced an online [Genetic Atlas of Human Admixture History](http://admixturemap.paintmychromosomes.com/) whereby you can examine how the extent to which populations at particular geographic locations are admixed by various populations and the date (or at least a range of dates with a confidence interval around them) are provided for when admixture had occurred.  You should go play with this Atlas, it is worth the time to see how it works and what kinds of inferences can be gained from it.  In our preferred data set, arapat, we can see that there is the potential for quite a bit of differences in allele frequencies between the large regional 'Species' groups.  

An ubiquitous feature of modern population genetic analyses is the presence of a model-based clustering analysis such as those provided by programs such as STRUCTURE and TESS.  Here, they assume a particular population genetic model and estimate admixture and its inverse—the number of groups present in your data that are not admixed.  If you have a reason to believe that that there are divisions in your data that separate groups of samples into discrete partitions, this is a great way to evaluate underlying statistical support based upon your observed data.  

The general idea here is the current distribution of genotypes ($X$) within individuals and populations is due to historical mixing of populations ($Z$), at specified mixing rates ($Q$), whose allele frequency spectra ($P$) may differ.  

Problematically, we can write this as:

\[
P(X|Z,P,Q) = \prod_{i=1}^N\prod_{j=1}^LP_{z_{ij},\;p_j,\;x_{ij}}
\]

Estimating the particular values $Z$, $P$, and $Q$ (realized for each of the $N$ individuals across $L$ loci), has to be done numerically. There are an infinite number of combinations for these parameters that have an exceedingly low probability of producing a set of genotypes like we see in our data sets.  However, it is possible to have a set of parameters that produced data much like what we see.  If we tentatively accept these parameters as being representative of the admixture history of our data, we can then quantify the degree of admixture in our own individuals to these hypothetical historical populations.  This assignment process can provide some valuable insights to partitioning in the data, presumably due to ongoing or historical processes, particularly if the consequence of these processes result in defined ‘edges' or partitions in your data.  If you data are distributed along a gradient, these approaches are not quite as powerful.

In this example, I'll use the more common software STRUCTURE to demonstrate how this technique is used.  There is an R interface to this software that you can use to run batch jobs of these analyses which I'll show at the end.  There are other programs available but this is the most simple implementation. 

To use STRUCTURE, you need to reformat your data. Population designations cannot be anything other than numeric values, alleles have to be encoded in a particular way, files need to be formatted in a particular way, etc.  Total pain.  Fortunately, we can do all this as:

```{r eval=FALSE}
write_population(arapat,file="arapat_structure.txt",mode="structure", stratum="Population")
```

which will take your R data.frame with loci and other information in it and save it as a text file in the appropriate directory in the STRUCTURE format.  

You run this program using either the command-line approach or via a GUI that comes with it.  Most people will prefer the GUI.  After starting it, do the following:

Create a folder for the project. For some reason the Java GUI that STRUCTURE uses disallows the creation of new folders on OSX (don't know if it does the same on windows). Put your data file (the `arapat_structure.txt` file) in that folder. 

Create a new project in STRUCTURE using $File \to New\;Project$:

Walk through the setup pages. The salient numerical information you will need is:  
- Number of Individuals: 363 
- Number of loci: 8
- Missing data value: -9

You should also check the following boxes:  
- Individual ID for each individual 
- Putative population origin for each individual

This should have you set up and ready to go. For an initial run through the data, you need to specify some parameters for the simulations. You do this by selecting $Parameter\;Set \to New$. 

For demonstration purposes, I'll run it kind of low since we do not want to wait around all day. The default values on the parameters are a good place to start, though you have to put in the Length of the burn in and the number of reps; 100,000 & 10,000 should get you started so that you see how the program runs. You can name this parameter set whatever you like.

To make an actual run, you need to hit the run button (it has the exclamation mark on it). It will ask you how many groups are you going to run, input 3 and let it go. It will chug along for a while, dumping out some output on the bottom of the interface. When finished it will give you an output of the run showing the parameter set. 

On the left, select the Results folder and then the $K=3$ option to see the specifics of that particular run.  Salient information on the output includes:

- The number of individuals per cluster before analyses (by population assignment),  
- The allele frequency differences per cluster,  
- The probability of the data give K=3, and  
- The allele frequencies of each cluster

Perhaps more interpretive are the plots of individuals as they were assigned to each group.  You can visualize this by selecting the Bar Plot menu at the top of the Simulation Results pane. 

If you select it to Sort by $Q$ it should look something like Figure below.

This figure is based upon 363 columns of admixture information, one column for each individual.  The colors are defined by the number of groups, here $K=3$.  You can see some individual columns (=individuals genotypes) who are entirely one color.  These are individuals whose genotypes suggest they are the most likely from that colored group.  You can also see that there are some individuals who may be admixed between two or even three groups ad indicated by a column with more than one color.

```{r structureByQ, echo=FALSE, fig.cap="Output from STRUCTURE using the *Araptus attenuatus* when setting $K=3$."}
knitr::include_graphics("media/structure_by_q.png")
```

This output is a hypothesis about admixture.  As such, it is only one simulation and as we've done many times thus far, we should probably run several of these to generate a level of confidence for any value we specified as $K$.  Here is where it gets a little dicey.  We specified $K=3$ and as such we found out what the $P(X|Z,P,Q)$ by specifying $K=3$ implicitly.  In the output of our data, we see can find the log likelihood of our data given these parameters.  However, $K$ may be some value other than three.  Running this simulation with specified values only tells us the likelihood of the data for that value it does not determine if the specified $K$ is the correct one.  

Inferences on which $K$ is actually correct can only be made by running the program for several iterations for each value of $K$ (to understand the variability in the simulations for that particular value) and running several different values for $K$ itself so we can compare the probability of the data we observed for different numbers of clusters.

The data below depict some runs of the data for $K$ assuming values ranging from 2 to 6.  These are the raw values for the probability of the data for specific values of $K$.  If you are doing this on your data for real, you should do more than three runs, but this provides a base approach for understanding the output.


```{r message=FALSE, warning=FALSE, echo=FALSE, fig.cap="Probability of the observed data for the numbers of clusters ($K$) ranging from 2 to 6 using STRUCTURE.  A total of 10 replicates were run for each level of $K$."}
library(ggplot2)
K <- c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6)
LData <- c( -4885.4, -4882.3,-4892.8, 
            -4256.6,-4261.3,-4367.9, 
            -3659.7, -3657.1, -3660.4, 
            -3605.2, -3594.0,-3607.3,
            -3308.7,-3492.7,-3493.7)
df <- data.frame( K=factor(K), Prob=LData )
ggplot(df, aes(x=K,y=Prob)) + geom_boxplot() + xlab("Number of Clusters (K)") + ylab("Ln Probability of the Data")
```

The output shows that as we assume more clusters, the $log(P(Data))$ tends towards an asymptote.  There are a couple things to look at here for this output, and this is where the interpretive power of you the researcher needs to step in. The overall notion among many is that the way in which the probability of the data changes with an increasing number of clusters should be informative as to the ‘correct' number of clusters found. This may or may not be true depending upon your data, your sampling, and the actual history of your organism. However, if it were true then the rationale suggests that when the probability of the data levels off, that may be a good place to look at other sources of inference to see if this may be supported as a plausible number of clusters. In the Figure, this looks like $K=4$ may be a good place to start.

If I go back to STRUCTURE and plot examine the barplot of individuals from the $K=4$ data (setting "Group by PopID"), you see something like Figure \@ref(fig:structureByPop).

```{r structureByPop, echo=FALSE, fig.cap="STRUCTURE output for *Araptus attenuatus* for $K=4$ sorted by population."}
knitr::include_graphics("media/structure_by_pop.png")
```


Here they are sorted into populations and colored by group (green, yellow, blue, red).  From this display, we can make a few inferences: 

There is a pretty good indication that some of the ‘populations' that I sampled appear to be filled with individuals of a single definite type.

If I look at the groupings as it is presented, the 1^^st^^, 2^^nd^^, and 21^^st^^ populations mainly consist of a single type (the green type). If I look at the original data set that we put out, these correspond to populations 101, 102, and 32. These are the three mainland populations that have consistently been found to be different.

The rest are populations in Peninsular Baja.  As we saw in the hierarchical clustering example, these populations may be further subdivided into different groups, Cape vs. the rest and perhaps more nested structure therein.  In actuality, if you pull out the mtDNA types for Peninsula only and run STRUCTURE on it, you will find there is most likely a three separate groupings in the data (this is where the Cluster column in the arapat data set comes from).

Determining the proper $K$ is not a trivial thing.  From the output of estimates of $P(X|Z,P,Q)$ given a particular $K$, we can examine how our data probability changes with different values of $K$.  We can also examine the allocation of individuals into groups and see if it makes sense, biologically.  

One way that is commonly used to get an idea of the magnitude of $K$ is by looking at how $P(X|Z,P,Q)$ changes with increasing values of $K$.  This is often referred to as the $\delta K$ approach.  There is an argument to be made that the best assignment of $K$ would be where there is a large change in $\delta K$ followed by a plateau.  This is interpretive and not a definite heuristic.  

```{r deltaK, echo=FALSE, fig.cap="Estimates of $\\delta K$ for the STRUCTURE output run on *Araptus attenuatus*."}
mean.K <- as.numeric(by( df$Prob, df$K, mean))
delta.K <- mean.K[2:5] - mean.K[1:4]
df <- data.frame( K=3:6, deltaK=delta.K)
ggplot(df,aes(x=K,y=deltaK)) + geom_line() + geom_point() + xlab("Number of Clusters (K)") + ylab("Change in Ln Probability of the Data")
```

What we see is that the largest value of $\delta K$ (Figure \@ref(fig:deltaK)) followed by a steep reduction is when $K=4$.  This may indicate to us that as it stands, the totality of the data contains four groups and we can map them back onto our data set and evaluate the biological support for these groupings.

The arapat data set itself has a column in it for Cluster, which is derived from an approach such as this.  What we found is that the Cape & Mainland populations were definitely separated from the Peninsular ones.  However, when we examined only the peninsular populations, we found roughly three groups contained (nested) within that main group.  One of those groups was so prominent as to be able to be identified in this analysis causing $K=4$ instead of $K=3$ (Mainland, Cape, Peninsula).  However, the distinctness of the Cape and Mainland groups hid the further sub-structuring in the Peninsula populations.  This is an example of how you should perhaps not just grab the first result you get from any analysis, but make sure the totality of your analyses are considered before jumping to biological conclusions

<!--chapter:end:admixture.rmd-->

# (PART) Statistical Models {-}

<!--chapter:end:part_statistical_models.rmd-->

# Ordination of Genetic Data {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_vernal_pool.jpg"></div>

There are several ways that we can look for natural groupings in our data.  Barriers and sources of vicariance across the landscape create discontinuities in the genetic structure.  As we begin to analyze the spatial structure of the data itself, one of the first things that needs to be done is to allow the data to tell the analyst how it is internally structured.  This is a deviation from common approaches in statistical analyses.  Typically, we start with a model and then see the extent to which the data fit into a model framework, potentially testing many different models.  However, just because you get a significant model does not mean it is the best or even the correct one describing the underlying processes.  In Dyer & Nason (2004), we examined the spatial genetic structure of Lophocereus schottii (Cactaceae) and found that while the Sea of Cortéz has acted to separate mainland populations from those on peninsular Baja California, there is a particular population at the very northern extent of the range that is really a peninsular population even though it is spatially on the mainland.  The differences, in aggregate, between peninsula and mainland populations are so large that even if we allocated an individual stratum into the wrong region, we still were able to derive a significant model.  This is where model-free approaches are most powerful, they let the data tell us how it is distributed as opposed to asking if there is enough structure to fit into some pre-conceived model.

A popular way to examine how individuals and populations are configured is to use an ordination approach or apply a multivariate rotation on the data.  In what follows, I'll describe a principal components analysis (though it is not really an analysis, it is more of a projection).  These approaches are common in multivariate statistics and have very little to do with population genetic processes or assumptions.  Fortunately, there are several approaches amenable to transforming population genetic data into a form that is usable in normal multivariate statistics, the one used most often is discretizing the data.  To demonstrate, consider the case where we have four individuals genotyped for a single locus as depicted below.

```{r ordination1, message=FALSE, warning=FALSE}
library(gstudio)
x <- c( locus(c(1,2)), locus( c(1,1) ), locus(c(1,3)), locus(c(3,3)))
x
```

We can convert these data into a matrix of data, where each row in the matrix represents an individual and each column determines the frequency of alleles in that individuals genotype.  In this case, the resulting matrix should have four rows (one for each individual) and three columns (one for each of the three alleles present in the whole data).  This conversion is done using the function `to_mv()`.


```{r}
X <- to_mv(x)
X
```

When using multiple loci, we can essentially bind onto this matrix additional columns of predictor variables.  This matrix can be used as an input to most multivariate analyses.  Depending upon the analysis being used, you may need to modify it a bit.  Many statistical approaches require that you invert the covariance matrix.  For example, the regression coefficients are found as:

\[
\beta = (X'X)^{-1}X'Y
\]

where the $X$ matrix is the predictor data and the $Y$ is the response variables.  That $(\cdot)^{-1}$ exponent represents inversion of the matrix resulting from the multiplication.  The problem with the above [~] matrix derived from the genotypes is that the rows are all linearly related in that they all sum to unity.

```{r}
rowSums(X)
```

which is a problem for matrix inversion as it throws a zero into the mix and dividing by zero is still a problematic endeavor.  To fix this, we need to drop a column from each locus we use to estimate.  This is an optional flag in the `to_mv()` function.

```{r}
X <- to_mv(x,drop.allele = TRUE)
X
```

Once we have the proper data, we can do a principal component (PC) rotation.  A PC rotation is one that takes the original columns of data and performs a rotation on the values to align onto new ‘synthetic' axes.  Consider the example in the next figure.  Here, some bivariate data is plot in 2-space, though this can be done for much higher dimensions of data as well—in fact it is more beneficial with more columns of data and this can be used as a way of reducing the dimensionality of the data while loosing very little (or no) content (more later on this).  The axes of a PC rotation are taken as linear combinations of the existing axes and define a new coordinate set onto which the points are plot.  All points are rigidly constrained to keep the same relationship and there is no loss of information.  The PC axes are defined by determining the most variable stretch through the data. In the figure on the next page, we see the raw data plot onto the X- and Y-axes.  The axis of highest variance does not align with either of the original ones, and instead can be defined as a combination of both X- and Y- coordinates.  If we take the blue axis as the first PC axis, the coordinate of the points would be taken along that new synthetic axis.  The next PC axis is defined as being perpendicular to the previous one(s) and is identified as covering the largest variance in the data as before.  This process continues until there are no more axes.  In our case, the second axis would be at a right angle from the blue line (above).  You can, at maximum, have as many PC axes as there are columns of data.  However, the later axes may not explain any significant chunk of the underlying data, the process of rotating based upon axes of maximal variation may be able to capture the complete dataset with fewer axes than the total set.  This is where a technique like this may be helpful in reducing the dimensionality of the data.

To perform this on genetic data, we first translate the raw genotypes into multivariate data, dropping an allele from each locus.

```{r}
data(arapat)
x <- to_mv(arapat, drop.allele = TRUE)
```

The full arapat data, with 363 individuals has 50 independent allele columns to it, after dropping one for each locus. Passing this to the princomp() function we get the results.


```{r}
fit.pca <- princomp(x, cor = TRUE)
```


Here are the first 8 (out of 50 potential) axes for the arapat data set.

```{r}
summary(fit.pca)
```

This output has two important components to it.  First, it shows the axes, in decreasing order of importance and how much of the total variation they describe.  The first Comp.1 axis explains 14.2% of the variance, the second explains 9.5%, etc.  Second, it shows the cumulative proportion of the variation explained.  From the 50 axes we started with, we can explain 49% of the variance by using just the first eight PC axes.

Where this becomes meaningful for us is in how we can project our original data onto these new coordinate locations and look at the distribution to see if there are any obvious trends, partitions, gradients, etc.  

```{r warning=FALSE, message=FALSE}
 library(ggplot2)
pred <- predict(fit.pca)
df <- data.frame(PC1 = pred[, 1], PC2 = pred[, 2])
df$Species <- arapat$Species
df$Clade <- arapat$Cluster
df$Pop = arapat$Population
ggplot(df) + geom_point(aes(x = PC1, y = PC2, shape = Species, color = Clade), size = 3, alpha = 0.75)
```

We can see from the plot (I've added some designations to the points) that the 363 samples are clustered in an obvious way.  The designation of ‘Species' as depicted by the shape of the points, is defined by the mtDNA clade for each individual, independent of the nuclear marker data we are using here.  Still, it shows a broad separation between the Cape, Mainland, and Peninsula groups.  The colors of the points found within the Peninsula group, come more formal clustering approaches, as defined in the next two sections.
 

## Hierarchical Clustering

In the previous section, we defined a new coordinate space for all individuals in the dataset.  The rotation of the 50 allele encoding columns was able to describe over 95% of the observed variation using only the first 34 PC axes.  In this section, we are going to use the rotated coordinates to evaluate population-level differences using a hierarchical clustering method.  Hierarchical clustering are very helpful in understanding groupings in the data, particularly if there is a ‘nesting' structure.  While there are many ways to do it, they all generally proceed as follows:  
1. Define a numeric metric that measured the distances between all K groups.  
2. Find the two groups that have the smallest distance and coalesce them together into a pair.  
3. Assume that the coalesced pair now constitutes a single entity, estimate the numeric metric among all K-1 groups in the data set.  
4. Go to #2 and repeat until you have coalesced all the groups together.

Here again, it is the data that is telling us how it is structured rather than us imposing a model onto the data to see if it fits.

To do this, the rotated coordinates are used to define the centroid of each population.  Here I use the tapply() function as a short-cut to estimate the mean of each population.  If you are not familiar with this approach, it essentially applies a particular function (in this case taking the mean), to a set of data  in a matrix (the predicted coordinates from the PCA) based upon a set of factor levels (the populations).  This is quite an efficient way to do what would take you a bit of looping to get through.

```{r}
p <- ncol(pred)
pops <- arapat$Population
pop.means <- tapply( pred, list(rep(pops,p),col(pred)),mean)
dim(pop.means)
```

The result is a matrix where each row represents a population and each column represents the mean location of all individuals for each of the 50 PCA axes variables.  These 50-dimensional coordinates can be used to define a pairwise distance metric using the dist() function.  By default, this measures the euclidean distance (e.g., straight-line distance) between each populations 50-dimensional coordinate.

```{r}
pop_pw <- dist(pop.means)
h <- hclust( pop_pw)
plot(h)
```

The plot of this shows the nesting structure of the populations as depicted in the PC-derived data.  

There are some interesting things to notice here.   
- The main separation between mainland populations (32, 101, 102) and peninsular populations (the rest) seems to be the largest difference.  As indicated in the mtDNA (not used for this analysis), there seems to be a clear division between those two groups, consistent with the idea that the Sea of Cortéz has acted as a source of vicariance.
- There seems to be a deep division within the peninsular populations separating out the group of populations including 98, Mat, 157, 73, 75, Aqu, ESan, 156, and 48.  If we look at where these populations are found, we see that they are all located in the southern portion of Baja California—the Cape groupings...
- Then there is the rest of the populations, some of which appear to be partitioned into at least two groups, though there may be more.

This approach is pretty good at pulling apart components within the data set that represent different genetic clusters.

## Interactive Tree Display

Another network approach that could be used is defining bifurcating networks.  Here we will examine the *A. attenuatus* data set using Nei's genetic distance metric.

\[
D_{nei} = -\ln( L )
\]

where

\[
L = \frac{\sum_l\sum_u x_uy_u}{\sqrt{(\sum_l\sum_u x_u^2)(\sum_l\sum_u y_u^2)}}
\]

I make the among stratum distance matrix and then turn it into a `dist` object to do a hierarchal clustering on the populations. I then plot it and make the node colors equal to the grouping of the dendrogram *if* there were 4 groups in the dataset.  This is also zoomable, for large trees, so if you want to figure out the relationships at the leaves, where the internal edges are smaller, zoom in with the mouse.

```{r message=FALSE, error=FALSE, warning=FALSE}
 library(networkD3)
d <- genetic_distance(arapat,stratum = "Population", mode="Nei")
d <- as.dist( d )
hc <- hclust(d)
dendroNetwork(hc, height=600, 
              zoom=TRUE, 
              textColour = c("red","green","orange","blue")[ cutree(hc,4)])
```


<!--chapter:end:ordination.rmd-->

# Distance Matrices {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_araptus.jpg"></div>

> It is often common in population genetics to estimate pairwise distance matrices representing difference in quantities such as spatial location, phenological overlap, genetic differentiation, ecological separation.  Dealing with distances is fraught with problems and continues to be an area of active research.  This chapter focuses on how we can create, manipulate, and analyze distance matrices.


In this section, we examined several different ways in which we can estimate pairwise genetic distances.  In this section, we examine ways in which we can gain some statistical confidence in the way that we can compare the components of genetic matrices.  

In the discussion below, we will be comparing estimates of population-level genetic distance as determined using both Nei’s genetic distance and Euclidean based upon allele frequencies.

```{r}
library(gstudio)
data(arapat)
D1 <- genetic_distance(arapat,stratum = "Population",mode = "Nei")
D2 <- genetic_distance(arapat,stratum = "Population",mode = "Euclidean")
```

## Visual Comparisons

At a first pass, the most parsimonious approach to understanding the relationship between elements of two matrices would be to plot them.  To visualize, we only need the lower (or upper) triangle of each matrix since they are symmetric.


```{r}
library(ggplot2)
df <- data.frame( Nei=D1[lower.tri(D1)], Euclidean=D2[ lower.tri(D2)])
summary(df)
ggplot(df,aes(x=Nei,y=Euclidean)) + geom_point()
```

From this, we can see that both methods produce a set of distances that are largely similar, though not linear.  Remember that Nei’s distance is linear in time only in relation to mutation drift equilibrium.

So lets expand on this output a bit.  As we’ve seen in this dataset, there are at least three different groups in the data that have been designated as different ‘species’ using mtDNA (see Garrick *et al.* 2013 for more on this).  It would be helpful if the way in which we plot these population comparisons would indicate if the comparisons were among the same species groups or among different species groups.  Are the larger estimates of genetic distance more likely between these species groups?  

To do this, we need to collapse the values in arapat$Species by population.  

```{r}
assignments <- by( arapat$Species, arapat$Population, unique)
```

This gives a list of values for each population of the unique entries from `arapat$Species`.  Some populations are completely one type

```{r}
assignments[1]
```

whereas others are mixed between two types of species

```{r}
assignments[6]
```

To capture the different combinations of groupings, we can define a small function that takes the species designations and concatenates them together—instead of having a 2-element vector of "Cape" and "Peninsula" for population 157, we have a "Cape/Peninsula" designation.  To prevent "Cape/Peninsula" and "Peninsula/Cape" from being different just due to the order in which they are found in the data.frame, I sort them first.

```{r}
foo <- function( x ) { return( paste( sort( unique(x) ) , collapse="/") ) }
assignments <- by( arapat$Species, arapat$Population, foo)
assignments[1]
assignments[6]
```

This gives us a vector of species composition equal in length to the number of populations in the data set.  However, the comparisons we are making are not based on the number of populations, they are based upon the pair-wise comparison of populations.  In these data, we have 39 populations but the scatter plot of pairwise comparisons in genetic distance has $\frac{39(39-1)}{2} = 741$ points on it.  As a result, we need to take this vector of assignments and create a pairwise comparison.  We can do this using the `outer()` function, which allows us to take two vectors of equal length and perform some operation on them to produce a square matrix.  This is exactly how vector multiplication works, only instead of multiplying vectors of numbers, we can tell it to paste together the assignments.

```{r}
assignments <- as.character( assignments) 
C <- outer( assignments, assignments, FUN=paste)
C[1:3,1:3]
```

Now we can plot it as normal.

```{r}
df$Species <- C[ lower.tri(C) ]
ggplot(df,aes(x=Nei,y=Euclidean, color=Species)) + geom_point()
```

which produces a somewhat confusing melange of output.
Perhaps all potential combinations provide a little too much information to see overall trends.  Perhaps it is sufficient to partition the points into comparisons among populations that are of the same composition from those that are comparing different species compositions.

```{r fig.cap="Pairwise genetic distance among populations of Arapat attenuatus partitioned by species composition within each population."}
df$Status <- df$Species
df$Status[ df$Species == "Cape Cape"] <- "Same"
df$Status[ df$Species == "Mainland Mainland"] <- "Same"
df$Status[ df$Species == "Peninsula Peninsula"] <- "Same"
df$Status[ df$Status != "Same" ] <- "Mixed"
ggplot(df,aes(x=Nei,y=Euclidean, color=Status)) + geom_point() + facet_grid(Status~.)
```

Given the density of the plots and the relative numbers of them (there are more Mixed than Same) plotting the data like this may not reveal all the underlying trends.  Another way to augment this graphical output would be to add marginal density distributions to the plot.  To do this, we can create a grid of plots using the *ggExtra* `grid.arrange()` function.  To do this, we need to create a few different plots and then put them together into a single output.   A single scatter plot will have both types of points in it.

```{r}
scatter <- ggplot(df,aes(x=Nei,y=Euclidean, color=Status)) + geom_point() + 
  theme(legend.position=c(1,1),legend.justification=c(1,1)) 
```

Then we need to create the density for Nei’s distance to be plot on top of the scatter plot

```{r}
plot_top <- ggplot( df, aes(x=Nei,fill=Status)) + geom_density(alpha=0.5) + theme(legend.position="none")
```

and the corresponding one for the Euclidean distance (notice I’m flipping the coordinates here because I want the density to be along the right side of the graph to mirror the y-axis.

```{r}
plot_side <- ggplot( df, aes(x=Euclidean, fill=Status)) + geom_density(alpha=0.5) + coord_flip() + theme(legend.position="none")
```

I also need an empty plot to since we are going to make a 2x2 grid of plots but the upper right one will have nothing in it (the popgraph library has a nice invisible theme called `theme_empty()`).

```{r message=FALSE}
 library(popgraph)
empty <- ggplot() + theme_empty()
```

Finally, we can stitch this all together and produce a single plot with these four subcomponents.


```{r message=FALSE, warning=FALSE, fig.cap="Pairwise genetic distance measured using Nei's distance and Euclidean distance based upon allele frequencies for populations of Araptus attenuatus.  Each comparison is classified as being between the same species group or between mixed groups and marginal probability densities are shown for each measure."}
 library(gridExtra)
grid.arrange(plot_top, empty, scatter, plot_side, ncol=2, nrow=2, widths=c(4,1), heights=c(1,4))
```

## The Mantel Test

By far, the most common approach for estimating the relative similarity in two matrices is through the use of a Mantel Test.  This test was originally developed, by Nathan Mantel, to estimate the correlation between two matrices, $\bfseries{X}$ and $\bfseries{Y}$.  The idea here is that if there is a positive association between the elements in each matrix, then the large values in $\bfseries{X}$ will be matched up with large values in $\bfseries{Y}$ and the small values in $\bfseries{X}$ will be matched up with the small values in $\bfseries{Y}$.  Conversely, if there is a negative relationship between the two, the large values in $\bfseries{X}$ are matched with the small values in $\bfseries{Y}$ and vice-versa.  As such, we estimate the sum of the multiplication between each of the corresponding elements in these matrices (assuming they are both square and have $N$ rows and columns) as:

\[
Z = \sum_{i=1}^N\sum_{j=1}^N x_{ij}y_{ij}
\]

The expected value of $Z$ is unknown since the forces that shape the relationship between these two matrices is largely unknown.  As a consequence, we cannot look up a value of $Z$ to determine its probability.  However, we can use our old friend permutation to set up some simulations that may help out understanding if the estimated value of $Z$ is small or large, given potential values it could take on with these two matrices.  So what we do is take one of the matrices and shuffle the rows and columns and estimate permuted values of [~] and then compare our observed estimate to these permuted values.  Again, like we did in the section on rarefaction, we take the null hypothesis, $H_O:$ *There is no relationship between elements of these two matrices*, and assume it is true.  If it is in fact true, then any permutation of one of the matrices should produce values of $\tilde{Z}$ as large in magnitude as any other permutation.  It is only where we have significant correlations, either positive or negative, between the matrices that we have values of $\tilde{Z}$ very large or very small.  Therefore, we should be able to define a level of confidence in the magnitude of our observed $Z$ by defining a large set of permuted $\tilde{Z}$ values and comparing them.  This is exactly how it is done^[And coincidently, how much of the statistical testing is done in population genetics since we have no idea what the underlying distributions are expected to be.].  

In a paper by Smouse *et al*. (1986), they showed that this approach is basically just another way of doing a partial regression analysis.  The importance of this observation is that while a particular value of $Z$ may be examined for significance, it is a scaleless entity and varies with the magnitude of the values in the individual matrices.  As such, it is difficult to interpret.  Is $Z=116$ a large value?  However, if interpreted in a regression context, we can estimate scaled parameters such as the partial correlation coefficients.  The math is pretty easy here.  The sums of squares for both $\bfseries{X}$ and $\bfseries{X}$ are found as:

\[
SS_X = \sum_{i=1}^N\sum_{j=1}^N(x_{ij} - \bar{x})^2
\]

and 

\[
SS_X = \sum_{i=1}^N\sum_{j=1}^N(x_{ij} - \bar{x})^2
\]

where $\bar{x} = \frac{\sum_{i=1}^N\sum_{j=1}^N x_{ij}}{N(N-1)}$ and $\bar{y} = \frac{\sum_{i=1}^N\sum_{j=1}^N y_{ij}}{N(N-1)}$, and the sums of cross products are defined as:

\[
SS_{XY} = Z - N\bar{x}\bar{y}
\]

This means that the correlation coefficient is 

\[
\rho = \frac{SS_{XY}}{\sqrt{SS_X}\sqrt{SS_Y}}
\]

which is essentially the Pearson product moment correlation coefficient!  How convenient.  They go on to extend this work to show how if there are more than one ‘predictor’ matrix here, we can also derive the partial correlation coefficients, $\rho_{x_1|x_2}$ (e.g., the coorelation of $x_1$ after fixing the effects of $x_2$) and other useful components.  In R, we have this kind of work done for us rather easily.  There are several packages that provide a Mantel test, we will use the one out of the vegan package. 

In this example, the distance matrices must first be translated into dist objects (a type of symmetrical matrix).

```{r}
 library(vegan)
mantel( as.dist(D1), as.dist(D2), permutations=9999 )
```

From this output, we see that the two distance matrices are highly correlated.  The question that I’ll leave the reader to look at is one based upon the shape of all those pairwise graphs.  Is this a linear relationship?  Is it ok to use a Pearson correlation on these data?  

Recent work by Legendre & Fortin (2010) has suggested that in some situations the use of a Mantel test results in bias assocaited with ascertaining probabilities.  

<!--chapter:end:distance_matrices.rmd-->

# Linear Models {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_milkweed_bug.jpg"></div>

If you boil it down to the base essence of population genetic analyses, we can define almost all of our analyses by the following fundamental equation.  Ultimately, a set of genetic data ($G$) is crated and we are attempting to describe patterns or variation in this data by fitting (via some function $f()$) to it one ore more independent variables ($E$).  

\[
G = f(E) 
\]

These independent variables may be:  

- *Categorical:* Populations, sex, habitat type, etc. This is the most common type of external variable addressed by thousands of hypotheses such as, $H_O:F_{ST}=0$.  More complicated examples may include hierarchical structure such as individuals nested within locales, locales contained within regions, etc.
- *Continuous:* Elevation, Latitude, Longitude, etc.  Here we may be interested in 
- *Mixtures:* Analysis of covariance, or the mixing of categorical and continuous data, has not been used as often as analyses of one type.  The Stepwise Analysis of Molecular Variance (StAMVA) from Dyer *et al.* (2004) provided one methodology for mixing and testing mixed models like this.

We are implicitly assuming that some underlying function, $f()$, that takes the set of predictor variables and relates, or at least is correlated to, what we see in the genetic data.


## Dependent &amp; Independent Relationships

The hypothesized relationship between the two variable sets determines which kind of model to assume.

```{r echo=FALSE}
knitr::include_graphics("media/regression_vs_correlation.png")
```


<!--chapter:end:linear_models.rmd-->

# Distance Networks {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_dogwood2.jpg"></div>

## Graph Structure

```{r}
A <- matrix(0, nrow=5, ncol=5)
A[1,2] <- A[2,3] <- A[1,3] <- A[3,4] <- A[4,5] <- 1
A <- A + t(A)
A
```


There is a quick function, `as.popgraph()` that takes either an existing **igraph**object or a matrix and turns them into *popgraph* objects.


```{r warning=FALSE, error=FALSE, message=FALSE}
 library(igraph)
g <- graph_from_adjacency_matrix( A , mode="undirected")
```

There are several options available under the `mode` parameter.  We typically use the undirected graph option but the following are also available:

1. `undirected` The connections between nodes are symmetric.  This is the default for population graphs as covariance, the quantity the edge is representing is symmetrical.
2. `directed` The edges are asymetric.
3. `max` or `min` Will take the largest (or smallest) value of the matrix (e.g., $max(A[i,j], A[j,i])$ or $min( A[i,j], A[j,i])$ ).
4. `upper` or `lower` Uses either the upper or lower element of the matrix.
5. `plus` Adds upper and lower values (e.g., $A[i,j] + A[j,i]$).

The graph object presents several characteristics as an output including the number of nodes and edges, the mode of the graph, and a list of the edges (if there aren't too many--in which case the list is truncated).

```{r}
g
```

Here we see that this object is an igraph, is 'U'ndirected, and has 5 nodes and 5 edges.  The edges are indicated by numbers and are graphically displayed.


### Node &amp; Edge Attributes

The underlying structure of an `igraph` object allows you to assoicate attributes (e.g., other data) with nodes and edges. Node attributes are accessed using the $V(graph)$ operator (for vertex) and edge attributes are done via $E(graph)$.  Attributes can be set as well as retrieved using the same mechanisms for nodes.

```{r}
V(g)$name <- c("Olympia","Bellingham","St. Louis","Ames","Richmond")
V(g)$group <- c("West","West", "Central","Central","East")
V(g)$color <- "#cca160"
list.vertex.attributes( g )
V(g)$name
```

and for edges

```{r}
E(g)
E(g)$color <- c("red","red", "red", "blue","dark green")
list.edge.attributes( g )
```


## Plotting a Graphs

One of the main benefits to using R is that you can leverage the mutlitude of other packages to visualize and manipulate your data in interesting and informative ways.  Since a `popgraph` is an instance of an *igraph* element, we can use the *igraph* routines for plotting.  Here is an example.

```{r}
plot(g)
```

There are several different options you can use to manipulate the graphical forms.  By default, the plotting routines look for node and edge attributes such as `name` and `color` to plot the output appropriately.  There are several additional plotting functions for plotting *igraph* objects.  Here are some examples.

```{r}
plot(g, edge.color="black", vertex.label.color="darkred", vertex.color="#cccccc", vertex.label.dist=1)
```

Below is the output from the `igraph.plotting` help file that covers the main options that you have for customizing the way that the network is displayed.

<div class="scrollingbox"><pre>
Drawing graphs {igraph}	R Documentation
Drawing graphs

Description

The common bits of the three plotting functions plot.igraph, tkplot and rglplot are discussed in this manual page

Details

There are currently three different functions in the igraph package which can draw graph in various ways:

plot.igraph does simple non-interactive 2D plotting to R devices. Actually it is an implementation of the plot generic function, so you can write plot(graph) instead of plot.igraph(graph). As it used the standard R devices it supports every output format for which R has an output device. The list is quite impressing: PostScript, PDF files, XFig files, SVG files, JPG, PNG and of course you can plot to the screen as well using the default devices, or the good-looking anti-aliased Cairo device. See plot.igraph for some more information.

tkplot does interactive 2D plotting using the tcltk package. It can only handle graphs of moderate size, a thousend vertices is probably already too many. Some parameters of the plotted graph can be changed interactively after issuing the tkplot command: the position, color and size of the vertices and the color and width of the edges. See tkplot for details.

rglplot is an experimental function to draw graphs in 3D using OpenGL. See rglplot for some more information.

Please also check the examples below.

How to specify graphical parameters

There are three ways to give values to the parameters described below, in section 'Parameters'. We give these three ways here in the order of their precedence.

The first method is to supply named arguments to the plotting commands: plot.igraph, tkplot or rglplot. Parameters for vertices start with prefix ‘vertex.', parameters for edges have prefix ‘edge.', and global parameters have no prefix. Eg. the color of the vertices can be given via argument vertex.color, whereas edge.color sets the color of the edges. layout gives the layout of the graphs.

The second way is to assign vertex, edge and graph attributes to the graph. These attributes have no prefix, ie. the color of the vertices is taken from the color vertex attribute and the color of the edges from the color edge attribute. The layout of the graph is given by the layout graph attribute. (Always assuming that the corresponding command argument is not present.) Setting vertex and edge attributes are handy if you want to assign a given ‘look' to a graph, attributes are saved with the graph is you save it with save or in GraphML format with write_graph, so the graph will have the same look after loading it again.

If a parameter is not given in the command line, and the corresponding vertex/edge/graph attribute is also missing then the general igraph parameters handled by igraph_options are also checked. Vertex parameters have prefix ‘vertex.', edge parameters are prefixed with ‘edge.', general parameters like layout are prefixed with ‘plot'. These parameters are useful if you want all or most of your graphs to have the same look, vertex size, vertex color, etc. Then you don't need to set these at every plotting, and you also don't need to assign vertex/edge attributes to every graph.

If the value of a parameter is not specified by any of the three ways described here, its default valued is used, as given in the source code.

Different parameters can have different type, eg. vertex colors can be given as a character vector with color names, or as an integer vector with the color numbers from the current palette. Different types are valid for different parameters, this is discussed in detail in the next section. It is however always true that the parameter can always be a function object in which it will be called with the graph as its single argument to get the “proper" value of the parameter. (If the function returns another function object that will not be called again...)

The list of parameters

Vertex parameters first, note that the ‘vertex.' prefix needs to be added if they are used as an argument or when setting via igraph_options. The value of the parameter may be scalar valid for every vertex or a vector with a separate value for each vertex. (Shorter vectors are recycled.)

size
The size of the vertex, a numeric scalar or vector, in the latter case each vertex sizes may differ. This vertex sizes are scaled in order have about the same size of vertices for a given value for all three plotting commands. It does not need to be an integer number.

The default value is 15. This is big enough to place short labels on vertices.

size2
The “other" size of the vertex, for some vertex shapes. For the various rectangle shapes this gives the height of the vertices, whereas size gives the width. It is ignored by shapes for which the size can be specified with a single number.

The default is 15.

color
The fill color of the vertex. If it is numeric then the current palette is used, see palette. If it is a character vector then it may either contain integer values, named colors or RGB specified colors with three or four bytes. All strings starting with ‘#' are assumed to be RGB color specifications. It is possible to mix named color and RGB colors. Note that tkplot ignores the fourth byte (alpha channel) in the RGB color specification.

For plot.igraph and integer values, the default igraph palette is used (see the ‘palette' parameter below. Note that this is different from the R palette.

If you don't want (some) vertices to have any color, supply NA as the color name.

The default value is “SkyBlue2".

frame.color
The color of the frame of the vertices, the same formats are allowed as for the fill color.

If you don't want vertices to have a frame, supply NA as the color name.

By default it is “black".

shape
The shape of the vertex, currently “circle", “square", “csquare", “rectangle", “crectangle", “vrectangle", “pie" (see vertex.shape.pie), ‘sphere', and “none" are supported, and only by the plot.igraph command. “none" does not draw the vertices at all, although vertex label are plotted (if given). See shapes for details about vertex shapes and vertex.shape.pie for using pie charts as vertices.

The “sphere" vertex shape plots vertices as 3D ray-traced spheres, in the given color and size. This produces a raster image and it is only supported with some graphics devices. On some devices raster transparency is not supported and the spheres do not have a transparent background. See dev.capabilities and the ‘rasterImage' capability to check that your device is supported.

By default vertices are drawn as circles.

label
The vertex labels. They will be converted to character. Specify NA to omit vertex labels.

The default vertex labels are the vertex ids.

label.family
The font family to be used for vertex labels. As different plotting commands can used different fonts, they interpret this parameter different ways. The basic notation is, however, understood by both plot.igraph and tkplot. rglplot does not support fonts at all right now, it ignores this parameter completely.

For plot.igraph this parameter is simply passed to text as argument family.

For tkplot some conversion is performed. If this parameter is the name of an exixting Tk font, then that font is used and the label.font and label.cex parameters are ignored complerely. If it is one of the base families (serif, sans, mono) then Times, Helvetica or Courier fonts are used, there are guaranteed to exist on all systems. For the ‘symbol' base family we used the symbol font is available, otherwise the first font which has ‘symbol' in its name. If the parameter is not a name of the base families and it is also not a named Tk font then we pass it to tkfont.create and hope the user knows what she is doing. The label.font and label.cex parameters are also passed to tkfont.create in this case.

The default value is ‘serif'.

label.font
The font within the font family to use for the vertex labels. It is interpreted the same way as the the font graphical parameter: 1 is plain text, 2 is bold face, 3 is italic, 4 is bold and italic and 5 specifies the symbol font.

For plot.igraph this parameter is simply passed to text.

For tkplot, if the label.family parameter is not the name of a Tk font then this parameter is used to set whether the newly created font should be italic and/or boldface. Otherwise it is ignored.

For rglplot it is ignored.

The default value is 1.

label.cex
The font size for vertex labels. It is interpreted as a multiplication factor of some device-dependent base font size.

For plot.igraph it is simply passed to text as argument cex.

For tkplot it is multiplied by 12 and then used as the size argument for tkfont.create. The base font is thus 12 for tkplot.

For rglplot it is ignored.

The default value is 1.

label.dist
The distance of the label from the center of the vertex. If it is 0 then the label is centered on the vertex. If it is 1 then the label is displayed beside the vertex.

The default value is 0.

label.degree
It defines the position of the vertex labels, relative to the center of the vertices. It is interpreted as an angle in radian, zero means ‘to the right', and ‘pi' means to the left, up is -pi/2 and down is pi/2.

The default value is -pi/4.

label.color
The color of the labels, see the color vertex parameter discussed earlier for the possible values.

The default value is black.

Edge parameters require to add the ‘edge.' prefix when used as arguments or set by igraph_options. The edge parameters:

color
The color of the edges, see the color vertex parameter for the possible values.

By default this parameter is darkgrey.

width
The width of the edges.

The default value is 1.

arrow.size
The size of the arrows. Currently this is a constant, so it is the same for every edge. If a vector is submitted then only the first element is used, ie. if this is taken from an edge attribute then only the attribute of the first edge is used for all arrows. This will likely change in the future.

The default value is 1.

arrow.width
The width of the arrows. Currently this is a constant, so it is the same for every edge. If a vector is submitted then only the first element is used, ie. if this is taken from an edge attribute then only the attribute of the first edge is used for all arrows. This will likely change in the future.

This argument is currently only used by plot.igraph.

The default value is 1, which gives the same width as before this option appeared in igraph.

lty
The line type for the edges. Almost the same format is accepted as for the standard graphics par, 0 and “blank" mean no edges, 1 and “solid" are for solid lines, the other possible values are: 2 (“dashed"), 3 (“dotted"), 4 (“dotdash"), 5 (“longdash"), 6 (“twodash").

tkplot also accepts standard Tk line type strings, it does not however support “blank" lines, instead of type ‘0' type ‘1', ie. solid lines will be drawn.

This argument is ignored for rglplot.

The default value is type 1, a solid line.

label
The edge labels. They will be converted to character. Specify NA to omit edge labels.

Edge labels are omitted by default.

label.family
Font family of the edge labels. See the vertex parameter with the same name for the details.

label.font
The font for the edge labels. See the corresponding vertex parameter discussed earlier for details.

label.cex
The font size for the edge labels, see the corresponding vertex parameter for details.

label.color
The color of the edge labels, see the color vertex parameters on how to specify colors.

label.x
The horizontal coordinates of the edge labels might be given here, explicitly. The NA elements will be replaced by automatically calculated coordinates. If NULL, then all edge horizontal coordinates are calculated automatically. This parameter is only supported by plot.igraph.

label.y
The same as label.x, but for vertical coordinates.

curved
Specifies whether to draw curved edges, or not. This can be a logical or a numeric vector or scalar.

First the vector is replicated to have the same length as the number of edges in the graph. Then it is interpreted for each edge separately. A numeric value specifies the curvature of the edge; zero curvature means straight edges, negative values means the edge bends clockwise, positive values the opposite. TRUE means curvature 0.5, FALSE means curvature zero.

By default the vector specifying the curvatire is calculated via a call to the curve_multiple function. This function makes sure that multiple edges are curved and are all visible. This parameter is ignored for loop edges.

The default value is FALSE.

This parameter is currently ignored by rglplot.

arrow.mode
This parameter can be used to specify for which edges should arrows be drawn. If this parameter is given by the user (in either of the three ways) then it specifies which edges will have forward, backward arrows, or both, or no arrows at all. As usual, this parameter can be a vector or a scalar value. It can be an integer or character type. If it is integer then 0 means no arrows, 1 means backward arrows, 2 is for forward arrows and 3 for both. If it is a character vector then “<" and “<-" specify backward, “>" and “->" forward arrows and “<>" and “<->" stands for both arrows. All other values mean no arrows, perhaps you should use “-" or “–" to specify no arrows.

Hint: this parameter can be used as a ‘cheap' solution for drawing “mixed" graphs: graphs in which some edges are directed some are not. If you want do this, then please create a directed graph, because as of version 0.4 the vertex pairs in the edge lists can be swapped in undirected graphs.

By default, no arrows will be drawn for undirected graphs, and for directed graphs, an arrow will be drawn for each edge, according to its direction. This is not very surprising, it is the expected behavior.

loop.angle
Gives the angle in radian for plotting loop edges. See the label.dist vertex parameter to see how this is interpreted.

The default value is 0.

loop.angle2
Gives the second angle in radian for plotting loop edges. This is only used in 3D, loop.angle is enough in 2D.

The default value is 0.

Other parameters:

layout
Either a function or a numeric matrix. It specifies how the vertices will be placed on the plot.

If it is a numeric matrix, then the matrix has to have one line for each vertex, specifying its coordinates. The matrix should have at least two columns, for the x and y coordinates, and it can also have third column, this will be the z coordinate for 3D plots and it is ignored for 2D plots.

If a two column matrix is given for the 3D plotting function rglplot then the third column is assumed to be 1 for each vertex.

If layout is a function, this function will be called with the graph as the single parameter to determine the actual coordinates. The function should return a matrix with two or three columns. For the 2D plots the third column is ignored.

The default value is layout_nicely, a smart function that chooses a layouter based on the graph.

margin
The amount of empty space below, over, at the left and right of the plot, it is a numeric vector of length four. Usually values between 0 and 0.5 are meaningful, but negative values are also possible, that will make the plot zoom in to a part of the graph. If it is shorter than four then it is recycled.

rglplot does not support this parameter, as it can zoom in and out the graph in a more flexible way.

Its default value is 0.

palette
The color palette to use for vertex color. The default is categorical_pal, which is a color-blind friendly categorical palette. See its manual page for details and other palettes. This parameters is only supported by plot, and not by tkplot and rglplot.

rescale
Logical constant, whether to rescale the coordinates to the [-1,1]x[-1,1](x[-1,1]) interval. This parameter is not implemented for tkplot.

Defaults to TRUE, the layout will be rescaled.

asp
A numeric constant, it gives the asp parameter for plot, the aspect ratio. Supply 0 here if you don't want to give an aspect ratio. It is ignored by tkplot and rglplot.

Defaults to 1.

frame
Boolean, whether to plot a frame around the graph. It is ignored by tkplot and rglplot.

Defaults to FALSE.

main
Overall title for the main plot. The default is empty if the annotate.plot igraph option is FALSE, and the graph's name attribute otherwise. See the same argument of the base plot function. Only supported by plot.

sub
Subtitle of the main plot, the default is empty. Only supported by plot.

xlab
Title for the x axis, the default is empty if the annotate.plot igraph option is FALSE, and the number of vertices and edges, if it is TRUE. Only supported by plot.

ylab
Title for the y axis, the default is empty. Only supported by plot.

Author(s)

Gabor Csardi csardi.gabor@gmail.com

See Also

plot.igraph, tkplot, rglplot, igraph_options

Examples

## Not run: 

# plotting a simple ring graph, all default parameters, except the layout
g <- make_ring(10)
g$layout <- layout_in_circle
plot(g)
tkplot(g)
rglplot(g)

# plotting a random graph, set the parameters in the command arguments
g <- barabasi.game(100)
plot(g, layout=layout_with_fr, vertex.size=4,
     vertex.label.dist=0.5, vertex.color="red", edge.arrow.size=0.5)

# plot a random graph, different color for each component
g <- sample_gnp(100, 1/100)
comps <- components(g)$membership
colbar <- rainbow(max(comps)+1)
V(g)$color <- colbar[comps+1]
plot(g, layout=layout_with_fr, vertex.size=5, vertex.label=NA)

# plot communities in a graph
g <- make_full_graph(5) %du% make_full_graph(5) %du% make_full_graph(5)
g <- add_edges(g, c(1,6, 1,11, 6,11))
com <- cluster_spinglass(g, spins=5)
V(g)$color <- com$membership+1
g <- set_graph_attr(g, "layout", layout_with_kk(g))
plot(g, vertex.label.dist=1.5)

# draw a bunch of trees, fix layout
igraph_options(plot.layout=layout_as_tree)
plot(make_tree(20, 2))
plot(make_tree(50, 3), vertex.size=3, vertex.label=NA)
tkplot(make_tree(50, 2, mode="undirected"), vertex.size=10,
vertex.color="green")

## End(Not run)
</pre></div>


In addition to the physical appearance of nodes, edges, and labels, networks are must also have a 'layout' that describes the relative position of nodes on the plot surface. There are several ways you can define a layout, here are some examples.

```{r}
layout <- layout.circle( g )
plot( g, layout=layout)
layout <- layout.fruchterman.reingold( g )
plot( g, layout=layout)
```

In addition to normal plotting, you can also integrate interactive plotting.  Here is an example using the `networkD3` library.  It is interactive, so grab one of the nodes and move it around.

```{r}
library(networkD3)
edgelist <- as_edgelist(g) 
df <- data.frame( src=edgelist[,1], target=edgelist[,2])
simpleNetwork(df,fontSize = 14,opacity = 0.95)
```


## Mapping Networks

For quick maps I typically use the **maps** library. It is pretty straightforward to use and does not take too much thought to quickly plot something or find the approporiate raster files.  Below, I add some coordinates to the data set.

```{r}
V(g)$Latitude <- c( 47.15, 48.75,38.81, 42.26, 37.74 )
V(g)$Longitude <- c(-122.89,-122.49,-89.98, -93.47, -77.16 )
```

Then overlay this onto a map using the `overlay_popgraph()` function.    Here is an example where I plot it over the map of the US states.

```{r,message=FALSE,fig.cap="Map of graph stetched onto continential US map."}
library(maps)
 library(popgraph)
pg <- as.popgraph( g )

map( "state" )
overlay_popgraph( pg )
```

This function **requires** that you already have a plot available (it uses the `lines()` and `points()` routines).  If you try to just overlay this with no existing plot, it will not work (and should throw an error).


## Genetic Distance Graphs

Several graph-theoretic approaches have been suggested in the literature, some of which are based upon statistical models (e.g., popgraphs in the next chapter) and some of which are less structured.  

A common approach has been to use a measure of pair-wise genetic distance, measured between individuals or strata.  In the following example, Nei's genetic distance (see Chapter \@ref(genetic-distance)) is used.

```{r}
 library(gstudio)
data(arapat)
d <- genetic_distance(arapat,mode="Nei")
```

yielding a $KxK$ distance matrix.  Nei's distance produces values that are non-negative 

```{r}
d[1:6,1:6]
```

and in the case of the `arapat` data, produces a bivariate distribution of distances.

```{r warning=FALSE, message=FALSE}
 library(ggplot2)
qplot(d[lower.tri(d)],geom = "histogram", bins=40)
```

Since Nei's distance is 


<!--chapter:end:distance_networks.rmd-->

# Population Graphs {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_euphorbia.jpg"></div>

Population Graphs are a statistical representation of among population genetic variance, $\sigma^2_A$, as viewed through a network (Dyer & Nason 2004).  A population graph is a graph-theoretic interpretation of genetic covariance and serves as a tool for understanding underlying evolutionary history for a set of populations.  

These structures are estimated in R using the **popgraphs** library.  

```{r,message=FALSE, warning=FALSE}
library(popgraph)
```

As other networks, a population graph is a graph-theoretic structure that can be represetendHere we will focus on the former approach as it is native to this package.  If you use the latter one, it will produce a *.pgraph file and you can read it in using the `read_popgraph()` function.


## Adding data to a graph

A population graph is made more informative if you can associate some data with topology.  External data may be spatial or ecolgoical data associated with each node.  Edge data may be a bit more complicated as it is traversing both spatial and ecolgoical gradients and below we'll see how to extract particular from rasters using edge crossings. 

Included in the *popgraph* package are some build-in data sets to illustrate some of the utility. Included is the cactus topology that was originally used to develop this approach (from Dyer & Nason 2004).

```{r}
data(lopho)
class(lopho)
lopho
```

We can associate data with the nodes using the `decorate_graph()` function.  This takes a `data.frame` object and tries to match up the columns of data in the data.frame to the nodes.  Here is an example with some addition built-in data.  The option `stratum` indicates the name of the column that has the node labels in it (which are stored as `V(graph)$name`).

```{r}
data(baja)
summary(baja)
lopho <- decorate_graph( lopho, baja, stratum="Population")
lopho
```

Each vertex now has seveal different types of data associated with it now.  We will use this below.


### Plotting a graph using **ggplot2** routines

The **ggplot2** package provides a spectacular plotting environment in an intuitive context and there are now some functions to support the Population Graphs in this context.  

If you haven't used **ggplot2** before, it may at first be a bit odd because it deviates from normal plotting  approaches where you just shove a bunch of arguments into a single plotting function.  In **ggplot**, you build a graphic in the same way you build a regression equation.  A regression equation has an intercept and potentially a bunch of independent terms.  This is exactly how **ggplot** builds plots, by adding togther components.

To specifiy how things look in a plot, you need to specify an aesthetic using the `aes()` funciton.  Here is where you supply the variable names you use for coordinate, coloring, shape, etc.  For both of the `geom_*set` funcitons, these names **must** be attributes of either the node or edge sets in the graph itself.

Here is an example using the *Lopohcereus* graph.  We begin by making a `ggplot()` object and then adding to it a `geom_` object.  The 5**popgraph** package comes with two funcitons, one for edges and one for nodes.  

```{r}
library(ggplot2)
p <- ggplot() 
p <- p + geom_edgeset( aes(x=Longitude,y=Latitude), lopho ) 
p
```

I broke up the plotting into several lines to improve readability, it is not necessary to to this in practice though.  The addition of additional `geom_` objects to the plot will layer them on top (n.b., I also passed the *size=4* option to the plot as the default point size is a bit too small and this is how you could change that).

```{r}
p <- p +  geom_nodeset( aes(x=Longitude, y=Latitude), lopho, size=4)
p
```

And then you can add additional options to the plot, like axis labels and a less exciting background theme (the `theme_empty()` provided by **popgraph** is actually transparent so you can save the image and throw it into a presentation as necessary).


```{r}
p <- ggplot() + geom_edgeset( aes(x=Longitude,y=Latitude), lopho, color="darkgrey" )
p <- p + geom_nodeset( aes(x=Longitude, y=Latitude, color=Region, size=size), lopho) 
p <- p + xlab("Longitude") + ylab("Latitude") 
p + theme_empty()
```

You can also use the default layout routines in *igraph* for visualization.  Here is an example using Fruchterman-Reingold algorithm.

```{r message=FALSE, warning=FALSE}
library(igraph)
c <- layout.fruchterman.reingold( lopho )
V(lopho)$x <- c[,1]
V(lopho)$y <- c[,2]
p <- ggplot() + geom_edgeset( aes(x,y), lopho, color="darkgrey" )
p <- p + geom_nodeset( aes(x, y, color=Region, size=size), lopho) 
p + theme_empty()
```


### Reading Existing **popgraph** Files


The online versions of Population Graphs provides a *.pgraph file for visualization. These files are visualized in several different software platforms including GeneticStudio (Dyer 2009), a OpenGL visualization application (Dyer &amp; Nason 2004), an online visualization framework at http://dyerlab.org, and of course, in R.  We shall focus on this last one.  Reading in files to R 

```{r,eval=FALSE,echo=TRUE}
graph <- read.popgraph( "thegraph.pgraph" )
```


### Saving Population Graph Objects


A **popgraph** object is a normal R object and can be saved using the normal R mechanisms.

```{r,echo=TRUE,eval=FALSE}
save( lopho, file="MyLophoGraph.rda")
```


For interoperability, **popgraph** objects can also be saved in other formats. These are accessed through the `write.popgraph()` function.  

```{r,echo=TRUE,eval=FALSE}
write.popgraph(lopho,file="~/Desktop/Cactus.pgraph", format="pgraph")
```
There are several other options available for outputing your graph.  Currently the other formats that have been implemented are:

1. `json` A format for html javascript data processing.  
+ `kml` The Keyhole Markup Language which is read by GoogleEarth.  This requires `Latitude` and `Longtitude` vertex properties.
+ `graphml` The graph markup language.
+ `html` Export as an interactive html document you can manipulate on your desktop (uses javascript d3.js library so you need an internet connection).
+ `pajek` Export to a format that works with the software Pajek (http://pajek.imfm.si/doku.php?id=pajek)
+ `pgraph` The format used in GeneticStudio and the original popgraph 3D viewer (this is the default).
+ `adjacency` Saves the adjacency matrix of the graph (binary) as a CSV file
+ `paths` Saves the shortest paths matrix as a CSV file
+ `weights` Saves the adjacency matrix with edge weights.


## Interactive Networks

To create a popgraph, you need to pass the `popgraph()` function genotypes as multivariate variables—the function `to_mv()` does this behind the scene—and a vector of variables that allocate each row of data to a node.  Here we use the 'Population' vector from the arapat `data.frame`.

```{r interactive1, message=FALSE, warning=FALSE}
library(gstudio)
library(popgraph)
data(arapat)
graph <- popgraph(to_mv(arapat),groups = arapat$Population)
print(graph)
```

The `forceNetwork()` function is what does the plotting and it needs some data that are in a specific format.  Essentially, there needs to be two `data.frame` objects with the following attributes:  

 - **nodes** - A `data.frame` with each row representing the name of the node to be displayed, the group the node belongs to (if there are groupings of nodes to be displayed by alternative colors), and a vector of node sizes.  
 - **edges** - A `data.frame` representing the edge connecting the nodes, labeled as 'from' and 'to' and a vector of weights.  
 
The 'from' and 'to' vectors need to be numeric values of the nodes in the other data frame **and** need to be 0-indexed (e.g., the first node name it is going to look up is indexed as '0' in the javascript instead of '1' as is common in R).  


```{r message=FALSE, warning=FALSE}
nodes <- to_data.frame( graph, mode="nodes", as.named=FALSE )
edges <- to_data.frame( graph, mode="edges", as.named=FALSE )
edges$source <- edges$source - 1
edges$target <- edges$target - 1 
```

The only last thing to do is to define a grouping of populations.  This will be represented in the network as a coloring.  For this one, I'm going to use the hypothesized STRUCTURE clustering (see \@ref(admixture) for how this was done).  In the `arapat` data set, there is a designation for each individual on which cluster they belong.  Some populations are 'pure' in their groupings but others (in spatial regions of sympatry) they are mixed.  Below I determine the estimated STRUCTURE groups for each population and collapse those who have more than one into a single string.


```{r}
grps <- by( arapat$Cluster, arapat$Population, unique )
l <- lapply( grps, function(x) { g <- paste(sort(x),collapse="/") })
df <- data.frame( name=names(l), group=as.character(l))
nodes <- merge( nodes, df )
```


Once defined, we can then call the function to make the `data.frame` objects and then do the plot.  These graphics are *interactive*, grab a node and drag it around!

```{r}
library(networkD3)
forceNetwork(Links = edges, Nodes = nodes,
            Source = "source", Target = "target",
            Value = "value", NodeID = "name",
            Group = "group", opacity=1.0,
            legend=TRUE, fontSize = 16,
            zoom=TRUE )
```


## Spatial Population Graphs


Mapping the nodes and edges onto real space is a key task in the understanding of how covariance is partitioned on the landscape.  There are several approaches that can be used in R since it is such a flexible platform.  In what follows I will use a series of techniques that I find useful ordered from the simplest to the more complicated.  


### Integrating Google and **ggplot2** for Plotting

R has some pretty good facilities for using spatial assests from Google and OpenStreetMaps and is a very easy way to get quick plots from Population Graphs, particularly if you can integrate it into the **ggplot2** framework.

Using the **ggmap** package, you can request map tiles and use as backgrounds onto which you can plot specific objects.  To do so, you must first get:

- Either the centroid of the location you are interested in finding and a value of zoom (just like in google maps), or
- A bounding box with left, bottom, right, top coordinates.  This is a bit of an experimental thing and does not always get you what you want.  

Some fiddling is required with either way you go.  The map you get from `get_map()` is essentially a huge matrix of hex colors as shown above.

```{r, message=FALSE}
library(ggmap)
location <- c( mean(V(lopho)$Longitude), mean(V(lopho)$Latitude))
location
map <- get_map(location,maptype="satellite", zoom=6)
dim(map)
map[1:4,1:4]
```

This map object can be passed on to `ggmap()`, which replaces the traditional `ggplot()` funciton and sets up the bounding box in terms of Latitude and Longtidue.  Onto this, you can plot the graph topologoy using:

- `geom_edgeset()` This takes the graph and plots out the edges.  
- `geom_nodeset()` This plots out the nodes.  You could probably use a regular *data.frame* and `geom_point()` as well. Here is an example:

```{r}
p <- ggmap( map ) 
p <- p + geom_edgeset( aes(x=Longitude,y=Latitude), lopho, color="white" ) 
p <- p + geom_nodeset( aes(x=Longitude, y=Latitude, color=Region, size=size), lopho) 
p + xlab("Longitude") + ylab("Latitude")
```


### Integrating Raster Maps

At times we have raster data upon we can plot a population graph. Here is an example from Baja California. The underlying raster image is croped from a WorldClim tile and represents elevation.  

```{r,message=FALSE}
library(raster)
data(alt)
plot(alt)
```

Since it is a **raster** object, it knows how to plot itself relatively well.  There are a ton of good references for showing you how to play with raster data (e.g., Bivand *et al.* 2008)

To plot our graph onto this topology, we export the spatial components of the graph into objects that interact with rasters.  The packages provides simple extraction of features into `SpatialLines` and `SpatialPoints` objects.

```{r}
lopho.nodes <- to_SpatialPoints(lopho)
lopho.nodes
lopho.edges <- to_SpatialLines(lopho)
head(lopho.edges)
```

Once we have them extracted into the right format, we can add them to the raster plot.  I plot the nodes twice to overlay a circular icon (`pch=16`) onto the default cross marker and make them 50 \% larger (`cex=1.5`).

```{r, fig.width=8}
plot( alt )
plot( lopho.edges, add=TRUE, col="#555555" )
plot( lopho.nodes, add=TRUE, col="black", cex=1.5 )
plot( lopho.nodes, add=TRUE, col=V(lopho)$color, pch=16, cex=1.5 )
```


## Extracting Spatial Data Using Population Graphs


Since we are dealing with the spatial stuff right now, it makes sense to look into how we can use the topological features of the graph to extract spatial data.  


### Node Specific Data

The node data `nodes` is a `SpatialPoints` object and can be used to pull data from raster sources.  I'll start by creating a `data.frame` with some existing data in it.

```{r}
df.nodes <- data.frame(Pop=V(lopho)$name, Latitude=V(lopho)$Latitude, Longitude=V(lopho)$Longitude)
```


Then we can extract the elevation from the `alt` raster as:

```{r}
library(raster)
df.nodes$Elevation <- extract( alt, lopho.nodes )
summary(df.nodes)
```

Additional data could be extracted from other rasters.  See http://worldclim.org for some example data that may prove useful.

### Extracting Data Along Popgraph Edges

It is also possible to extract data along vectors, or other *SpatialLines* objects, which the edges in a popgraph can be tranformed into.  This is a particularly helpful approach if you are trying to quantify the value of characteristics of the environment between your sampling locations.  In the following example, I estimate the popualtion graph from the `arapat` data set 

```{r popgraphInBaja, warning=FALSE,message=FALSE, echo=-1}
graph <- popgraph( to_mv( arapat ), arapat$Population )
coords <- strata_coordinates(arapat)
graph <- decorate_graph( graph, coords )
edges <- to_SpatialLines(graph)
proj4string(edges) <- CRS( proj4string( alt ))
plot( alt, legend=FALSE)
plot(edges,add=TRUE)
```

and determine which of the edges has the longest length.  

```{r}
edge_lengths <- SpatialLinesLengths( edges )
longest <- sort( edge_lengths,decreasing = TRUE )[1]
longest
```

This edge is found at:

```{r echo=-1}
idx <- which( edge_lengths == longest )
idx
edge <- edges[ 37 ]
edge
plot( alt, legend=FALSE)
plot(edge,add=TRUE)
```

From this edge object (or a collection of edge objects) we can again extract values from a raster.  Here I pull out the elevation profile of this edge.  To plot it, I need to make a sequence of latitude values equal in length to that of the observed elevation values I extracted.  

```{r warning=FALSE,message=FALSE}
elev <- extract( alt, edge )[[1]]
from_lat <- V(graph)$Latitude[V(graph)$name == "SFr"]
to_lat <- V(graph)$Latitude[ V(graph)$name == "164"]
lat <- seq( from=from_lat, to=to_lat, length.out = length(elev) )
df <- data.frame( Latitude=lat, Elevation=elev)
p <- ggplot( df, aes(Latitude,Elevation)) + geom_line(color="lightgrey") 
p + geom_point() + ylab("Elevation (m)")
```


Population Graphs have been used to determine if there is an preference (or avoidance) of a specific raster value for the location of individual edges on the landscape. For example, Dyer *et al.* (2012) were interested in determining if there the edge in the pollination graph (e.g., a population graph constructed from male pollen haplotypes) preferentially traverse (or avoid) specific intervening habitat features.  To do this, they permuted the edge set among nodes in the graph and recorded the prevalence (mean and variance) of specific features extracted from specific categorical rasters representing both canopy and understory features.  The permutation of a network should be done such that you preserve aspects of the spatial heterogeneity and spatial arrangement of the nodes on the landscape.  You probably do not want to permute all edges randomly in the graph (though this option is available in the code), but would probably be better served by permuting the network while maintaining both the node coordinates (e.g., where they exist on the landscape) as well as the degree distribution of the overall network.  This second criteria holds constant higher network structure.  The general idea is to:  

1. Create a saturated graph and extract ecological features for all potential connections.  This gives us a `data.frame` within which we can pull out ecological values for each permutation.  This is the most computationally intensive process and doing it once and then extracting values from the `data.frame` for each permutation is a more efficient approach.
2. Once you have all the potential values of your features, you can permute the observed matrix, while holding both the connection probability (e.g., the number of edges) and the degree distribution (e.g., the amount of edges connected to nodes) constant using the `randomize_graph` function included in the `popgraph` library.  For each permutation, you then compile the permuted environmental factor as a null distribution and then compare those to the observed.  

This may sound a bit convoluted, but this example may help.  Consider the hypothetical case where we think that the edges in the population graph from Figure \@ref(fig:popgraphInBaja), are restricted in elevation because we believe that dispersing insects fly around high elevation mountains rather than over them.^[This is a truly hypothetical and contrived example to show how you would do this in the code and is not motivated by any biologically motivated reasons.  Ya' gotta make up examples sometimes...]  If this is true, then we should expect that the average (or max) elevation along any of the observed edges in the Population Graph would be less than what would be expected if we permuted the edges among nodes and measured elevation along edges from permuted graphs.

First, we need to set up the network and extract values of elevation along all potential edges.  I make a saturated graph from all potential 

```{r}
data(baja)
graph <- popgraph( to_mv( arapat ), arapat$Population )
graph <- decorate_graph( graph, coords )
allpops <- V(graph)$name
```

I then can make an adjacency matrix connecting all pairs of populations

```{r}
A <- matrix(1,nrow=length(allpops),ncol=length(allpops))
diag(A) <- 0
rownames(A) <- colnames(A) <- allpops
saturated_graph <- graph.adjacency(A, mode = "undirected")
saturated_graph <- as.popgraph( saturated_graph )
```

From which I can pull all the edges (all `r 0.5*nrow(A)*(nrow(A)-1)` of them) as `SpatialLines` objects

```{r}
saturated_graph <- decorate_graph( saturated_graph, coords )
all_edges <- to_SpatialLines( saturated_graph )
```

From these `r length(all_edges)` `SpatialLines` objects, we can extract data from the elevation raster.  

```{r echo=TRUE, eval=FALSE}
edge_values <- extract( alt, all_edges, fun=max, na.rm=TRUE, df=TRUE)
```

This will take a bit of time to complete.  The options that I provided were:  
- `fun=max` - The function used is the max function.
- `na.rm=TRUE` - Ignore all missing data (e.g., when an edge crosses water on the `alt` raster, the extracted values are `NA`)
- `df=TRUE` - Return the answer as a `data.frame` object instead of just a vector.

```{r eval=TRUE, echo=FALSE}
load("./spatial_data/edge_values.rmd")
```

This `data.frame` has two columns, one for edge number and the other for value.  I'm going to put an additional pair of columns with the names of the nodes the edges are connected to into this `data.frame`

```{r}
edge_names <- as_edgelist( saturated_graph )
edge_values$Nodes <- paste( edge_names[,1], edge_names[,2], sep="-")
head(edge_values)
```

This constitutes *all potential* connections across the landscape.  From this we can extract the edges that we observed in the original Population Graph

```{r}
e <- as_edgelist( graph )
obs <- edge_values$alt_22[ edge_values$Nodes %in% paste( e[,1], e[,2], sep="-") ]
mean(obs)
```

We can now permute the network a moderate number of times and take the values of permuted elevation to see if our observed are smaller than all potential elevations for this specific network.

```{r}
perm_elev <- rep(NA,999)
for( i in 1:length(perm_elev) ) {
  perm_graph <- randomize_graph( graph )
  e <- as_edgelist( perm_graph )
  perm_val <- edge_values$alt_22[ edge_values$Nodes %in% paste( e[,1], e[,2], sep="-") ]
  perm_elev[i] <- mean(perm_val)
}
```

Now, we can see where the observed value occurs in the distribution of elevations created under the null hypothesis of no difference in elevation across edges.

```{r}
df <- data.frame( Elevation=c(mean(obs),perm_elev), Category=c("Observed",rep("Permuted",999)))
ggplot( df, aes(x=Elevation,fill=Category)) + geom_histogram(stat="bin", bins=40) + xlab("Elevation (m)") + ylab("Distribution of Permuted Elevations")
```

In fact, we can estimate the probablity as:

```{r}
sum( mean(obs) >= perm_elev )
```

As it turns out, the observed edges do in fact appear to be traversing lower elevations than the potential set of edges that could be present (while controlling for spatial location of populations and graph structure).  

<!-------------  XXXXXXX   ---------------->


## Extracting Graph-Theoretic Parameters


The underlying structure of a **popgraph** object is based upon the **igraph** package from Gabor Csardi.  A population graph is essentially a specific kind of **igraph**object and can be decorated with metadata that is useful for spatial population genetic analyses.  As such, there is a wealth of existing analyses from both the **igraph** as well as the **sna** packages that can be used on **popgraph** objects.  Here are some examples.

### Matrix Representations of Population Graph Objects

A graph topology is a graphical representation of a matrix and there are several reasons why you may want to use these matrices.  The function `to_matrix()` is an easy front-end to several kinds of matrices.  Matrix structure itself can be defined by adjacency matrics, either binary (the default) or weighed by the edge weight.  Several graph-theoretic parameters are derived from the adjacency matrix.  Here is an example from our little graph that started this document.

```{r}
to_matrix( lopho, mode="adjacency")[1:5,1:5]
to_matrix( lopho, mode="edge weight")[1:5,1:5]
```

In addition to who each node is connected to, it is often of interest to know the length of the shortest path through the matrix connecting nodes.  Here is a slightly larger example, using the cactus data so we can look at isolation by graph distance.

```{r}
cGD <- to_matrix( lopho, mode="shortest path")
cGD[1:5,1:5]
```

It should be noted that the shortest distance through a population graph is defined as the parameter $cGD$, conditional graph distance (see Dyer *et al.* 2010 for more information on this parameter).

Now, we need the physical distance between the nodes.  If the physical size of the sampling area is small we could just use the Pythagorean equation.  However, here the distance is relatively large and the curvature of the earth may be of interest to take into account.  There are seveal functions that will calculate 'great circle distance' but the easiest is `rdist.earth()` from the **fields** funtion.  


```{r,warning=FALSE,message=FALSE}
library(gstudio)
df <- data.frame( Stratum = V(lopho)$name, 
                  Longitude  =V(lopho)$Longitude, 
                  Latitude = V(lopho)$Latitude ) 
pDist <- strata_distance( df )
```

Now, we can plot these values against each other to see if there is a pattern of 'isolation by distance' captured in the graph topology.  To do this, I extract only the upper triangle (e.g., the values above the diagonal of each matrix) because they are symmetric matrices and we do not want to look at each datum twice.

```{r}
df <- data.frame( cGD=cGD[upper.tri(cGD)], Phys=pDist[upper.tri(pDist)])
cor.test( df$Phys, df$cGD, method="spearman")
```

We can plot these values and make a trendline pretty easily.  Here is a plot using **ggplot2** (a very nice plotting library; you could use `plot()` to do the normal plotting but I think **ggplot2** does such a nice job I encourage its use).  

```{r,message=FALSE}
qplot( Phys, cGD, geom="point", data=df) + stat_smooth(method="loess") + xlab("Physical Distance") + ylab("Conditional Genetic Distance")
```

The trendline is the loess predicted line with confidence interval.  


### Node Specific Parameters

Features of the topology can be extracted as either properties of the nodes or the edges. Node properties may provide insights into localized processes (e.g., a 'sink' population).  There are a lot of different parameters that can be derived and several packages in R that help out.  Here are some basic ones.

```{r}
df.nodes$closeness <- closeness(lopho)
df.nodes$betweenness <- betweenness(lopho)
df.nodes$degree <- degree( lopho )
df.nodes$eigenCent <- evcent( lopho )$vector
df.nodes$Region <- factor(V(lopho)$Region)
summary(df.nodes,color="Region")
```


The relationship betwwen the node variables can be evaluated in a pair plot.

```{r,message=FALSE,warning=FALSE,fig.width=8}
library(GGally)
ggpairs(df.nodes,columns=2:9, color='Region')
```


### Edge Specific Parameters


Edges may have specific properties as well.  Here are some examples using betweeness centrality, community, and regionality (if the edge connects within Baja or Sonora or crosses the Sea of Cortéz).

```{r}
df.edge <- data.frame( Weight=E(lopho)$weight )
df.edge$betweenness <- edge.betweenness(lopho)
df.edge$Region <- rep("Baja",52)
df.edge$Region[36:52] <- "Sonora"
df.edge$Region[c(11,24,27,35)] <- "Cortez"
ggpairs(df.edge, color="Region")
```


## Testing for Topological Congruence

If you have more than one topology and the node sets overlap sufficiently, you can test for the topological congruence of the two.  There are several specific reasons why population graph topologies may be congruent.  In this section we use the spatial genetic structure of the Senita cactus (*Lophocereus schottii*) and its obligate pollinator, *Upiga virescens* as an example.  This is an obligate pollination mutualism and as such we should expect there to be some degree of spatial genetic congruence between the species due to this co-evolution.  


```{r}
data(upiga)
upiga <- decorate_graph(upiga,baja,stratum="Population")
upiga.nodes <- to_SpatialPoints(upiga)
upiga.edges <- to_SpatialLines(upiga)
```

These two data sets were 'mostly' collected in the same physical locations.  Here is a comparison of the two topologies.

```{r,fig.width=9}
par(mfrow=c(1,2))
plot(lopho)
plot(upiga)
```

You can clearly see some differences in both the node and edge sets.  However, given the relationship between these organisms, there is an expectation that they should share some spatial structure.  The function `congruence_topology()` is designed to extract the congruence graph that is the intersection of both node and edge sets.  If the node sets are not completely overlapping (in this case they are not), it will give you a warning.  If you want to compare topologies, you must start with identical node sets because the topology in a Population Graph is based upon the entire structure, not just pairwise differences.  See the writeup about the **gstudio** package for more information on this.  

```{r,message=FALSE}
cong <- congruence_topology(lopho,upiga)
plot(cong)
```

We can then take the congruence graph and plot it or work with it in the normal fashion.

```{r}
cong <- decorate_graph( cong, baja )
cong.nodes <- to_SpatialPoints(cong)
cong.edges <- to_SpatialLines(cong)
plot(alt)
plot(cong.edges,add=T)
plot(cong.nodes,add=T, pch=16, col="red")
```


There are several ways to examine 'congruence' in graph topologies, of which I show two.  The first method is based upon the correlation of pair-wise distance through the graph for each.  That is to say, are proximate nodes in `lopho` similarily close in `upiga`?   This is called "Distance Congruence" and is based upon a non-parametric correlation of path distances.

```{r,warning=FALSE}
test_congruence(lopho,upiga,method="distance")
```

Another way is to ask about the pattern of connectivity. Based upon the number of nodes and edges in `lopho` and `upiga`, are there more in the congruence graph than could be expected if the two graph were just randomly associated?  This is called "Structural Congruence" and is determined combinatorially.  What is returned by this is the probability having as many edges as observed in the congruence graph given the size of the edge sets in the predictor graphs. You can think of this as the fraction of the area under the line as large or larger than the observed.

```{r,warning=FALSE}
test_congruence(lopho,upiga, method="combinatorial")
```


Hope this provides enough of an overview of the **popgraph** package to get you started.  If you have any questions feel free to email [mailto://rjdyer@vcu.edu].


<!--chapter:end:population_graphs.rmd-->

# Interactive Visualization 

Interpreting complicated networks can be difficult if the number of nodes and/or edges are large, due in large part due to the way we project network structure onto 2-dimensional spaces.  While publication-quality graphics are needed in the end, interpretation can be aided by plotting networks in a dynamic context.  Here we look at a Population Graph constructed using the *Araptus attenuatus* genetic data set.

## Procedure

First, we need to load in the proper libraries. Each of the libraries are needed for:

 - **gstudio** - Contains the *A. attenuatus* data set.  
 - **popgraph** - Performs the `popgraph()` analysis.  
 - **networkD3** - Loads in the javascript D3 library and allows the interactive plot.  

### Interactive Networks


<!--chapter:end:interactive_visualization.rmd-->

# (PART) Spatial Data {-}

<!--chapter:end:part_spatial_data.rmd-->

# Vector Data  {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_newt2.jpg"></div>

For the purposes of this chapter, I will consider vector data as consisting of a finite set of points that may or may not be connected.  In `R`, these points can be used directly, as a numeric data type, or as a Spatial* class object.  The `sp` library contains a lot of functions that help deal with points, lines, and polygons and this is going to be a short overview of how they can be derived and manipulated in the pursuit of population genetic studies.

The construction of Spatial* objects is a bit convoluted.  I don't appreciate why that is, it is just the way it was constructed.  Here is a set series of examples to get you going.  I will wait until subsequent chapters for us to use these structures in analyses and data extraction, this is mostly a quick tutorial on how to create these objects.

## Points

Points are defined by *SpatialPoints* objects.  A collection of points may have additional data associated with each location and would make a *SpatialPointsDataFrame*.  This is a bit different than the normal `data.frame` objects we've been using with coordinates in them already---in fact it is the opposite of that.  It is a set of points within which is located a `data.frame` rather than `data.frame` that has within it a set of points.  

Confused yet?  Lets get to the point and make some coordinates.  Here is the way we've extracted points from the *Arapatus attenuatus* data set thus far.

```{r chunk-vector}
library(gstudio)
data(arapat)
coords <- strata_coordinates(arapat)
summary(coords)
```

However, we can also derive these points directly as a *SpatialPoints* object defined in the *sp* library by setting the optional flag `as.SpatialPoints=TRUE`.

```{r}
library(sp)
library(raster)
pts <- strata_coordinates( arapat, as.SpatialPoints = TRUE )
pts
```

Notice that there is no coordinate reference system in the default extraction.  This is because you can pass a wide array of coordinates to this function and it only takes the centroid.  It is up to you to define the projection and datum for the data.  If it is Long/Lat data as in the example, it can be defined as:

```{r}
proj4string(pts) <- CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0")
pts
```

Any set of x- and y- coordinates can be turned into a *SpatialPoints* object. If we are to associate data with those points, the data has to have the same number of observations as there are coordinates.  In our case here, we have 39 populations and as an example I'll determine the number of individuals genotyped in each population as a 

```{r}
df <- data.frame( table(arapat$Population) )
names(df) <- c("Population","N")
pts.df <- SpatialPointsDataFrame(pts,df)
pts.df
```

You can translate it back into a `data.frame` object as:

```{r}
as.data.frame( pts.df )[1:5,]
```

or access the data within the `data.frame` directly (thereby not needing to make a new object) using the attribute `@` operator

```{r}
pts.df@data[1:5,]
```

Since it is a *SpatialPoints* object, you can get information about it such as the bounding box (e.g., the coordinates of a box that encloses all the points).

```{r}
bbox(pts.df)
```


## Lines

Lines are created by pairs of points.  A single *Line* object 

```{r}
c1 <- cbind(coords$Longitude[1:2], coords$Latitude[1:2])
c2 <- cbind(coords$Longitude[2:3], coords$Latitude[2:3])
L1 <- Line(c1)
L2 <- Line(c2)
L1
coordinates(L1)
```

A collection of *Line* objects can be put into a *Lines* object.

```{r}
Ls1 <- Lines( list(L1), ID="88 to 9")
Ls2 <- Lines( list(L2), ID="9 to 84")
Ls1
```
 
And if they are spatial in context (e.g., if we need to plot them in any way, shape, or form), we need to put them into a *SpatialLines* object, which is also constructed from a `list` of *Lines* objects.

```{r}
SLs <- SpatialLines( list(Ls1,Ls2))
proj4string(SLs) <- CRS(proj4string(pts))
SLs
```

If we want to add data to the set of lines, we can associate a `data.frame` with each of them with internal data.

```{r}
df <- data.frame( Sequence = c("First","Second"), Like_It= c(TRUE,FALSE), row.names = c("88 to 9","9 to 84"))
SLDF <- SpatialLinesDataFrame( SLs, df )
SLDF
```

```{r}
as.data.frame(SLDF)
```

We can also extract the line lengths of each line.

```{r}
SpatialLinesLengths(SLs, longlat = TRUE)
```

## Polygons 

A polygon is simply a collection of line segments that closes in on itself.  We can use polygons to identify habitat, define boundaries, etc.  In the short description to follow, we will create a set Polygon* objects, culminating in a *SpatialPolygonsDataFrame* object.

We will start with the first 5 coordinates in the `arapat` data set. To make the polygon, we **must** close the coordinates, which means take the first one we put in and append it to the end of the list of coordinates, such that in this case `c[1,] == c[6,]`.

```{r}
c <- cbind( coords$Longitude[1:5], coords$Latitude[1:5])
c <- rbind( c, c[1,])
P <- Polygon( c )
P
```

As you can see, there is some additional information provided in the default layout.  A few points to be made:  
- The `area` parameter is not georeferenced as the polygon itself has no projection.  
- The `labpt` is the coordinate where a label would be plot.
- The `hole` and `ringDir` determine if the polygon represent a hole in some other polygon (e.g., the doughnut hole and the direction it is plot).

Similar to how we constructed *SpatialLines*, a *Polygon* must be in inserted into a set of *Polygons*

```{r}
Ps <- Polygons( list(P), ID="Bob")
```

From which a list of can be created to make a *SpatialPolygons* object

```{r}
SPs <- SpatialPolygons( list(Ps))
proj4string(SPs) <- CRS(proj4string(pts))
SPs
```

And data can be added to it making a *SpatialPolygonsDataFrame* (n.b., The `row.names` of the `data.frame` *must* match the `ID` we set for making the *Polygons* objects).  If they do not, there will be an error thrown.

```{r}
df <- data.frame(Populations=paste(coords$Stratum[1:5],collapse=", "), row.names = "Bob")
SPDF <- SpatialPolygonsDataFrame( SPs, df)
SPDF
```


## Saving Vector Objects

As all of these objects are R objects, they can be saved to disk using the `save()` function, which makes them a *.rda object.  If you have objects that take a bit of time to create, it is in your best interests to save them after creation and on subsequent analyses, just use the saved versions.


<!--chapter:end:vector_data.rmd-->

# Raster Data {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_jaguar.jpg"></div>

Rasters are a form of data that is georeferenced and (somewhat) continuous.  Raster data is perhaps best envisioned as a matrix of values, whose entries represent spatially referenced data points.  The raster itself can be visualized as you could for normal matrix output.  What makes a raster different, however, is that it is (or should be) georeferenced. This means that each element of the matrix represents some measurement on the ground having a specific location and spread.  This is analogous to an image, where if you zoom in on it enough, you will be able to differentiate between individual pixels, it is just that for rasters, each pixel has a spatial location and size associated with it that we can map onto the earth.

You can either create raster objects *de novo* or you can acquire them from some external source.  To create one from scratch, you start with a matrix of values and then construct the raster object using the `raster()` function as:

```{r chunk-raster, message=FALSE,warning=FALSE}
library(raster)
r <- matrix(runif(10000),nrow=100)
rnd <- raster( r )
```

which can be visualized using the normal plot command. The raster library has overridden several of the plotting functions and you can plot raster objects and decorate the images in the same way you do for normal plotting materials (\@ref(graphics)).

```{r}
plot(rnd)
```

There are also many available repositories for raster data including Open Source, Governmental, and Municipal locations.  One common source for these data is that of http://worldclim.org.  This repository contains temperature and precipitation data generalized for the entire globe.  

```{r echo=FALSE}
knitr::include_graphics("./media/WorldClimTiles.png")
```

These data are available free of charge and have been used in numerous biological studies.  Moreover, they provide a set of 'biologically relevant' layers, called `BioClim`, that summarize both temperature and precipitation. They motivate these by saying:

> Bioclimatic variables are derived from the monthly temperature and rainfall values in order to generate more biologically meaningful variables. These are often used in ecological niche modeling (e.g., BIOCLIM, GARP). The bioclimatic variables represent annual trends (e.g., mean annual temperature, annual precipitation) seasonality (e.g., annual range in temperature and precipitation) and extreme or limiting environmental factors (e.g., temperature of the coldest and warmest month, and precipitation of the wet and dry quarters).

These layers are encoded into 19 Bio-layers as defined in Table \@ref(tab:bioclim).  These layers are available for download from their site directly (I recommend using the tiles approach so you do not have to download the entire world map) as well as from the R package [dismo](https://cran.r-project.org/web/packages/dismo/index.html).  

```{r bioclim, echo=FALSE}
df <- data.frame( Layer=paste( rep("BIO",19), 1:19, sep=""))
df$Description = c("Annual Mean Temperature", "Mean Diurnal Range (Mean of monthly (max temp - min temp))", "Isothermality (BIO2/BIO7 * 100)", "Temperature Seasonality (standard deviation * 100)", "Max Temperature of Warmest Month","Min Temperature of Coldest Month", "Temperature Annual Range (BIO5-BIO6)","Mean Temperature of Wettest Quarter","Mean Temperature of Driest Quarter","Mean Temperature of Warmest Quarter","Mean Temperature of Coldest Quarter","Annual Precipitation","Precipitation of Wettest Month","Precipitation of Driest Month","Precipitation Seasonality (Coefficient of Variation)","Precipitation of Wettest Quarter","Precipitation of Driest Quarter","Precipitation of Warmest Quarter","Precipitation of Coldest Quarter")
knitr::kable( df, booktabs=TRUE, caption="Key to the categories of bioclim variables derived from temperature and precipitation models for current (past and future) conditions.  Rasters for these values (in GeoTiff and BIL formats) are available from http://worldclim.org.", longtable=TRUE )
```

For the purposes of this chapter, I'll use bioclim and altitude layers from tile 22, which encompasses the spatial distribution of sampling locations in Baja California for the *Araptus attenuatus* dataset

Common raster formats include GeoTiff, essentially an image file with some metadata associated with it, and BIL (Binary interleaved) file formats.  Both of these types are available from WorldClim.  In general, the  GeoTiff format is slightly easier to work with as all the data is contained within a single file, whereas the BIL format has two files for each raster (the second file is a header file that has the spatial meta data associated with it).  If you do use the BIL format, the file path you pass to `raster()` would be of the BIL file, not the header one.

From Worldclim, I downloaded the elevation raster for Tile 22 and can load it into R using the `raster()` function as:

```{r warning=FALSE, message=FALSE}
alt <- raster("./spatial_data/alt.tif")
alt
```

The `alt` object is summarized here.   A couple of things should be pointed out here:
- In total, this is an object with 12,960,000 entries in it!  
- The resolution of each 'pixel' in this representation is 0.008, which is about 30-arc seconds or ~1km. That means that each loation in the study area is represented by the same exact value as the surrounding square kilometer. Obviously, if you are working on processes whose spatial scale is relevant less than 1000 m^^2^^, this kind of data is going to be of little value to you. 
- The values within the matrix range from -202 upwards to 5469.  This is in meters.

In addition, the raster has a spatial extent and a projection associated with it.  For more information on projections see \@ref(map-projections).

```{r}
bbox( alt )
proj4string( alt )
```

This elevation raster looks like:

```{r}
plot( alt, xlab="Longitude", ylab="Latitude" )
```


## Cropping Rasters

Just because we have a large raster does not mean that it is in your best interest to use the entire object.  Much of the spatial analyses routines used in population genetics require measurements of intervening distance, either Euclidean or ecological.  Many of the routines for estimation of these distances require the estimation of pairwise distance between *all* pixels.  For our purposes, the `arapat` dataset does not occur throughout most of this map, so it is in our best interests to use only the portion of the raster relevant to our data rather than the entire thing.  

Here is one way of going this.  I first define an extent, which consists of a vector representing the coordinates for `xmin`, `xmax`, `ymin`, and `ymax` (in decimal degrees longitude and latitude).  You then `crop()` the raster to that extent.

```{r}
e <- extent( c(-115,-109,22,30) )
baja_california <- crop( alt, e )
plot(baja_california, xlab="Longitude",ylab="Latitude")
```

Lets make this base map a bit more pretty by taking the altitude and estimating the slope of each pixel and the direction it is facing (aspect).  From this, we can 'shade' the hills in the map giving it more of a relief view we commonly see in maps. 

```{r}
slope <- terrain( baja_california, opt="slope" )
aspect <- terrain( baja_california, opt="aspect")
baja <- hillShade( slope, aspect, 40, 270 )
plot(baja, xlab="Longitude",ylab="Latitude", legend=FALSE)
```

Onto this map, we can plot our populations.  For this, we convert the raw coordinates into a `SpatialPoints` object (see \@ref(vector-data)) and then overlay onto the map.  I use two `points()` commands to make the symbol for each population.

```{r echo=-1}
library(gstudio)
data(arapat)
coords <- strata_coordinates(arapat)
pts <- SpatialPoints( coords[,2:3], proj4string = CRS(proj4string(baja)))
plot(baja, xlab="Longitude",ylab="Latitude", legend=FALSE)
points( pts, col="darkred", pch=3)
points( pts, col="red", pch=16)
```

### Cropping Rasters Via Polygons

It is also possible to crop a raster with a more fine grained approach using a polygon.  Here is an example using five points picked around the region of Loreto, BCS (I just grabbed these by looking at Google Earth).  You define a polygon by a series of points, the last of which has to be identical to the first one so that the polygon is a closed object and not just a series of points on a crooked line...

```{r}
pts <- rbind( c(-111.5,27.0),
              c(-112.4,26.7),
              c(-111.7,25.7),
              c(-111.1,25.4),
              c(-110.8,26.0),
              c(-111.5,27.0) )
pts
```

From these points, we construct a `SpatialPolygons` object (see \@ref{polygons} for more info on this convoluted construction) and then can overlay onto the map to make sure it in the correct vicinity (here we are eyeballing it a bit).  For more on why this next line of code looks so crazy, see \@ref(polygons).

```{r echo=-1}
polys <- SpatialPolygons(list(Polygons(list(Polygon(pts)),"Polygon")))
plot(baja, legend=FALSE)
plot(polys, add=TRUE)
```

To use the polygon to crop the raster, we have to both remove the part of the raster that is not contained within the polygon (`mask`) and then cut down the remaining raster to change the bounding box to that representing the portion of the data that remains (`trim`).  If you do not trim the raster, it will have the same amount of data associated with it as the previous raster (e.g., the underlying data matrix will have 960 rows and 720 columns) but the part that is masked will be represented by `NA` values.  For many rasters, the data is held in-memory (see the entry for 'data source` in the summary above) and as such removing as much of a raster that is `NA` improves your ability to manipulate it better.

```{r, fig.cap="Extraction of "}
loredo <- trim( mask( baja, polys ) )
plot(loredo)
```


### Cropping Rasters Via Convex Hull


An analysis common to modern population genetics is that of finding ecological distances between objects on a landscape.  The estimation of pairwise distance derived from spatial data is a computationally intensive thing, one that if you are not careful will bring your laptop to its knees!  One way to mitigate this data problem is to use a minimal amount raster area so that the estimation of the underlying distance graph can be done on a smaller set of points.  

Cropping by a polygon like demonstrated in the previous example is a 'by hand' approach to estimating a box that roughly encompasses your data points. A more efficient one is one where you simply provide your coordinates and we can estimate a polygon that surrounds those coordinates with the minimal amount of wasted space.  This is called a *Convex Hull*, which is kind of like a polygon that is created as if there was a rubbrerband fit around all your points.  It is a minimal area that includes all of your points.  

For this example, I'm going to use the populations found along the peninsula and find the minimal area encompassing those points. 

```{r message=FALSE, warning=FALSE, echo=-1}
library(gstudio)
data(arapat)
coords <- strata_coordinates(arapat)
baja_coords <- coords[ !(coords$Stratum %in% c("101","102","32")), ]
baja_pts <- SpatialPoints( baja_coords[,2:3])
plot(baja, legend=FALSE)
plot(baja_pts,add=T,col="darkred")
plot(baja_pts,add=T,col="darkred",pch=16)
```

The methods for finding the hull and adding a buffer around it are found in the `rgeos` package.  These are pretty easy functions to use and are very helpful.  If you are having trouble installing the `rgeos` package from source, see \@ref(rgdal-rgeos-packages).

```{r message=FALSE,warning=FALSE, echo=-1}
library(rgeos) # loads in gConvexHull & gBuffer functions
hull <- gConvexHull(baja_pts)
plot(baja, legend=FALSE)
plot(baja_pts,add=T,col="darkred")
plot(baja_pts,add=T,col="darkred",pch=16)
plot(hull,add=T,border="red")
```

The function `gConvexHull()` returns an object of type *SpatialPolygons*, just like we created before.  However, we now have a polygon that has each of our most 'outward' populations on the very edge of the polygon.  It may be beneficial for us to add a buffer around this polygon.

```{r echo=-1}
hull_plus_buffer <- gBuffer(hull,width=.2)
plot(baja, legend=FALSE)
plot(baja_pts,add=T,col="darkred")
plot(baja_pts,add=T,col="darkred",pch=16)
plot(hull_plus_buffer, add=T, border="red")
```

Now, we can mask and trim it to include only the area of interest.

```{r echo=-1}
pop_hull <- trim( mask(baja,hull_plus_buffer) )
plot(pop_hull, legend=FALSE, xlab="Longitude", ylab="Latitude")
plot(baja_pts,add=T,col="darkred",pch=16)
```

This would be a great raster to start looking at ecological separation in since we have removed the extraneous data that would unintentionally cause problems with the distance estimators.

## Modifying Rasters

We can modify rasters just as easily as we can crop them.  They are matrices, after all.  The square bracket indexing you use for matrices are just as effective as before.  

In the next example, I mask the landscape based upon elevation.  I create a copy of the original raster and then make everything whose elevation is less than 500m as missing data.  Plotting this over the top of the original raster shows only locations where elevation exceeds this cutoff.  

```{r echo=-1}
bc <- baja_california
bc[ bc < 500 ] <- NA
plot( baja_california, legend=FALSE, col="darkgrey" )
plot( bc, add=TRUE, legend=FALSE)
```

## Extracting Point Data From Rasters

So far, rastes have been confined to representing a single static object.  However, it is not uncommon to need to query a raster and find out the values at particular points.  These points may be pre-defined or they may be dynamic (e.g., you need to point at a location on the map and determine the value there).

For queries of the first kind, we can use the `extract()` function.  For this I downloaded the average temperature and precipitation rasters from Worldclim.

```{r}
baja_temp <- raster("./spatial_data/bio1.tif")
baja_prec <- raster("./spatial_data/bio12.tif")
```

And then extract the values from each of these layers into the `coords` data we already have set up.

```{r}
coords$elevation <- extract( baja_california, coords[,c(2,3)])
coords$mean_temp <- extract( baja_temp, coords[,c(2,3)])
coords$mean_precip <- extract( baja_prec, coords[,c(2,3)])
coords[1:10,]
```

A note should be made on the temperature and precipitation values.  Temperature is denoted in tenths of a degree Celcius.  Though it does get quite hot at times, it does not average 188&deg;C at population 88!  Similarly, the units for precipitation are mm (or tenths of centimeters if you will...).


```{r warning=FALSE, message=FALSE}
library(ggrepel)
coords <- coords[ order(coords$Latitude),]
p <- ggplot( coords, aes(x=Latitude,y=elevation)) + geom_line(color="lightgrey") 
p <- p + geom_point() + ylab("Elevation (m)")
p + geom_text_repel(aes(label=Stratum), color="red") 
```

The package `ggrepel` provides a pseudo-smart labelling geometry for ggplot allowing you to have labels that are shifted around the points so as to maximize visibility.

For inquires of the second type, we can use the function `click()` to retrieve one of several outputs. Here is the help file that describes the various components. 

<div class="scrollingbox"><pre>
<pre>click {raster}	R Documentation
Query by clicking on a map

Description

Click on a map (plot) to get values of a Raster* or Spatial* object at that location; and optionally the coordinates and cell number of the location. For SpatialLines and SpatialPoints you need to click twice (draw a box).

Usage

## S4 method for signature 'Raster'
click(x, n=Inf, id=FALSE, xy=FALSE, cell=FALSE, type="n", show=TRUE, ...)

## S4 method for signature 'SpatialGrid'
click(x, n=1, id=FALSE, xy=FALSE, cell=FALSE, type="n", ...)

## S4 method for signature 'SpatialPolygons'
click(x, n=1, id=FALSE, xy=FALSE, type="n", ...)

## S4 method for signature 'SpatialLines'
click(x, ...)

## S4 method for signature 'SpatialPoints'
click(x, ...)
Arguments

x	- Raster*, or Spatial* object (or missing)
n	- number of clicks on the map
id - Logical. If TRUE, a numeric ID is shown on the map that corresponds to the row number of the output
xy - Logical. If TRUE, xy coordinates are included in the output
cell - Logical. If TRUE, cell numbers are included in the output
type - One of "n", "p", "l" or "o". If "p" or "o" the points are plotted; if "l" or "o" they are joined by lines. See ?locator
show - logical. Print the values after each click?
...	- additional graphics parameters used if type != "n" for plotting the locations. See ?locator

Value

The value(s) of x at the point(s) clicked on (or touched by the box drawn).

Note

The plot only provides the coordinates for a spatial query, the values are read from the Raster* or Spatial* object that is passed as an argument. Thus you can extract values from an object that has not been plotted, as long as it spatially overlaps with with the extent of the plot.

Unless the process is terminated prematurely values at at most n positions are determined. The identification process can be terminated by clicking the second mouse button and selecting 'Stop' from the menu, or from the 'Stop' menu on the graphics window.

See Also

select, drawExtent

Examples

r <- raster(system.file("external/test.grd", package="raster"))
#plot(r)
#click(r)
#now click on the plot (map)</pre></div>

Here is the output from a single inquire on the `baja_california` raster map.

```{r eval=FALSE}
plot( baja_california, legend=FALSE)
click(baja_california, xy=TRUE)
##           x        y value
## 1 -113.3875 27.82083   116
```


### Stacks of Rasters

It is not uncommon to be working with many different raster layers at the same time.  Instead of loading them individually, the `raster` library has a `RasterStack` object that can hold several rasters at one time and can be used in places where we would use individual rasters.  Here is an example using the elevation and temperature rasters for tile 22.

```{r}
files <- c("./spatial_data/alt.tif", "./spatial_data/bio1.tif", "./spatial_data/bio5.tif", "./spatial_data/bio6.tif")
bio_layers <- stack( files )
bio_layers
```

Performing operations on a stack is as easy as performing them on individual layers.  Here, I trim them to the hull defined above.

```{r}
e <- extent( c(-115,-109,22,30) )
bio_layers <- crop( bio_layers, e )
plot(bio_layers)
```

From which values may be extracted using normal methods as outlined above.

```{r}
df <- extract( bio_layers, baja_pts)
df <- df[ !is.na(df[,1]),]
head(df)
```

And visualized normally.

```{r}
library( GGally )
ggpairs( as.data.frame(df))
```


## Rasters & ggplot

As is the case with a lot of data types in R, there is a way to use the `ggplot` library to visualize rasters.  Essentially, what you need to do is to transform your raster objects into `data.frame` objects for `ggplot`'s `geom_tile()` function.  Here is an example.

```{r}
library(ggplot2)
df <- data.frame( rasterToPoints( baja_california )) 
names(df) <- c("Longitude","Latitude","Elevation")
p <- ggplot( df ) + geom_tile( aes(x=Longitude,y=Latitude,fill=Elevation)) 
p <- p + scale_fill_gradientn( colors=c('#a6611a','#dfc27d','#f5f5f5','#80cdc1','#018571'))
p <- p + coord_equal() + xlab("Longitude") + ylab("Latitude") 
p
```


As usual, we can add additional information to the plot and as we would for any other `ggplot` object.  Here I'll add the populations and indicate if they have samples in them that are of one species (Pure) or have a mix of the two (Mixed).

```{r}
num.clades <- colSums( table(arapat$Species, arapat$Population) > 0 )
Stratum=names(num.clades)
Species= factor( c("Pure","Mixed")[num.clades] )
tmp.df <- data.frame( Stratum, Species )
sites <- merge( coords, tmp.df )
p + geom_point( aes(x=Longitude,y=Latitude, shape=Species), size=3, data=sites )
```


## 3D Visualization

It is also possible to visualize rasters in 3-space.  The library `rasterVis` provides an interface to the `rgl` library to plot a surface.  Once installed, these are easy to use for viewing surfaces. Here is an example using the elevation data we have been playing with.

```{r eval=FALSE}
library(rasterVis)
plot3D( baja_california , zfac=0.1)
```
```{r echo=FALSE}
knitr::include_graphics("./media/Baja_Plot3D.png")
```

The `zfac` option in the plot is the amount to scale the z-axis (elevation) in relation to the x-axis and y-axis dimensions.  It is a bit exagerated at `zfac=0.1` but you get the idea.


<!-- TODO: Add Categorical Rasters -->

<!-- ## Categorical Rasters  -->

<!-- **ADD HERE** -->


## Saving & Exporting Rasters

There are many situations where you need to save a raster you've manipulated.  As these raster objects are R objects, you can save them directly to the file system using the `write()` as:

```{r eval=FALSE}
write( baja_california, filename="baja_california.rda")
```

This will save the raster object to file exactly like it is in your R session.  To load it back in you just use `load()` and it is returned just like it was.  The benefit to saving these as R objects is that you do not need to change it at all to pick up where you left off.

You may also need to export the raster in a non-R format for external analyses.  To do this, you use the `writeRaster()` function.  The file extension is used to determine the file format used and R saves it automatically.

```{r eval=FALSE}
writeRaster( baja_california, filename="baja_california.tif")
```


<!--chapter:end:raster_data.rmd-->

# Map Projections {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_butterfly2.jpg"></div>

A spatial projection is a mathematical representation of a coordinate space used to identify geospatial objects.  Because the earth is both non-flat and non-spheroid, we must use mathematical approaches to describe the shape of the earth in a coordinate space.  We do this using an ellipsoid—a simplified model of the shape of the earth.  Common ellipsoids include:  

- NAD27 (North American Datum of 1927) based upon land surveys  
- NAD83 based upon satellite data measuring the distance of the surface of the earth to the center of the plant.  This is also internationally known as GRS80 (Geodetic Reference System 1980) internationally.  
- WGS84 (World Geodetic System 1984) is a refinement of GRS80 done by the US military that was used in the development of GPS systems (and subsequently for all of us).

Onto this ellipsoid, we must define a set of reference locations (in 3-space) called datum that help describe the precise shape of the surface.


## Projections

A projection onto an ellipsoid is a way of converting the spherical coordinates, such as longitude and latitude, into 2-dimensional coordinates we can use.  There are three main types of approaches that have been used to develop various projections. (see wikipedia for some example imagery of different projections). 

These include:  

- Azimuthal:  An approach in which each region of the earth is projected onto a plane tangential to the surface, typically at the pole or equator. Cylindrical: This approach projects the surface of the earth onto a cylinder, which is ‘unrolled' like a large map.  This approach ‘stretches' distances in a east-west fashion, which is why Greenland looks so large...
- Conic: Another ‘unrolling' approach, though this time instead of a cylinder, it is projected onto a cone.  

All projections produce bias in area, distance, or shape (some do so in more than one), so there is no 'optimal' projection.  To give you an idea of the consequences of these projections, I'll use the United States map as an example and we can visualize how it is projected onto a 2-dimensional space using different projections.

### Equatorial Projections

These are projections centered on the Prime Meridian (`Longitude=0`) 

*Mercator Projection*
```{r projections1, warning=FALSE, message=FALSE}
library(maps)
map("state",proj="mercator")
```

*MollWeide Projection*
```{r}
map("state",proj="mollweide")
```

*Gilbert Projection*
```{r}
map("state",proj="gilbert")
```


*Cylequalarea Projection*

Some projections require additional parameters, this one is based upon equally spaced and straight meridians, equal area, and true centered on a particular Latitude.  I used the centroid of the US.

```{r}
map("state",proj="cylequalarea",par=39.83)
```

### Azimuth Projections

These projections are centered on the North Pole with parallels making concentric circles.  Meridians are equally spaced radial lines.

*Orthographic Projection*
```{r}
map("state",proj="orthographic")
```

*Stereographic Projection*
```{r}
map("state",proj="orthographic")
```

*Perspective Projection*

Here the parameter is the distance (in earth radii) the observer is looking.

```{r}
map("state",proj="perspective", param=8)
```

*Gnomonic Projection*
```{r}
map("state",proj="gnomonic")
```


### Polar Conic Projections

Here projections are symmetric around the Prime Meridian with parallel as segments of concentric circles with meridians being equally spaced.

```{r}
map("state",proj="conic",par=39.83)
```

```{r}
map("state",proj="lambert",par=c(30,40))
```


### Miscellaneous Projections

*Square Projection*
```{r}
map("state",proj="square")
```


*Hexagon Projection*
```{r}
map("state",proj="hex")
```

*Bicentric Projection*
```{r}
map("state",proj="bicentric", par=-98)
```

*Guyou Projection*
```{r}
map("state",proj="guyou")
```

There are a lot of ways to project a 3-dimensional surface onto a 2-dimensional representation.  Be aware of what you are using and how you are using it when plotting materials.

### Reprojecting Rasters

When working with rasters, we can reproject these onto other projections rather easily.  Here is an example from the worldclim elevation tile we used previously (see \@ref(cropping-rasters)).

```{r message=FALSE, warning=FALSE}
library(raster)
alt <- raster("./spatial_data/alt.tif")
e <- extent( c(-115,-109,22,30) )
baja_california <- crop( alt, e )
baja_california
```

We can now project it to another projection, lets say Lambert Conic Conformal.

```{r warning=FALSE,message=FALSE}
library(rgdal)
projection(baja_california)
baja_lcc <- projectRaster( baja_california, crs="+proj=lcc +lat_1=48 +lat_2=33 +lon_0=-100 +ellps=WGS84")
baja_lcc
```

These two projections influence the region as shown below.

```{r echo=FALSE}
par(mfrow=c(1,2))
plot(baja_california, legend=FALSE, xlab="Longitude",ylab="Latitude")
plot(baja_lcc, legend=FALSE )
par(mfrow=c(1,1))
```


### Projecting In GGPlot

As usual, there is probably a way to plot these values in `ggplot` to make the output just a little bit more awesome. Projections of data in `ggplot` displays can be manipulated by appending a `coord_*` object to the plot.  Here are two examples using a mercator and azimuth equal area projection of the state maps.

```{r message=FALSE, warning=FALSE}
library(ggplot2)
states <- map_data("state")
map <- ggplot( states, aes(x=long,y=lat,group=group))
map <- map + geom_polygon( fill="white",color="black")
map <- map + xlab("Longitude") + ylab("Latitude")
map + coord_map("mercator") 
```

Conversely, we can plot it using the equal area Azimuth projection

```{r}
map + coord_map("azequalarea")
```

or fisheye

```{r}
map + coord_map("fisheye",par=3)
```

or any other projection available listed in the `mapproject()` function.


## Coordinate Systems
In R, we typically are dealing with a combination of data that we've collected and that we've attained from some other provider.  In most GIS applications, the coordinate systems we encounter are either:  

- UTM (Universal Transverse Mercator) measuring the distance from the prime meridian for the x-coordinate and the distance from the equator (often called northing in the northern hemisphere) for the y-coordinate. These distances are in meters and the globe is divided into 60 zones, each of which is 6 degrees in width.
Geographic coordinate systems use longitude and latitude.  For historical purposes these are unfortunately reported in degrees, minutes, seconds, a temporal abstraction that is both annoying and a waste of time (IMHO).  
- Decimal degrees, while less easy to remember, are easier to work with in R.
- State Planar coordinate systems are coordinate systems that each US State has defined for their own purposes.  They are based upon some arbitrarily defined points of reference and another pain to use (IMHO).  Given the differential in state area, some states are also divided into different zones.  Maps you get from municipal agencies may be in this coordinate system.  If your study straddles different zones or even state lines, you have some work ahead of you...

It is best to use a system that is designed for your kind of work.  Do not, for example, use a state plane system outside of that state as you have bias associated with the distance away from the origin.  That said, Longitude/Latitude (decimal degrees) and UTM systems are probably the easiest to work with in R.  

In R, we use rgdal to project points.  Here I load in the coordinates of the populations in the *Arapatus attenuatus* data set and make a *SpatialPoints* object out of it.  Setting the `proj4string()` here **does not** project the data, I am just specifying that the data are already in the lat/long WGS84 format.

```{r message=FALSE, warning=FALSE}
library(sp)
library(gstudio)
data(arapat)
coords <- strata_coordinates( arapat )
pts <- SpatialPoints( coords[,2:3] )
proj4string(pts) <- CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0")
pts
```

The `CRS()` function holds the definition of the projection and interfaces between the PROJ.4 and RGDAL libraries.  To project a set of data points into a new coordinate systems, we use `spTransform()` and pass it the definition of the new system to use.

### Changing Datum

We will first begin by looking at differences in the actual datum used to record the loation of plots.  Here I compare the decimal Longitude/Latitude we've used thus far with that from the Universal Transverse Mercator ([UTM](https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system)).

```{r}
pts.utm <- spTransform(pts, CRS("+proj=utm +zone=12 +datum=WGS84"))
summary( pts.utm )
```

You can see the tranformations in the coordinate system by comparing the plots below.  The relative position of each point is the same.

```{r eval=TRUE,echo=FALSE}
par(mfcol=c(1,2))
plot( pts, axes=TRUE, xlab="Longitude", ylab="Latitude" )
plot( pts.utm, axes=TRUE, xlab="Easting",ylab="Northing" )
par(mfcol=c(1,1))
```


<!--chapter:end:projections.rmd-->

# Ecological Distance {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_mountain.jpg"></div>


Previously we have been focused on population separation in Euclidean space (e.g., physical distance).  However, populations may be functionally separated by features of the intervening landscape.  It is entirely possible that features existing between populations may influence the ability of gene flow to maintain connectivity between sites.  

```{r echo=FALSE, fig.cap="Schematics showing consequences of intervening habitat effects on connectivity.  A. Populations A and B separated in Euclidean space (black arrow) have constant resistance to connectivity (red line).  B. Populations separated by similar spatial separation (black arrow) have increased overall resistance to connectivity (red line) due to the presence of the intervening forest."}
knitr::include_graphics("./media/LandscapeResistance.png")
```

It can be argued that the *sine qua non* of applied population genetics (or landscape genetics if you like that term) is in the identification of the which subset of ecological or other features influence the connectivity of populations.  Throughout the history of population genetics, a lot of theory has been developed ignoring the effects of other organisms, the environment in which micro evolutionary processes are occurring, etc.  In the last few decades, the field has moved towards more inclusive, and site-aware, approaches and in this chapter we get into some of the ways in which you can include data available to you in both Vector (\@ref(vector-data)) and Raster (\@ref(raster-data)) formats.  

Fundamental to the newer ways we are analyzing our data is the concept of resistance.  While there may be many kinds of resistance, they are can all be described by a simple fundamental notion that there exists a quantitative value that can be assigned to the separation of locales on a landscape.  Consider the inset images in Figure \@ref(fig-legos), stylized using a Lego representation of myself and a tasty meal.  

```{r legos, echo=FALSE, fig.cap="Resistance as a fundamental concept in understanding population genetic structure.  Resistance, measured between spatial locations can be entirely spatial (isolation by disance), due to barriers (isolation due to barriers), a function of a complex intervening habitat (isolation by resistance), or any combination thereof (insets, clockwise from top)."}
knitr::include_graphics("media/characteristic_model.png")
```

These different external resistance examples represent:  

- *Isolation by Distance:*  Under this classic model (dating back almost 90 years now), the resistance value is defined by Euclidean distance between two sites. Essentially, the surface over which the individual is walking is based upon a uniform value (whatever the spatial unit being used) and is homogeneously spread across the intervening landscape.  Euclidean geometry provides the shortest path.  
- *Isolation by Barriers:* This is a model in which some discrete boundary on the landscape changes the resistance value (in this case it is more difficult to cross a river).  In a model like this, the shortest path between the person and the meal must cross the river and must do so in a way that minimizes the accumulated distances.  However, travel on each side of the barrier would accumulated resistance identical to that found under the IBD model (above).  
- *Isolation by Resistance (or habitat or whatever is the latest fad):*  Here the landscape consists of a heterogeneous mix of resistance values depending upon what features the individual must traverse.  Some may be impenitrable (a very large resistance value), some may be just accumulate larger than normal resistance values.

In all three of these cases, the underlying distance matrix (or raster in our representation) is estimated and filled with values.  For IBD, it is uniform in magnitude across the matrix.  Barriers have one or more spatially contiguous features with increased, or perhaps reduced if we consider them as corridors---the math is exactly the same.  For the more complicated scenario, the resistance matrix has a wide range of resistance values associated with indiviudal cells.  Despite the aparent differences, the underlying approahces are identical, we define a resistance matrix and then use some measure of separation to measure how far apart each of the points of interest are located.  The important point is that the fundamental concept of resistance underlies all these different 'models'.  In fact, you could argue (and you would be statistically correct in doing so) that even the Island Model and Stepping Stone models are resistance models.

As a demonstration of how we do the actual calculations in `R`, we will start with the venerable `arapat` data set.  Only a subset of populations will be used for the following reasons:  

1. The mainland populations are not connected, on the reduced raster, by land.  As such, we cannot find an ecological path from those populations to the rest on the Baja Peninsula.  It is possible to connect them across the top of the Sea of Cort\'ez but that is well beyond this species natural range, making it effectively isolated.
2. Population `ESan` is located on an island and as such has no direct connection on the rasters we will be using to the remaining populations.
3. The population `165` is located in a place that the WorldClim rasters do not record land.  The rasters used have a resolution of 30-arc seconds (~1km) and this population is located at a place where that granularity of that grid does not recognize land.

```{r mapArapatCoords, message=FALSE, warning=FALSE}
library(gstudio)
library(ggrepel)
library(ggmap)
data(arapat)
data <- arapat[ !(arapat$Population %in% c("101","102","32","ESan","165")) , ]
coords <- strata_coordinates( data )
map <- population_map(coords, zoom=7)
ggmap( map ) + geom_point( aes(x=Longitude,y=Latitude),data=coords) + 
  geom_text_repel( aes(x=Longitude,y=Latitude,label=Stratum), data=coords)
```

To start with, I will use the elevation raster and modifications thereof.  The ultimate goal here is to estimate ecological separation using a variety of algorithms and as such we need to minimize the size of the background raster as much as possible.  In the code below, I take the elevation raster and create the convex hull around the population locations with a smallish buffer around the points so that the outermost populations will not be on the very edge of the data. I further constrain this raster by cutting regions within which we have no evidence that the target species exist, in this case we have almost no records of it occurring above 700m.

```{r eco-convex-hull, warning=FALSE, message=FALSE}
library(rgeos) 
library(raster)
alt <- raster("./spatial_data/alt.tif")
baja_pts <- SpatialPoints( coords[,2:3])
hull <- gConvexHull(baja_pts)
hull_plus_buffer <- gBuffer(hull,width=.2)
alt <- trim( mask( alt, hull_plus_buffer))
alt[ alt > 700 ] <- NA
plot(alt, xlab="Longitude", ylab="Latitude")
```

To start with, we should probably examine the distribution of the variable we are working with in the locations at which the samples are taken.

```{r elevation}
coords$elevation <- extract( alt, coords[,2:3])
ggplot( coords, aes(x=elevation)) + geom_histogram( binwidth = 25 ) + 
  xlab("Elevation (m)") + ylab("Frequency")
```

We can compare this against the 'background' values that we find across the raster, even in places where our samples were not taken.  Deviations observed in the distribution of values sampled from our sites from that defined in the background suggest that our sampling was not entirely random see \@ref(niche-modeling) for a more complete discussion on how to evaluate the relative importance of particular background features in the distribution of your organism.

```{r}
s <- values( alt )
s <- s[ !is.na(s) ]
Category <- c(rep("Observed",nrow(coords)), rep("Background",length(s)))
Elevation <- c(coords$elevation, s)
df <- data.frame( Category,Elevation)
ggplot( df, aes(x=Elevation,fill=Category)) + geom_density( alpha=0.75 ) +  
  xlab("Elevation (m)") + ylab("Frequency")

```


## The Concept of Resistance

The way in which the intervening habitat may influence genetic connectivity can be both varied and complex.  In R we quantify spatial data using a `raster` format.  As discussed in \@ref(raster-data), a raster is essentially a matrix whose entries represent some value that has both spatial location and extent. 

```{r echo=FALSE, fig.cap="Example raster representing a landscape with two different habitat types on it (light and dark)."}
knitr::include_graphics("./media/RasterGrid.png")
```

Several approaches are available to parameterizing a cost surface.  These range from expert opinion (where someones opinion fills in the gaps), to externally defined variables, to approaches that consider entire ranges of resistance.

```{r parameterizedRaster, echo=FALSE, fig.cap="Dichotomous cost surface containing two features, one four times more difficult to traverse than the other."}
knitr::include_graphics("./media/RasterGridParameterized.png")
```

Independent of how the values are set (though this is a huge component which can be addressed outside the context of how this is done), the raster (as in Figure \@ref(fig:parameterizedRaster)) can be considered as a component of a potential hypothesis.  It is not a complete hypothesis because we must first decide how distances on that raster are estimated---the result of which will be the complete hypothesis.  We return to ways in which distances can be estimated below (\@ref(estimating-separation)).

```{r echo=FALSE, fig.cap="Individual resistance surfaces may be combined to produce interaction (multivariate) cost distance surfaces.  In this example, the categorial surface (left) is combined with a cost sufrace representing a cline (center raster) to produce via the addition operator a surface representing both (right)."}
knitr::include_graphics("./media/MultivariateCostRaster.png")
```


Once we have a set of cost surfaces defined, our goal is to determine if there are subsets which help to explain the observed genetic structure we find on the landscape (Figure \@ref(fig:EuclVsEcol)).

```{r EuclVsEcol, echo=FALSE, fig.cap="Two different cost surfaces, Euclidean distance (left) and the multivariate ecological distance derived perviously (right).  The extent to which each of these resistance hypotheses can explain the observed genetic distances (graphs over top"}
knitr::include_graphics("./media/EuclideanVsEcological.png")
```

No matter how we actually measure separation, we can think of cost surfaces as belonging to one of two mutually exclusive categories; absolute resistance and relative resistance.  Absolute resistance is invariant to source and destination whereas relative landscape resistance is depending upon starting location.


### Absolute Resistance

Absolute resistance is the condition by which landscape features provide an invariable source of resistance to movement.  Examples here may include physical barriers that exert the same magnitude of resistance to movement no matter where the initial and destination locations are found.  For example, in an urban environment, a building provides an absolute resistance with respect to plant pollinators---they cannot move through that object no matter which direction they are coming from.  

In the example below, I will use the slope of the terrain as a example under the assumption that there a level of steepness of the terrain that prevents movement of individuals between locales. It does not matter at which direction the organism approaches this steep part of the landscape (uphill, downhill, or cross slope), their movement is restricted in a uniform manner. 

```{r echo=-1}
slope <- terrain(alt, unit='degrees')
plot(slope)
plot(baja_pts,add=T)
```

If we look at the distribution of slope values we can see the vast majority of the lower part of Baja California has a rather gentle slope.  

```{r distribution-of-slope, fig.cap="Distribution of estimates for slope (in degrees) across all pixels in the altitude raster."}
df <- data.frame( rasterToPoints(slope))
ggplot( df, aes(x=slope)) + geom_histogram( binwidth=0.1) + 
  xlab("Slope (degrees)") + ylab("Frequency")
```

In the following examples, I'm going to use the distribution of values shown in Figure \@ref(fig:distribution-of-slope).  


```{block, note-slope-scale, type='comment_information'}
It is very important to note that the slope we are deriving here is based upon an average across a 30-arc second pixel size in the desert.  We are perhaps missing a lot of sub-kilometer slop-y areas that prevent movement.  However, this is an example meant more for the importance of providing the methodology of how you would do this than an expectation explaining biological relevance.
```

For the purposes of this example, I'll define three categories of slope.  The values returned by the `terrain()` function in this case are in degrees (though radians and tangent are also possible, see `?terrain`).  A new raster will be created, based slope, with the following (somewhat arbitrary but relevant as an example) categories:  

- Slope < 5$^{\circ}$ has no effect on movement.
- Slope at a location between 5$^{\circ}$ and 10$^{\circ}$ result in a movement resistance twice as difficult as the low/no slope condition.
- Slope in excess of 10$^{\circ}$ is categorized as four times as difficult for movement than the low/no slope areas on the map.


```{r}
tmp <- slope
tmp[ slope < 5 ] <- 1
tmp[ slope > 5 ] <- 2
tmp[ slope > 10 ] <- 4
cat_slope <- ratify(tmp, count=TRUE)
rat <- levels( cat_slope )[[1]]
rat$Slope_Type <- c("None","Low","High")
levels(cat_slope) <- rat 
cat_slope
```

For this raster, I set up the categories and define a *raster attribute table* (see \@ref(categorical-rasters)) to make the output more legible.  If we visualize this raster, we see the landscape partitioned into these three categories. 

```{r}
plot(cat_slope, legend=FALSE)
```

From here, we can proceed to estimating distances among points.  This absolute cost surface is just one example.  You could have used the raw value of slope (larger linearly becoming more difficult), a function of slope, or any variant of coming up with a metric to create the values in each pixel.  The point here is that this is an absolute cost, and all estimates of distance, independent of origin and destination, use the same cost raster.


## Relative Resistance

The other way in which rasters may be created are are based upon relative cost distances.  These cost surfaces are created, relative to some particular benchmark point on the surface.  For example, if you think that phenology (e.g., the timing of reproduction) may be an important component contributing to connectivity among populations (as is the case for plant species), it is not the absolute value of elevation that is of concern across the landscape.  Rather it is the similarity in elevation that is the driver---plant populations at the same elevations may tend to flower in higher synchrony than those separated across elevation gradients.  As such, when we estimate cost surfaces for something like this, we would focus on on the magnitude of elevation but on the absolute difference in elevation between points.  Under this *relative* cost approach, the distance from each locale would be based upon the deviance in elevation from that point to the rest of the landscape.  

I'll use this concept of elevation similarity in this example, starting with the elevation of population `163`, which will be the benchmark.

```{r}
idx <- which( coords$Stratum == "163")
target_elevation <- coords$elevation[ idx ]
target_elevation
```

From this target elevation, we can create an elevation raster that is based upon the deviation between the benchmark and everyone else.  Since this is a cost surface we are creating we need to perhaps include a base cost instead of having elevation clines with zero resistance.  For this, I add a unit cost to the entire raster.

```{r}
tmp <- alt - target_elevation
tmp <- abs( tmp ) + 1
plot( tmp )
```

For subsequent analyse, the distance between this specific locale and the remaining can be estimated using this raster.  However, if we are to move to the next population, the distance between that locale and this one would need to be based on a new elevation benchmark.  This means that for $K$ populations, pairwise estimates of relative distance need to be estimated from $K$ different rasters^[Assuming of course that each of your $K$ locale are at a unique elevation, if some are at the same elevation then the derived rasters will be identical.]!  


## Estimating Separation

Once we have defined one or more rasters quantifying resistance costs, we can go and estimate the distance among points on those rasters.  There are several ways available for us to estimate distance.  The differences in these methods are based upon the algorithm used to estimate separation and the extent of the raster used to apply these algorithms.

### Least Cost Distance

A least cost path is the route between two points, say $A \to B$, whose measured cost is minimal.  This is the *optimal* route across the landscape.  Organisms or vectors of gene exchange may, or may not, move across a landscape following an *optimal* path.  An example shortest path is shown in Figure \@ref(fig:lcp-example).

```{r lcp-example, echo=FALSE, fig.cap="Estimation algorithm for least-cost path distance.  The path chosen is the one with the shortest overall distance, or at least one of the paths with equally short overall distances."}
knitr::include_graphics("./media/LCP.png")
```

One way to estimate distances in R, is through the use of the `gdistance` library.  In this approach, we define a transition object based upon:  
1. The cost distance raster.  By default, the `transition()` function works on conductance, which is the inverse of resistance.  In these examples, we have used a single raster, though use of *RasterBrick* objects is just as appropriate.
2. The function by which we estimate the pixel-to-pixel distances.  
3. The neighborhood around each pixel that we look at during each step.  Options for this include:  
  - A von Neumann neighborhood (`directions=4`) consists of the pixels immediately above, below, and on both sides of the target pixel.  
  - A Moore's neighborhood (`directions=8`) consisting of all pixels surrounding the target pixel (e.g., even the diagonal ones)
  - A Knight & One-Cell Queen move (`directions=16`), which adds the next layer of cells outside of Moore's neighborhood.

Once estimated, the transition object *must* be corrected for if you are either using a large extent based upon Latitude and Longitude datum (e.g., they are not unit-wise consistent in area), or you have used a `direction` option other than the von Neuman Neighborhood.

In the example below, I use the cost distance estimated based upon similarity in elevation, under the hypothesis that phenological synchrony may influence connectivity.

```{r message=FALSE, warning=FALSE}
library(gdistance)
tr <- transition( 1/tmp, transitionFunction = mean, directions = 4 )
tr <- geoCorrection( tr, type="c", multpl=FALSE, scl=FALSE)
tr
```

From this, potentially corrected, transition object we can estimate shortest path distance using the aptly named function `shortestPath()`, passing it the transition object, the point from which we are starting, and the destination, and an optional parameter on how we want the output returned.  In the example below, the results is returned as a *SpatialLines* object which I use to plot.

```{r}
path.1 <- shortestPath( tr, baja_pts[26], baja_pts[15], output="SpatialLines")
plot( tmp , xlab="Longitude", ylab="Latitude")
lines( path.1, col="red")
points( coords[c(26,15),2:3],pch=16, col="red")
```

As explained above, relative cost distances may create asymmetric distances.  If the shortest path between the two populations is taken in the opposite direction, we get different shortest path across the landscape.

```{r}
tmp <- alt
tmp <- abs( tmp - coords$elevation[ 15 ] ) + 1
tr <- transition( 1/tmp, transitionFunction = mean, directions = 4 )
tr <- geoCorrection( tr, type="r", multpl=FALSE, scl=FALSE)
path.2 <- shortestPath(tr,baja_pts[15],baja_pts[26],output="SpatialLines")
plot( alt , xlab="Longitude", ylab="Latitude")
lines( path.1, col="red")
lines( path.2, col="blue")
points( coords[c(26,15),2:3],pch=16, col="red")
```

Not only is it a different path, they are also different lengths.

```{r}
c( SpatialLinesLengths( path.1 ), SpatialLinesLengths( path.2 ) )
```


### All Paths (Circuit) Distance

Organisms may not move in an *optimal* way across the landscape but may move along several different paths, most of which are *unoptimal*.  The relative proportion of these alternative paths and their lengths may be combined to produce a distance between sites based upon all potential paths, $A \Rightarrow B$.  In the context of genetic connectivity, this method is referred to by the name of one of the software packages available to estimate it, [Circuitscape](http://circuitscape.org) by McRae (2006).  An example is shown in Figure \@ref(fig:ct-example).

```{r ct-example, echo=FALSE, fig.cap="Estimation algorithm for all-paths (circuit) distance.  The estimated distance between points $A$ and $B$ is based upon the length and relative frequencies of all potential paths across the landscape, even thouse that traverse more resistant portions of the landscape (highlighted in yellow)."}
knitr::include_graphics("./media/CT.png")
```

Using as an example, you can apply this algorithm to the resistance surface based upon relative deviation from population 93 (as in the last example). Since this method uses *all* paths that leave the the target site and can arrive at the destination location, you cannot produce a single path example.  What you do get from it is a spatial current raster from that site (e.g., the spatial flow away from it) as shown in Figure \@ref(fig:circuitscape-from-93). 

```{r circuitscape-from-93, echo=FALSE, fig.cap="Estimation of current map based upon relative cost distance from conditions observed at site 93.", echo=-(2:3)}
r <- raster("media/Baja_Out_curmap_15.asc")
plot( log10(r), legend=FALSE, xlab="Longitude", ylab="Latitude" )
plot( baja_pts[15], add=TRUE )
```


### Path & Corridor Constraints

Both least-cost, and all-paths approaches consider the entire cost surface in the estimation of distances. This is a *very important* point that needs to be made to emphasize some of the constraints that may be imposed on your actual estimation process.  

These approaches use graph-theoretic algorithms to estimate the length of various paths across the cost surface.  To do this, each pixel is treated as a node in a network, only after which the length of the paths may be estimated.  This is why it is important to limit the potential size of your background cost raster to an area that includes your sampling sites but not regions through which connectivity will not traverse.  For example, there is a very large number of paths that exist outside the convex hull displayed in Figure \@ref(fig:eco-convex-hull), the overwhelming majority of which have absolutely no influence on connectivity among the sites we are interested in studying.  If you do not restrict the cost raster as we did, the extra pixels are added network that needs to be constructed to estimate distance, independent of the fact that they may not be a member of the shortest (or even not longest) paths contributing to connectivity.  Numerically, algorithms such as Dijkstra'a (one for finding the single shortest path), scale in time with the number of nodes as show below (Figure \@ref(fig:dijkstra-time)).  

```{r dijkstra-time, echo=FALSE}
minNodes <- 2
maxNodes <- 100
stepNodes <- 2
v <- seq(minNodes,maxNodes,by=stepNodes)
e <- v* (v-1)/2
df <- data.frame( Pixels=v, Time=e+v*log10(v))
ggplot(df, aes(Pixels,Time)) + geom_line() + geom_point() + xlab("Number of Pixels in Raster") + ylab("Time Complexity")
```

Computer scientists use the notion of time complexity, a measure defined as the number of steps in the code required to complete the task.  This algorithm is described as $O(|e| + |v| \log |v|$, where $|e|$ is the number of edges and $|v|$ is the number of nodes.  In our example, the number of nodes, $|v|$ is how many pixels are in the landscape and the number of edges is $|v|(|v|-1)/2$!  This gets big quickly so anything you can do to limit the number of potential pixels in your landscape is important.

In addition to limiting the size of the raster outside your study area, one can also limit the intervening areas by imposing some constraints on where these paths may be found.  A common approach here is to define corridors between sites.  These corridors have a restricted width through which these paths may traverse (or at least through which the algorithms may look for these paths).  The width of these corridors may be either defined *a priori* or determined as a component of the analysis itself.


## Estimating Correlations in Distance

In the end, we are interested in getting to the pairwise distances among all sites, we extend the paths approaches as we outlined above.  For these examples, I'll use the categorical slope example and use both the least cost and the all-paths approaches.  Here is the raster.

```{r fig.cap="Cost raster for elevation based upon slope categories across regions for all peninsular Baja California populations.", echo=FALSE}
plot(cat_slope, legend=FALSE, xlab="Longitude",ylab="Latitude", 
     main="Categorical Slope Resistance")
plot( baja_pts, add=TRUE )
```


### Pairwise Least Cost Path 

For the least cost approach, we need to grab the transition objects and perform a `geoCorrection` because we are using Longitude & Latitude based coordinates.

```{r}
tr <- transition( 1/cat_slope, transitionFunction = mean, directions = 4 )
tr <- geoCorrection( tr, type="c", multpl=FALSE, scl=FALSE)
```

The corrected transition object can be used to estimate the shortest paths among all points using the `costDistance()` function.

```{r}
eDist <- costDistance( tr, baja_pts )
eDist <- as.matrix( eDist )
rownames(eDist) <- colnames(eDist) <- coords$Stratum
eDist[1:10,1:10] 
```

This returns a `dist` object, which I translate into a matrix and you can plot the corresponding relationship with genetic distance (I'll use Nei's distance as the example).

```{r}
gDist <- genetic_distance(data,stratum="Population", mode="nei")
df <- data.frame( Genetic_Distance = gDist[lower.tri(gDist)],
                  Slope_Distance = eDist[ lower.tri(eDist)])
df <- df[ !is.infinite(df$Slope_Distance),]
ggplot(df,aes(x=Slope_Distance,y=Genetic_Distance)) + geom_point() + 
  stat_smooth(method=lm, formula = y ~ x) + xlab("Least Cost Distance (Slope Categories)") + ylab("Neis Distance")
```


From these data, we can estimate the correlation (a rough estimate).

```{r}
cor(df$Genetic_Distance, df$Slope_Distance)
```

As we see, it is not a very large correlation.  If we were to attempt to ascertain significance for these values, we would want to use a `mantel()` approach rather than one using a normal Pearson approximation for significance due to the lack of independence.  However, there is no real need to test significance here since the correlation is so small, even if this was significantly different than zero the biological consequences are minimal.


### Pairwise All Pairs 

For an all-pairs (circuit) approach, you can do it in R or use the [Circuitscape](http://www.circuitscape.org) software.  I'll use the latter approach and illustrate the process so that I can demonstrate how you could produce the required input file formats.  The input required for Circuitscape consist of a resistance raster and the coordinates of all the sites.  For the cost raster, you can write it using `writeRaster()`.  An easy input format for both this raster and the coordinates is that of an ASCII raster.  Here is how the cost distance is exported to files.

```{r eval=FALSE}
writeRaster( cat_slope, filename="CategoricalSlope.asc")
```

And the sites can be done as well in ASCII raster format as by using the `rasterize()` function turning a set of points into a raster with the same extent as the raster you passed.

```{r eval=FALSE}
sites <- resterize(baja_pts, cat_slope )
writeRaster(sites,filename="sites.asc")
```

These are input directly into the Circuitscape input. The output from the analysis is presented as a set of columns in a text file (row, col, resistance).  Using the same input data as above produced the following:

```{r}
data <- read.table("media/BajaParwise_resistances_3columns",header=FALSE, sep=" ")
ctDist <- matrix(0,nrow=34,ncol=34)
for( i in 1:nrow(data)){
  ctDist[ data$V1[i], data$V2[i] ] <- data$V3[i]
}
ctDist <- ctDist + t(ctDist)
ctDist[1:5,1:5]
```

Pairs of points that are unreachable are encoded as -1, I'll change that to `NA`.

```{r}
ctDist[ ctDist == -1 ] <- NA
ctDist[ is.infinite(ctDist)] <- NA
ctDist[1:5,1:5]
```

As before, we can plot these against each other.

```{r}
df1 <- data.frame( Genetic_Distance = gDist[ lower.tri(gDist) ],
                  Circuit_Distance = ctDist[ lower.tri(ctDist)])
df1 <- df1[ !is.na(df1$Circuit_Distance),]
ggplot(df1,aes(x=Circuit_Distance,y=Genetic_Distance)) + geom_point() + 
  stat_smooth(method=lm, formula = y ~ x) + xlab("Circuit Distance (Slope)") + ylab("Nei's Genetic Distance")
```

If you do these analyses in R using `gdistance`, you need to adjust the arguments to both `transition` and `geoCorrection` for using an all-paths approach (see the documentation on these functions) and you can get pairwise distances from the `commuteDistance()` function.

```{r}
cor( df1$Genetic_Distance, df1$Circuit_Distance )
```

#### Similarity in Outcome

A pertinent question revolves around the extent to which 

```{r}
data <- data.frame(Slope=df$Slope_Distance, Circuit=df1$Circuit_Distance )
ggplot(data,aes(x=Slope,y=Circuit)) + geom_point() + 
  stat_smooth() + xlab("Least Cost Distance") + ylab("Circuit Distance")
```


As a first pass, we can look at the magnitude of these distance estimates.

```{r}
cor.test(data$Slope, data$Circuit, method = "spearman")
```


<!--chapter:end:ecological_distance.rmd-->

# Niche Modeling {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_bird.jpg"></div>


```{r eval=FALSE}
install.packages(c("rJava","dismo"))
library("rJava")
library("dismo")
```

Some of the more common approaches used to characterize a niche is through the use of either logistic regression approaches or maximum entropy.  At present, the implementation of the latter appoach, dentoed as MaxEnt, is widely used and will be highlighted in this section.  For you to use this, you need to [Download](http://www.cs.princeton.edu/~schapire/maxent/) the java interface for MaxEnt.  You will have to sign in to download the files.  Use the *.zip* archive for this, it has all the files you need.  This is *not* the GUI interface, it is only the underlying java machinery the GUI uses.  On some platforms, you may *also* need to download the *legacy* Java runtime (as of the writing of this it was SE 6) for your computer to run the analysis.  The underlying analyses are done in Java and only accessable to R using this approach.

Unzip those files and put each of them in the directory defined by the following command:

```{r chunk-niche}
system.file("java", package="dismo")
```

This puts the java executables into a directory within `dismo`.  This allows the `dismo` functions to use the latest version of maxent directly.  To get started, allow the java executables to access up to 1Gig of your RAM by setting the following option *prior* to loading in `dismo`

```{r message=FALSE, warning=FALSE}
options(java.parameters = "-Xmx1g" )
library(dismo)
```

Now we can get to work.  In this example, I'll use biolayers found to be informative for the distribution of *Euphorbia lomelii* (Euphrobiaceae), the host plant for *Arapatus attenuatus* that we collected from both observations and all available herbarium specimen records that had sufficient spatial specificity.

I load these in using the `raster()` function.  These rasters have already been cropped to the extent enclosing all the arapat populations (as in \@ref(cropping-rasters)).  I have these stored in a subdirectory called 'spatial_data' and load them in as a *RasterStack* object (a set of rasters).

```{r message=FALSE, warning=FALSE}
library(raster)
files <- c( "bio2.tif", "bio7.tif", "bio8.tif", "bio13.tif", 
            "bio15.tif", "bio17.tif")
files <- paste("./spatial_data",files,sep="/")
bio_layers <- stack(files)
```

Next, I load in the spatial coordinates of the recorded sites of known occurrence.  I set aside 20% of the observed sites to use as a test of the model we derive.

```{r}
sites <- read.csv("./spatial_data/EuphorbiaAllLocations.csv",header=TRUE)
num_train <- round(nrow(sites)*.8)
idx <- sample( 1:nrow(sites), size=nrow(sites), replace=FALSE )
pts.train <- sites[ idx[1:num_train] , 2:3]
pts.test <- sites[ idx[ (num_train+1):length(idx)], 2:3]
c( Train=nrow(pts.train), Test=nrow(pts.test) )
```

The maxent model itself is used determine the extent to which site-specific factors may be able to predict the presence of this species.  

```{r cache=TRUE}
fit <- maxent(bio_layers,pts.train)
```

The *html* output from the analysis can be viewed in your default browser by showing the model.

```{r eval=FALSE}
fit
```

Which produces output that looks something like the following inset.

<iframe width="100%" height="480px" src="media/maxent_euphorbia.htm" frameborder="1" allowfullscreen></iframe>

Here are some quick ways to view and interpret output from the maxent approach.  

```{r}
plot(fit)
```


```{r message=FALSE}
response(fit)
```


```{r}
r <- predict( fit, bio_layers )
plot(r, xlab="Longitude",ylab="Latitude")
points( pts.train, pch=3, cex=0.75)
points( pts.train, pch=16, cex=0.75)
```


```{r}
pts.random <- randomPoints(bio_layers, 1000)
fit.eval <- evaluate(fit, p=pts.test, a=pts.random, x=bio_layers)
fit.eval
```


```{r}
vals.pred <- data.frame( extract( bio_layers, pts.test) )
vals.rand <- data.frame( extract( bio_layers, pts.random) )
fit.eval_rnd <- evaluate(fit, p=vals.pred, a=vals.rand)
fit.eval_rnd
```

```{r}
library(ggplot2)

df <- data.frame( Val=NA, BioLayer=NA, Category=NA )

layers <- names(vals.pred)
for( layer in layers){
  Val <- c( vals.pred[[layer]], vals.rand[[layer]] )
  Category <- c( rep("Observed",nrow(vals.pred)), rep("Background",nrow(vals.rand)))
  df <- rbind( df, data.frame( Val, BioLayer=layer, Category))
}
df$BioLayer <- factor( df$BioLayer, ordered=TRUE, 
                       levels = names(vals.pred)[c(1,2,5,6,3,4)])
                                                                              

df$Category <- factor( df$Category )
df <- df[ !is.na(df$Val),]
p <- ggplot(df,aes(x=Val, fill=Category)) + geom_density(alpha=0.75)  
p <- p + facet_wrap(~BioLayer, nrow=3, scale="free") 
p <- p + scale_fill_brewer(type="qual",palette=3) 
p + xlab("Biolayer Value") + ylab("Density") 
```


<!--chapter:end:niche_modeling.rmd-->

# Eigenvector Maps {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_pond.jpg"></div>

There are several kinds of 'spatial structure' that may hinder our ability to use statistical models to uncover evolutionary processes—one of which is autocorrelation.  As an example if this phenomenon, consider the *Cornus florida* Cornaceae data set included in the `gstudio` package.  In this temperate understory tree species, seeds are dispersed in the vicinity of the maternal individual but pollen may be disbursed widely across the landscape.  Functionally, this means that if we look at genetic similarity of individuals as a function of inter-individual distance, there will be a positive correlation between genetic and physical distances at close proximity because they are more likely to share the same maternal individual.    


```{r dogwood_coords, fig.cap="Spatial distribution of adult <i>Cornus florida</i> trees on the landscape."}
library(gstudio)
library(ggplot2)
cornus <- read_population("./spatial_data/Cornus.csv",type="column", locus.columns = 5:14)
cornus_plot <- ggplot(cornus, aes(x=X.Coordinate, y=Y.Coordinate)) + 
  geom_point(alpha=0.5) + coord_equal() + 
  xlab("X Coordiante") + ylab("Y Coordinate") + 
  theme_bw() + coord_equal()
cornus_plot
```

This is a common feature in plant populations.  Limitations to seed dispersal are generally more severe than spatial limitations in pollen dispersal.  As such, proximate individuals are positively correlated due to potential co-ancestry.  If the dispersal processes are reversed, this is not so much of an issue.  Pollen is from near neighbors but seeds are randomly distributed---the spatial location of individuals is independent of genetic correlations.  As such, pollen and seed have different contributions to spatial structure, limitations in pollen dispersal can only enhance structure caused by limitations in seed dispersal.  So for the individuals in the map how does this manifest?  We can estimate inter-individual distances for both physical and genetic separation (n.b., I use inter-individual AMOVA distance here) and then look for any relationship.

```{r cache=TRUE}
coords <- strata_coordinates(cornus, stratum = "SampleID", longitude = "X.Coordinate", latitude="Y.Coordinate")
P <- strata_distance(coords, mode="Euclidean")
G <- genetic_distance(cornus, mode="AMOVA")
```

We can test for positive spatial autocorrelation.  Here I use the approach of Smouse & Peakall (1999) that fits within the normal AMOVA framework.  Normally, we examine the relationship between physical and genetic distance as in a spatial autocorrelation process.  Overall, a trend suggests a limitation in overall combined dispersal.   

```{r}
df <- data.frame( Physical=P[lower.tri(P)], Genetic=G[lower.tri(G)])
ggplot( df, aes(Physical,Genetic)) + geom_point() + stat_smooth(method="gam")
```

Here the relationship between physical and genetic distances are taken with respect to all the data across all the distances. In this case the overall correlation (if that is even appropriate) is $\rho =$ `r cor(df$Physical,df$Genetic)`.  However, we expect the relationship to be asymptotic in that at certain distances there should be some relationship and at larger distances it should be roughly random.  Here are the data separated by just up to 25 units.

```{r}
ggplot( df[ df$Physical < 25,], aes(Physical,Genetic)) + geom_point() + stat_smooth(method="loess") + geom_jitter()
```

which gives a different overall correlation ( $\rho =$ `r cor( df$Physical[ df$Physical < 25], df$Genetic[ df$Physical < 25])`).  Quite different!  

To proceed, we must thus define the spatial bins in which we can categorize individuals and estimate genetic correlations.  We can test the significance of the estimator by permuting individuals across distance classes and re-estimating the parameter.  These data can be plot as:

```{r cache=TRUE}
df <- genetic_autocorrelation(P,G,bins=seq(0,1000,by=100),perms=999)
df$Significant <- df$P <= 0.05
ggplot( df, aes(x=To,y=R)) + geom_line() + 
  geom_point( aes(color=Significant), size=4) + 
  geom_abline(slope=0,intercept=0, linetype=2) +
  xlab("Physical Separation") + ylab("Genetic Correlation")
```

Showing that for the first three distance lags, there is a significantly positive (at $\alpha=0.05$) correlation between individuals.  After that distance, it levels out and we could consider it to be random.  The parameter estimated for each distance class is a true correlation and is defined as:

\[
r^h = \frac{\sum_{i\ne j}^K x^h_{ij}c^h_{ij}}{\sum_{i = 1}^K x^h_{ii}c^h_{ii}}
\]

But what if we want to get a hold of this spatial component in our data and perhaps even extract it out so it does not bias our estimates? 

What happens if we change the size of our `bins`?  

```{r fourway-autocorrelation-plot, cache=TRUE, fig.cap="Spatial autocorrelation in the <i>Cornus</i> data set using different bin sizes."}
library(cowplot)
df1 <- genetic_autocorrelation(P,G,bins=seq(0,1000,by=25),perms=999)
df2 <- genetic_autocorrelation(P,G,bins=seq(0,1000,by=50),perms=999)
df4 <- genetic_autocorrelation(P,G,bins=seq(0,1000,by=200),perms=999)
df1$Significant <- df1$P < 0.05
df2$Significant <- df2$P < 0.05
df4$Significant <- df4$P < 0.05
p1 <- ggplot( df1, aes(x=To,y=R)) + geom_line() + 
  geom_point( aes(color=Significant), size=4) + 
  geom_abline(slope=0,intercept=0, linetype=2) +
  ylab("Genetic Correlation") + xlab("") + 
  theme(legend.position="none") + xlim(0,1000)
p2 <- ggplot( df2, aes(x=To,y=R)) + geom_line() + 
  geom_point( aes(color=Significant), size=4) + 
  geom_abline(slope=0,intercept=0, linetype=2) +
  xlab("") + ylab("") + 
  theme(legend.position="none") + xlim(0,1000)
p3 <- ggplot( df, aes(x=To,y=R)) + geom_line() + 
  geom_point( aes(color=Significant), size=4) + 
  geom_abline(slope=0,intercept=0, linetype=2) +
  xlab("Physical Separation") + ylab("Genetic Correlation") + 
  theme(legend.position="none") + xlim(0,1000)
p4 <- ggplot( df4, aes(x=To,y=R)) + geom_line() + 
  geom_point( aes(color=Significant), size=4) + 
  geom_abline(slope=0,intercept=0, linetype=2) +
  xlab("Physical Separation") + ylab("")+ 
  theme(legend.position="none") + xlim(0,1000)
plot_grid(p1,p2,p3,p4,ncol=2)
```

So what does this mean?  Can there be many different spatial patterns nested within our populations?  How can we evaluate which spatial scale has some kind of genetic autocorrelation?  And more importantly, how can we regress it out of our data so we can get to asking the real questions we are trying to ask?

## Eigenvectors & Eigenvalues

Both eigenvectors and their matched eigenvalues are somewhat mysterious mathematical creations that we ran across already in the section on ordination.  The general eigen equation is given as:

$$
\mathbf{A} \vec{v_i} =  \lambda_i \vec{v_i}
$$

where $\mathbf{A}$ is a matrix, $\vec{v}_i$ is the $i^{th}$ eigenvector and is associated directly with the $i^{th}$ eigenvalue $\lambda_i$.  For any matrix, the number of corresponding eigenvectors and associated eigenvalues will be equal to the minimum dimensionality of $\mathbf{A}$.  In most cases, we will be using symmetric matrices so our row and column count will be equal.  However, we will have at most $det(\mathbf{A})$ non-imaginary eigenvalues associated with any matrix.  

One way to think about eigenvalue/eigenvector pairs is via its use in spectral matrix decomposition.  Just like any number or polynomial equation, a matric can be factored into an additive set of matrices.  These matrices can be further defined as an eigenvector product scaled by the eigenvalue.  For example, the matrix $\mathbf{A}$ can be decomposed into:

$$
\begin{aligned}
\mathbf{A} &= \sum_{i=1}^\ell \mathbf{B_i} \\
&= \sum_{i=1}^\ell \vec{e_i}\vec{e_i}^\prime \lambda_i
\end{aligned}
$$

where the vector $\vec{e_i}$ (and its transpose-the one with the prime on it) create the matrix $\mathbf{B_i}$ after the scaling factor $\lambda_i$.  This decomposition is esstentially what we are doing and the meaning of the $\vec{e}$ and associated $\lambda$ values depend upon how the original matrix $\mathbf{A}$ is constructed.

For distance matrices, we can extract certain kinds of eigenvalue/eigenvector pairs that have interesting interpretations and aid in our understanding of latent spatial structure.  The next few sections show a few different ways we can determine the scales at which spatial structure may exist as well as how to correct for it.  Before we go into it, we should clarify a few definitions.  What we call the decomposition of data using eigenvalue/eigenvector has several different names, depending upon what kind of data you are starting from.  If we are to start from a raw set of data, say $N$ observartions with $p$ variables on each observation, say the coordiante data from the dogwood trees above:

```{r}
X = as.matrix( coords[,2:3])
dim(X)
```

decomposing this in into eigenvalue and eigenvector pairs is called *Principal Components Analysis* (PCA).  

```{r}
fit <- princomp(X)
fit
```

You can see that we get as many axes as the minimum of independnet rows or columns.  Here we have `r nrow(coords)` observations and two coordinates leaving us two principal components (denoted as `Comp.1` and `Comp.2` in the output above).  The `princomp()` rotation is essentially a method to take your original data in 2, 3, or as many dimensions as you have, and create a new synthetic set of data vectors based upon linear combinations of the original ones.  These new variables are derived such that they maximize the amount of variation in the data along sucessive axes---the first axis is constructed such that the variation of data along it is the largest, the second synthetic axis is has the second largest amount of vatiation while being at a right angle to the first one, etc.  Each of these orthoginal axes are ordered in decreasing magnitude of how much of the vatiation they describe in the original data.  This is depicted below in the image showing a rotation of the original data onto new synthetic axes.

```{r pca_rotation, echo=FALSE, fig.cap="Principal component rotation of bivariate data mapped from original axes (black X- and Y- axes) onto new coordinate axes (dotted red) that maximize the variation on each of the new axes (displayed as density insets for original and tranlated axes)."}
knitr::include_graphics("./media/PCA_Rotation.png")
```

The original data for each observation, $\vec{x_i}$, is translated into the new coordinate space, $\vec{\hat{x}_i}$ by multiplying it by the first eigenvector.

$$
\vec{\hat{x}} = \vec{e}\vec{x_i}
$$

To see this rotation, we can 

```{r message=FALSE, error=FALSE, fig.cap="The (A) Original, and (B) rotated coordiantes for the <i>Cornus<i> data set."}
library(cowplot)
cornus_rot <- data.frame(predict(fit));
cornus_rot_plot <- ggplot(cornus_rot, aes(x=Comp.1, y=Comp.2)) + 
  geom_point(alpha=0.5) + coord_equal() + 
  xlab("X Coordiante") + ylab("Y Coordinate") + 
  theme_bw() + coord_equal()
plot_grid( cornus_plot, cornus_rot_plot, rel_widths = c(2,3), labels=c("A","B"))
```

If we work on a derivative of the original data, say we take the coordinates and measure the distances between them (the `P` matrix from the code above), *and then* decompose it into eigenvalue/eigenvector pairs, it is called a *Principal Coordinate Analysis*. 

```{r}
pca_P <- princomp(P)
cat("PC Axes: ",length(pca_P$sdev))
```

Both the *Component* and *Coordiante* approaches do the same thing--decompose the matrix into a set of eigenvector/eigenvalue pairs--they just start with different original data.  With this out of the way, we can start playing around with the 


## Principal Coordinates of Neighborhood Matrices.

The distances between all the trees shown in the map above provide a common inter-individual distance matrix, $\mathbf{P}$.  Borcard & Legendre (2002) provided one approach for uncovering spatial patterning in a response data set (say $\mathbf{D}$ above) based upon a manipulation of the inter-locale distance matrix ($\mathbf{P}$) called Principal Coordinates of Neighborhood Matrices (PCNM).  What this approach does is take the spatial coordinates of indiviudals and decomposes the pair-wise distance matrix using a Principal Coordinate decomposition.  

1. Your data are defined as: $Y = f(X)$, along one dimension.  In this case, we have $N$ measured observations with a single response ($\vec{y}$) and $N$ corresponding predictor (spatial) variables ($\vec{x}$).
2. We define a pairwise inter-neighbor distance matrix of the predictors, $\mathbf{D}$.  This can be Euclidean, great-circle, or any other distance metric.
3. From this neighbor matrix, we set a maximum distance, $d_{max}$.  This distance is the size of the neighborhood we are interested in investigating.
4. In $\mathbf{D}$, set all values of the inter-observation distance that are equal to or greater than $d_{max}$ equal to $4*d_{max}$.  All values of $md_{max};\; m >= 4$ are suggested to be asymptotically similar in the results that follow.
5. Take a principal coordinate rotation on this matrix and retrain the eigenvectors whose eigenvalues are non-zero, (e.g., $\lambda_i > 0$).  These constitute your new set of predictor variables for subsequent analysis via regression, canonical correlation, redundancy, etc.

In this example, I'm going to use the dogwood data and look for spatial trends.  In this example, I'm just going to take the largest axis of varaition among individual genotypes (e.g., turn genotypes into a multivariate vector then do a PCA rotation on the genotypes and keep the first, most important, variable as a predictor).

```{r}
cornus_mv <- to_mv( cornus )
cornus_rot <- princomp(cornus_mv)
new_cornus_genetic <- predict(cornus_rot)
qplot(new_cornus_genetic[,1], new_cornus_genetic[,2], geom = "point") + xlab("Synthetic X1") + ylab("Synthetic X2") + theme_bw()
```

Next, we can take the spatial coordiantes and distance matrix that we derived previously and decompose it via a Principal Coordinate Rotation (which we already did).  The PCNM approach is encoded in the vegan library but to use it, we need to define a distance threshold first.  If we look at the distribution of spatial autocorrelation above, perhaps something around $250$ would be appropraite.

```{r message=FALSE, warning=FALSE}
library(vegan)
threshold <- 200
pred_pcnm <- pcnm(P,threshold)
```

We can visualize these new predictors by plotting the synthetic values in 2-space.  

```{r}
ordisplom( pred_pcnm, choices=1:4)
```

Then we can try to build up a multiple regression model, using these new PCNM axes as predictors and the first genetic structure variable as the response.

```{r}
resp <- new_cornus_genetic
min_resp <- min( resp )
resp <- resp + ceiling( abs(min_resp))
pred <- scores(pred_pcnm)
ord <- cca( resp ~ pred )
multiscale_ord <- mso(ord,coords[,2:3],grain = 5)
msoplot(multiscale_ord,xlim=c(0,200))
```


<!--chapter:end:eigenmaps.rmd-->

# (PART) Selection {-}

<!--chapter:end:part_selection.rmd-->

# Classical Selection Models {.imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_red_fish.jpg"></div>


> Selection is an evoluationary process that differentiates individuals.  It is the most fundamental mechanism of evoluation and in this exercise you will explore how to estimate the effects, at the level of the population, how allele and genotype frequencies change under various selection regimes.

Selection is a foundational process in population genetics and evolutionary biology.  It is based upon one of the few Theorem's we have in this discipline.  In this section, we will focus on developing the theory underlying single locus selection models and provide a general framework on which we can use in a variety of scenarios.

As a starting point, lets consider assume that this hypothetical species has:

- Non-overlapping generations,
- No inbreeding ($F=0$),
- An infinite population size (e.g., $N=\infty$), and
- No mutation ($\mu=0$)

For this hypothetical species, individuals differ in fertility and/or survival.  For simplicity, consider the life history represented below.

```{r echo=FALSE}
knitr::include_graphics("media/life_cycle.png")
```

To quantify this example, lets add parameters to differential success based upon genotypes as $m_{XY}$ for fertility and $\ell_{XY}$ for survival.  In our original nomenclature, we now have genotype frequencies as:

Genotype | Count    | Survival    | Fertility  
---------|:--------:|:-----------:|:-----------:
$P$      | $p^2N_t$ | $\ell_{AA}$ | $m_{AA}$  
$Q$      | $2pqN_t$ | $\ell_{AB}$ | $m_{AB}$  
$R$      | $q^2N_t$ | $\ell_{BB}$ | $m_{BB}$   


From these data, we can create the expectation for allele frequencies in the next generation.  For the **A** allele, we have:

\[
f(A) = \frac{N_t\left(2\ell_{AA}m_{AA}p^2 + m_{AB}\ell_{AB}2pq\right)}{N_t\left(2\ell_{AA}m_{AA}p^2 + m_{AB}\ell_{AB}2pq + 2\ell_{BB}m_{BB}q^2\right)}
\]

For simplicity, lets define the $\ell_{XY}m_{XY}$ as a parameter

\begin{aligned}
W_{AA} &= \ell_{AA}m_{AA} \\
W_{AB} &= \ell_{AB}m_{AB} \\
W_{BB} &= \ell_{BB}m_{BB}
\end{aligned}

which we will call the *Absolute Selective Value* (this makes our equations much easier to write without error), simplifying the above to:

\[
f(A) = \frac{p_t(p_tW_{AA} + q_tW_{AB})}{p_t^2W_{AA} + 2p_tq_tW_{AB} + q_t^2W_{BB}}
\]

The magnitude of these values depends upon the $\ell$ and $m$ parameters and can vary considerably across different taxa.  We can standardize these values by dividing by the maximum fertitlity, turning these into *Relative Selective Values* which are defined as:

Relative Selective Values | Absolute Value 
:------------------------:|:---------------------------------------------:
$w_{AA}$                  | $\frac{\ell_{AA}m_{AA}}{max(\ell_{XX}m_{XX})}$  
$w_{Aa}$                  | $\frac{\ell_{Aa}m_{Aa}}{max(\ell_{XX}m_{XX})}$  
$w_{aa}$                  | $\frac{\ell_{aa}m_{aa}}{max(\ell_{XX}m_{XX})}$  


For each genotype, we can break this down a bit more.  

Category              |       AA                    |          AB                 |       BB                    | Total
----------------------|:---------------------------:|:---------------------------:|:---------------------------:|:-------:
Relative Fitness      | $w_{AA}$                    | $w_{AB}$                    | $w_{BB}$                    | - 
Frequency             | $p^2$                       | $2pq$                       | $q^2$                       | 1.0
Weighted Contribution | $p^2w_{AA}$                 | $2pqw_{AB}$                 | $q^2w_{BB}$                 | $\bar{w}$
Frequency After       | $\frac{p^2w_{AA}}{\bar{w}}$ | $\frac{2pqw_{AB}}{\bar{w}}$ | $\frac{q^2w_{BB}}{\bar{w}}$ | 1.0

Notice here the row sums.  The sum of genotype frequencies is 1.0 (it better be), as is that for frequency after selection.  However, the sum of the weighted contribution is defined as $\bar{w}$, or *average fitness* of the population.  This standardizes the individuals genotypic contributions and can be used to estimate allele frequencies for the next generation as:

\[
p_{t+1} = \frac{p_t(p_tw_{AA}+q_tw_{AB})}{\bar{w}} 
\]

Lets use this relationship and define a function in \R that allows us to explore the relationship between the relative fitness values (the $w_{xx}$ values) and how allele frequencies change.  This function is going to be a pretty easy thing, essentially iterating the stuff through the equations above and iterate across generations.

So, instead of doing a bunch of comments, I'm going to just outline what needs to be done in order and let you look at the code and see how it is implemented.  Essentially, we need to write a function that:  

1. Takes $p$, the $w_{xx}$ values, and an estimate of how long to run (I'll call it $T$).
2. Iterate across $T$ generations.  For each generation we will:
  - Estimate $\bar{w}$, the mean population fitness
  - Estimate frequency for the next generation
3. Return the estimated frequencies.

In the code, it looks like:

```{r}
iterate_selection <- function(p=0.5, wAA=1, wAB=1, wBB=1, T=100) {
  freq <- rep(p,T)
  for(i in 2:T){
    pt <- freq[(i-1)]
    qt <- 1-pt
    wbar <- pt^2*wAA + 2*pt*qt*wAB + qt^2*wBB
    top <-  pt*(pt*wAA + qt*wAB)
    freq[i] <- top/wbar
  }
  return(freq)
}
```


This should work.  Let's test it. You should always try to see if your functions give you values you know are true, if it cannot then it is a poor function.  Here are some situations.

If I put no parameters, it should use the defaults and return 100 generations of 0.5 (all selective values are equal and $p = q$).

```{r}
iterate_selection()
```

It should be stable for other values of $p$ (I'll do 20 generations for brevity):

```{r}
iterate_selection(p=0.42,T=20)
```


If I make selection against one of the alleles, I should have fixation for the others.  Here I will select against the $A$ allele:


```{r}
iterate_selection(wAA=0,wAB=0,T=10)
```

OK, so it appears to be giving us stuff that what we expect.  From here it would be easy to explore a variety of different selection regimes.  Lets start with a simple example.Consider a single locus that contributes to variability in both fertility and survival.  Here are some parameters.  

- Individuals with the homozygote `AA` genotype average 100 offspring, twice as much as that for the heterozygote (`AB`) and four times as much as individuals with the `BB` genotype.  
- Offspring with the `AA` genotype have an observed survival of 75%, those with the `AB` genotype have 50%, and those with the `BB` have 20%.

Under this regime, the relative values are:

Genotype | $\ell_{XX}$ | $m_{XX}$ | $W_{XX}$ | $w_{XX}$ 
---------|:-----------:|:--------:|:--------:|:---------------
`AA`     | $0.75$      | $100$    | $75$     | $1.0$
`AB`     | $0.50$      | $50$     | $25$     | $\frac{1}{3}$
`BB`     | $0.20$      | $25$     | $5$      | $\frac{1}{15}$

Through time, this would change allele frequencies (assuming they start out at $p=q$) as:

```{r message=FALSE, warning=FALSE}
library(ggplot2)
df <- data.frame( Generation=1:10, p=iterate_selection(wAB=1/3, wBB=1/15, T=10))
ggplot( df, aes(x=Generation, y=p)) + geom_line() + geom_point() + ylim(c(0,1))
```

So, through time, we see a relatively rapid change in allele frequencies.  This is evolution, *sensu stricto*.  To parameterize the amount of change in allele frequencies we can take the equation for $p_{t+1}$ and $p_t$ and solve throwing away the subscripts of which generation is which:

\begin{aligned}
\delta p &= p_1 - p_0 \\
        &= \frac{p^2w_{AA} + pqw_{AB}}{\bar{w}} \\
        &= \frac{p^2w_{AA} + pqw_{AB}}{\bar{w}} - p_0\\
        &= \frac{p^2w_{AA} + pqw_{AB}}{\bar{w}} - p_0\frac{\bar{w}}{\bar{w}}\\
        &= \frac{p^2w_{AA} + pqw_{AB} + p_0(q^2w_{AA} + 2pqw_{AB} + q^2w_{BB})}{\bar{w}}\\
        &\cdots\;several\;simplifications\;\cdots\\
        &= \frac{pq[p(w_{AA} - w_{AB}) - q(w_{BB} - w_{AB})]}{\bar{w}}
\end{aligned}


## The Strength of Selection

The strength of selection can be estimated from looking at the relative allele frequencies for populations before and after selection.  In the lecture, we covered an example based upon resistance to prion disease. There we quantifed genotypes frequencies as $P$, $Q$, and $R$ for those before selection and $\hat{P}$, $\hat{Q}$ and $\hat{R}$ for genotype frequencies after selection.  Their relationships are given by:

\begin{aligned}
\hat{P} &=  P\; \ell_{AA}\; m_{AA} \\
\hat{Q} &=  Q\; \ell_{AB}\; m_{AB} \\
\hat{R} &=  R\; \ell_{BB}\; m_{BB} \\
\end{aligned}

If we take the genotype frequencies and divide the homozygotes by the heterozygotes and reorganize the relationship, we get:

\begin{aligned}
  \hat{\ell}_{AA}\hat{m}_{AA} = & \frac{\hat{P}Q}{\hat{Q}P}\\
	\hat{\ell}_{BB}\hat{m}_{BB} = & \frac{\hat{R}Q}{\hat{Q}R}\\
\end{aligned}

And from this we can estimate fertility and survival values by genotype.  In the lecture, we assumed that fertility, the $m_{xx}$ parameters, were not changed so we dropped them (or at least thought they were all equal to 1.0).  

This is sufficient information for you to estimate the consequences of selection from real data.


<!--chapter:end:selection.rmd-->

# (PART) Backmatter {-}

<!--chapter:end:part_backmatter.rmd-->

# References {- .hanging}

Bivand RS, Pebesma EJ, G\'omez-Rubio V.  2008.  *Applied Spatial Data Analysis with R*  Springer, New York.

Bray, J. R. and J. T. Curtis. 1957. An ordination of upland forest communities of southern Wisconsin. Ecological *Monographs*, **27**, 325-349

Borcard D, Legendre P.  2002.  All-scale spatial analysis of ecological data by means of principal coordinates of neighbour matrices.  *Ecological Modelling*, **153**, 51-68.

Cavalli-Sforza LL and Edwards AWF. 1967. Phylogenetic analysis: models and estimation procedures. *American Journal of Human Genetics*, **19**, 233-257.

Dyer RJ, Nason JD. 2004. Population Graphs: the graph theoretic shape of genetic structure. *Molecular Ecology*, **13** 1713-1727.

Dyer RJ, Westfall RD, Sork VL, Smouse PE. 2004. Two-generation analysis of pollen flow across a landscape V: a stepwise approach for extracting factors contributing to pollen structure. *Heredity*, **92**, 204-211.

Dyer RJ, Nason JD, Garrick RC. 2010. Landscape modeling of gene flow: improved power using conditional genetic distance derived from the topology of population networks. *Molecular Ecology*, **19**, 3746-3759.

Excoffier, L., Smouse, P.E., and Quattro, J.M. 1992. Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. *Genetics*, **131**, 479-491.

Garrick RC, Nason JD, Fernandez-Manjaress JFF, Dyer RJ. In Press. Ecological co-associations influence species' responses to past climatic change: an example from a Sonoran Desert bark beetle. *Molecular Ecology*

Legendre P, Fortin MJ.  2010.  Comparison of the Mantel test and alternative approaches for detecting complex multivariate relationships in the spatial analysis of genetic data. *Molecular Ecology Resrouces*, **10**, 831-844.

McRae, B.H. 2006. Isolation by resistance. *Evolution*, **60**, 1551-1561.

Nei, M. 1978. Estimation of average heterozygosity and genetic distance from a small number of individuals. *Genetics*, **76**, 379-390.

Smouse PE and Peakall R. 1999. Spatial autocorrelation analysis of multi-allele and multi-locus genetic microstructure. *Heredity*, **82**, 561-573. 

Smouse PE, Dyer RJ, Westfall RD, Sork VL. 2001. Two-generation analysis of pollen flow across a landscape I. Male gamete heterogeneity among females. *Evolution*, **55**, 260-271.

Weir BS, Cockerham CC.  1984.  Estimating F-Statistics for the Analysis of Population Structure.  *Evolution*, **38**, 1358-1370.

Wright, S. 1978. Evolution and the Genetics of Populations, *Vol. 4: Variability Within and Among Natural Populations*. University of Chicago Press, Chicago.


<!--chapter:end:references.rmd-->

# Included Data Sets

The content of this book includes several data sets that come with the `gstudio` and `popgraph` libraries.  Here is a brief overview of those data sets with references to the manuscripts from which they were analyzed.


## The Sonoran desert bark beetle, *Araptus attenuata*

```{r echo=FALSE}
knitr::include_graphics("media/araptus.png")
```

*Arapatus attenuatus* is a bark beetle endemic to the Sonoran desert.  It is known only from the senescing stems of the desert plant, *Euphorbia lomelii* (syn *Pedilanthus macrocarpus*).  Samples for this data set were collected from 39 populations throughout the species range.

```{r message=FALSE, echo=FALSE, warning=FALSE, fig.cap="Sampling locales for the `arapat` data set."}
library(ggplot2)
library(ggmap)
library(gstudio)
library(ggrepel)
data(arapat)
coords <- strata_coordinates(arapat)
map <- population_map(coords)
ggmap(map) + geom_point(aes(x=Longitude,y=Latitude), data=coords) + xlab("Longitude") + ylab("Latitude") + geom_text_repel(aes(x=Longitude, y=Latitude, label=Stratum), data=coords) 
```


### Nuclear Markers

The markers in the `arapat` data set consist of eight polymorphic and codominant nuclear markers. 

```{r message=FALSE}
library(gstudio)
data(arapat)
column_class( arapat, "locus") 
```

These markers span a range of allelic diversity and richness.

```{r echo=FALSE}
df <- merge( genetic_diversity(arapat,mode="A"), genetic_diversity(arapat,mode="He"))
df <- merge( df, genetic_diversity(arapat, mode="Ae"))
knitr::kable(df,align=c("l","c","c","c"),digits=3)
```


### Methylation Markers

In addition to codominant nuclear markers, an msAFLP analysis was performed on the major clade and paired sequence and $C_pG$ methylation markers were derived.


## The Flowering Dogwood *Cornus florida*

```{r echo=FALSE, fig.cap="Inflorescence for flowering dogwood with conspicuous showy bracts and many small flowers in the center."}
knitr::include_graphics("media/dogwood_flower.png",)
```


Flowering dogwood is an ubiquitous feature of eastern north american forests. 

The data set contains five microsatellite loci.

```{r}
data(cornus)
column_class( cornus, "locus") 
```

of roughly equal genetic diversity.

```{r echo=FALSE}
df <- merge( genetic_diversity(cornus,mode="A"), genetic_diversity(cornus,mode="He"))
df <- merge( df, genetic_diversity(cornus, mode="Ae"))
knitr::kable(df,align=c("l","c","c","c"),digits=3)
```


<!--chapter:end:data_sets.rmd-->

# About the Author {- .imageChapter}

<div class="chapter_image"><img src="chapter_images/ch_author.jpg"></div>

Rodney Dyer is an Associate Professor at Virginia Commonwealth University.  When not writing code and developing new analytical approaches for understanding the spatial distribution of genetic variation, he enjoys fishing, fermentation sciences, and spending time with his family outdoors.


<!--chapter:end:about_the_author.rmd-->