#Exploring Patterns of Population Structure and Environmental Associations to Aridity Across the Range of Loblolly Pine

##Introduction

In this set of analyses, we will be making use of data from the Eckert et al. 2010 paper to explore patterns of phenotypic and environmental associations among populations of loblolly pine.


###Abstract

Natural populations of forest trees exhibit striking phenotypic adaptations to diverse environmental
gradients, thereby making them appealing subjects for the study of genes underlying ecologically relevant phenotypes. Here, we use a genome-wide data set of single nucleotide polymorphisms genotyped across 3059 functional genes to study patterns of population structure and identify loci associated with aridity across the natural range of loblolly pine (Pinus taeda L.). Overall patterns of population structure, as inferred using principal components and Bayesian cluster analyses, were consistent with three genetic clusters likely resulting from expansions out of Pleistocene refugia located in Mexico and Florida. A novel application of association analysis, which removes the confounding effects of shared ancestry on correlations between genetic and environmental variation, identified five loci correlated with aridity. These loci were primarily involved with abiotic stress response to temperature and drought. A unique set of 24 loci was identified as FST outliers on the basis of the genetic clusters identified previously and after accounting for expansions out of Pleistocene refugia. These loci were involved with a diversity of physiological processes. Identification of nonoverlapping sets of loci highlights the fundamental differences implicit in the use of either method and suggests a pluralistic, yet complementary, approach to the identification of genes underlying ecologically relevant phenotypes.


##Overview of tasks

In general, you will work your way from loading and saving the data for this study, to correcting for population structure, to looking for associations between genotypes and phenotypes, between genotypes and the environment (`Bayenv2`), and among genotypes, phenotypes, and the environment (`SQUAT`).

## This notebook

The goal of this notebook is to get set up and explore some basic patterns in the data.  Many of the steps also include a sanity check: a quick view of the data via the `head()` method of a `pandas` `DataFrame`.

As you work your way through these notebooks, please keep in mind that there are several things that are going on:

1. You are getting exposed to an amazing technology through the use of the IPython Notebook.
1. You are learning a little bit of Python.
1. You are getting familiar with the ways in which I actually do my analysis.
1. You are taking with you about 75% of the processes needed to publish GWAS results on your own.

####Let's begin

At the top of every notebook, there is a cell of code which sets up some basic functionality.  At this point, it's sufficient to just run the cell and move on.  If you want to know what things do, please ask.  

In [0]:
import os, sys
from IPython.display import Image
import pandas as pd
from __future__ import division
import numpy as np
import rpy2
from rpy2 import robjects as ro
import pandas.rpy.common as com
import matplotlib.pyplot as plt
import seaborn as sns
import operator
import scipy as sp
import traceback
from sklearn import preprocessing
from IPython.parallel import Client
from subprocess import Popen, PIPE
import shutil
from IPython.display import FileLink, FileLinks, Image
import psutil
import multiprocessing
from hdfstorehelper import HDFStoreHelper
import warnings
import pandas
import dill
warnings.simplefilter("ignore", pandas.io.pytables.PerformanceWarning)
%matplotlib inline

%load_ext rpy2.ipython
pd.set_option('display.width', 80)
pd.set_option('max.columns', 30)

%load_ext autoreload
%autoreload 2

sns.set_context("talk")

####Execute the cell below to load in the phenotype data

In [0]:
pheno = pd.read_excel("/gdc_home5/groups/congenomics/day5/landscape_genetics_data/Pinus_taeda_metabolite_data.xlsx", 
                      sheetname="metabolite_phenotype_data",
                      header=2)

####Let's look at all the traits in the file that contain the phrase, 'ose' (e.g., glucose, sucrose, etc.)

In [0]:
traits = [x for x in pheno.columns if "ose" in str(x)]
print traits

####In the output of the cell above, you should see the name of your new favorite metabolite.  Pick your favorite one, and set its value in the cell below.

For example, to set the `trait_name` variable to `maltose`, you could do one of the following:

* `trait_name = "maltose"`
* `trait_name = traits[4]`

After you set `trait_name` to your favorite one, execute the cell.  The cell also dumps `trait_name` to a file so that we can reuse it in later notebooks, which, remember, are separate Python processes.

In [0]:
trait_name = "fructose"
# open in binary mode and close the file handle when done
with open("trait_name.dill", "wb") as f:
    dill.dump(trait_name, f)
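
To show how a later notebook might pick the value back up, here is a minimal round-trip sketch. It writes to a temporary directory rather than the working directory, and it falls back to the standard-library `pickle` when `dill` is not installed; both choices are assumptions made only so the example runs anywhere, not part of the original workflow.

```python
import os
import tempfile

try:
    import dill as serializer      # what this notebook uses
except ImportError:
    import pickle as serializer    # stand-in exposing the same dump/load calls

trait_name = "fructose"

# Hypothetical location; the notebook itself writes to the working directory.
path = os.path.join(tempfile.mkdtemp(), "trait_name.dill")

# Serialized data is bytes, so open the file in binary mode for writing...
with open(path, "wb") as handle:
    serializer.dump(trait_name, handle)

# ...and in binary mode again when reading it back in another process.
with open(path, "rb") as handle:
    restored = serializer.load(handle)

print(restored)
```
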

####We need to tweak the phenotype data a bit for later analysis, so run this cell.

In [0]:
pheno = pheno[['Longitude', 'Latitude','Clone_id',trait_name]]
pheno.index = pheno.Clone_id
pheno = pheno.drop('Clone_id', axis=1)
pheno = pheno.astype(float)

####Have a peek at the phenotype data.

In [0]:
pheno.head()

####Load the rest of the data.  This will probably take a few minutes.

In [0]:
data_ai = pd.read_excel("/gdc_home5/groups/congenomics/day5/landscape_genetics_data/Genetics_2010/Eckert_Genetics_2010_data.xlsx")
data_gt = pd.read_excel("/gdc_home5/groups/congenomics/day5/landscape_genetics_data/Genetics_2010/Eckert_Genetics_2010_data.xlsx", 
                        sheetname="genotyping_data")
data_loc = pd.read_excel("/gdc_home5/groups/congenomics/day5/landscape_genetics_data/Genetics_2010/Eckert_Genetics_2010_data.xlsx",
                         sheetname="county_locality")

In [0]:
def add_county_state_key(df, c, s):
    d = df.copy()
    d['county_state'] = d.apply(lambda row: "%s_%s" % (row[c], row[s]), axis=1)
    return d
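
As a quick illustration of what `add_county_state_key` produces, the sketch below runs an equivalent helper on a tiny made-up table. The county names, state codes, and the `aridity` column are hypothetical values chosen for the example, and the string concatenation is a vectorized alternative to the row-wise `apply` above, not the notebook's own code.

```python
import pandas as pd

# Made-up rows; the real sheets come from the Excel workbook.
toy = pd.DataFrame({
    "County": ["Clarke", "Wake"],
    "State": ["GA", "NC"],
    "aridity": [1.0, 2.0],
})

def add_county_state_key(df, c, s):
    """Return a copy of df with a 'county_state' column such as 'Clarke_GA'."""
    d = df.copy()
    # Column-wise concatenation gives the same strings as the row-wise apply.
    d["county_state"] = d[c].astype(str) + "_" + d[s].astype(str)
    return d

keyed = add_county_state_key(toy, "County", "State")
print(keyed["county_state"].tolist())
```

Because the helper works on a copy, the input table is left unchanged, which makes it safe to rerun the cell.
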

###The next three cells massage the data a bit more.

In [0]:
data_ai = add_county_state_key(data_ai, "County", "State")
data_ai.ix[:,0:2] = data_ai.ix[:,0:2].astype(str)
data_ai.ix[:,2:-1] = data_ai.ix[:,2:-1].astype(float)
data_ai.head()

In [0]:
data_loc = add_county_state_key(data_loc, "county", "state")
data_loc.ix[:,0:2] = data_loc.ix[:,0:2].astype(str)
data_loc.ix[:,2:-1] = data_loc.ix[:,2:-1].astype(float)
data_loc.head()

In [0]:
data_gt = data_gt.astype(str)
data_gt = add_county_state_key(data_gt, "county", "state")
data_gt.head()

####We'll be using HDF5 to store our data across notebooks, in addition to the `dill` serialization above.

This is a widely used technology for large and hierarchical data sets, though we won't really be using it that way.  If you want, you can read more about this technology from [HDF5 themselves](http://www.hdfgroup.org/HDF5/) or [PyTables](https://pytables.github.io/usersguide/tutorials.html).  Run through these `.put()` cells to save the data out to disk for later use.

You may also be wondering about the `HDFStoreHelper` module.  I wrote that to save myself some time and add some safety around common HDF5 operations.  Feel free to make it your own.

In [0]:
hdf = HDFStoreHelper("data.hd5")

In [0]:
hdf.put("pheno", pheno)

In [0]:
hdf.put("data_ai", data_ai)

In [0]:
hdf.put("data_loc", data_loc)

In [0]:
hdf.put("data_gt", data_gt)

####Sometimes it's nice to have genotype data all by itself.  Let's filter our data down to just that by using the naming pattern in the columns of `data_gt`.

Run these cells to create the `DataFrame`, view a bit of it, and then save it to disk.

In [0]:
genotypes = data_gt.ix[:,[x for x in data_gt.columns if '-' in x]]

In [0]:
genotypes.head()

In [0]:
hdf.put("genotypes", genotypes)

####Let's load up some functions to do some work later.  Even though they're not commented, I think it's still pretty readable what they do.  

Feel free to browse the code and ask questions.  Not all programmers are the same, so the way my brain works might not make sense to you.

In [0]:
def is_homozygous(gt):
    if len(set([x.strip() for x in gt.split("/")])) == 1:
        return True
    return False

def get_allele_counts(counts):
    a = {}
    het = 0
    for gt in counts.index:
        for allele in [x.strip() for x in gt.split("/")]:
            if not allele in a:
                a[allele] = 0
            a[allele] += counts[gt]
        if not is_homozygous(gt):
            het += counts[gt]
    return sorted(a.items(), key=lambda x: x[1], reverse=True), het

def get_correction(n):
    #for finite sample size
    return (2*n)/(2*n-1)

def get_allele_freqs(locus):
    locus = locus[locus != '?/?']
    locus = locus[locus != 'NA']
    c = locus.value_counts()
    c = c.sort(inplace=False, ascending=False)
    allele_counts = get_allele_counts(c)
    total_alleles = 2.0*sum(c)
    num_individuals = sum(c)
    A = ""
    a = ""
    P = 0
    Q = 0
    if len(allele_counts[0]) == 2:
        A = allele_counts[0][0][0]
        a = allele_counts[0][1][0]
        P = allele_counts[0][0][1]
        Q = allele_counts[0][1][1]
    else:
        A = allele_counts[0][0][0]
        P = allele_counts[0][0][1]
    PQ = allele_counts[-1]
    p = P/total_alleles
    q = Q/total_alleles
    # compare with a tolerance; exact float equality can fail from rounding
    assert abs(p + q - 1.0) < 1e-9
    He = 2 * p * q * get_correction(num_individuals)
    Ho = PQ*1.0/num_individuals
    Fis = 1 - (Ho/He)
    #print p, q, He, Ho, Fis
    ret = pd.Series({"p":p, 
                      "q":q,
                      "P":P,
                      "Q":Q,
                      "He":He,
                      "Ho":Ho, 
                      "Fis":Fis,
                    "PQ": PQ,
                    "total_alleles":total_alleles,
                    "num_indiv":num_individuals,
                    "A":A,
                    "a":a})
    return ret

def plot_hist(df, index):
    d = df.ix[index,:]
    plt.hist(d, bins=20)
    plt.title("%s %.2f $\pm$ %.3f [%.2f, %.2f]" % (index, 
                                                   np.mean(d), 
                                                   np.std(d),
                                                  np.min(d),
                                                  np.max(d)))
    
def convert_to_z12(locus):
    freq = af[locus.name]
    trans = {"%s/%s" % (freq["A"],freq["A"]): 0,
            "%s/%s" % (freq["a"],freq["a"]): 2,
            "%s/%s" % (freq["A"],freq["a"]): 1,
            "%s/%s" % (freq["a"],freq["A"]): 1,
            "?/?":-1}
    return locus.apply(lambda x: trans[x])

def center_and_standardize_value(val, u, var):
    if val == -1:
        return 0.0
    return (val-u)/np.sqrt(var)

def center_and_standardize(snp):
    maf = af.ix["q",snp.name]
    u = np.mean([x for x in snp if x != -1])
    # variance term only; center_and_standardize_value takes the square root
    var = maf*(1-maf)
    return snp.apply(center_and_standardize_value, args=(u, var))
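
To make the quantities returned by `get_allele_freqs` concrete, here is a plain-Python sketch that computes them for one made-up locus. The helper name `toy_locus_summary` and the ten genotypes are hypothetical; the formulas (allele frequencies, expected heterozygosity with the 2n/(2n-1) correction, observed heterozygosity, and Fis) follow the functions above.

```python
from collections import Counter

def toy_locus_summary(genotypes):
    """Summarize one biallelic locus given 'X/Y' genotype strings."""
    called = [g for g in genotypes if g not in ("?/?", "NA")]  # drop missing calls
    n = len(called)
    allele_counts = Counter()
    n_het = 0
    for g in called:
        left, right = [x.strip() for x in g.split("/")]
        allele_counts[left] += 1
        allele_counts[right] += 1
        if left != right:
            n_het += 1
    ranked = allele_counts.most_common()          # most frequent allele first
    major, P = ranked[0]
    minor, Q = ranked[1] if len(ranked) > 1 else ("", 0)
    p = P / float(2 * n)
    q = Q / float(2 * n)
    He = 2 * p * q * (2.0 * n) / (2 * n - 1)      # finite-sample correction
    Ho = n_het / float(n)
    Fis = 1 - Ho / He if He > 0 else float("nan")  # undefined when monomorphic
    return {"A": major, "a": minor, "p": p, "q": q, "He": He, "Ho": Ho, "Fis": Fis}

# Four A/A, four A/G, two G/G individuals.
toy = ["A/A"] * 4 + ["A/G"] * 4 + ["G/G"] * 2
summary = toy_locus_summary(toy)
print(summary)
```

With these counts, p = 0.6, q = 0.4, and Ho = 0.4; the corrected He is about 0.505, so Fis comes out near 0.208, a modest deficit of heterozygotes.
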

####This next set of cells does a few things:

1. Creates a `DataFrame` to hold some allele frequency data, saves it, and plots Fis from it.
1. Converts the base/base genotypes into a 0/1/2 matrix, like the one you might get out of `vcftools`.  Note that a 2 means homozygous for the minor allele, and -1 represents missing data.
1. Takes the 012 matrix and normalizes it (subtracting the mean genotype from each value and dividing by a scaling term based on the minor allele frequency).  Missing values are set to 0 after normalization.
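
The last two steps can be sketched in plain Python on a single made-up locus. The helper names `encode_012` and `standardize` are hypothetical, and the scaling term is assumed to be sqrt(q(1-q)), the form given in Patterson et al. 2006; here q is derived from the codes themselves, whereas the notebook reads it from the `af` table.

```python
import math

def encode_012(genotypes, major, minor):
    """Count copies of the minor allele per genotype; -1 marks missing data."""
    lookup = {
        "%s/%s" % (major, major): 0,
        "%s/%s" % (major, minor): 1,
        "%s/%s" % (minor, major): 1,
        "%s/%s" % (minor, minor): 2,
        "?/?": -1,
    }
    return [lookup[g] for g in genotypes]

def standardize(codes, q):
    """Center on the mean of called genotypes and scale by sqrt(q * (1 - q))."""
    called = [c for c in codes if c != -1]
    mean = sum(called) / float(len(called))
    scale = math.sqrt(q * (1 - q))
    # Missing genotypes become 0.0, i.e. they sit at the mean after centering.
    return [0.0 if c == -1 else (c - mean) / scale for c in codes]

toy = ["A/A", "A/G", "G/G", "?/?", "A/A"]
codes = encode_012(toy, "A", "G")

n_called = sum(1 for c in codes if c != -1)
q = sum(c for c in codes if c != -1) / float(2 * n_called)  # minor allele frequency

z = standardize(codes, q)
print(codes)
print(z)
```

Setting missing calls to zero after centering means they contribute nothing to downstream covariance calculations, which is one common way to handle sparse missingness before PCA.
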

In [0]:
af = genotypes.apply(get_allele_freqs)

In [0]:
af.head()

In [0]:
hdf.put("af", af)

In [0]:
plot_hist(af, "Fis")

In [0]:
z12 = genotypes.apply(convert_to_z12)

In [0]:
hdf.put("z12", z12)

In [0]:
pca_std = z12.apply(center_and_standardize)

In [0]:
hdf.put("pca_std", pca_std)

In [0]:
z12.head()

####This is the standardization from Patterson et al. 2006 that I talked about in the lecture.

In [0]:
pca_std.head()