#Exploring Patterns of Population Structure and Environmental Associations to Aridity Across the Range of Loblolly Pine

##Introduction

In this set of analyses, we will be making use of data from the Eckert et al. 2010 paper to explore patterns of phenotypic and environmental associations among populations of loblolly pine.


###Abstract

Natural populations of forest trees exhibit striking phenotypic adaptations to diverse environmental gradients, thereby making them appealing subjects for the study of genes underlying ecologically relevant phenotypes. Here, we use a genome-wide data set of single nucleotide polymorphisms genotyped across 3059 functional genes to study patterns of population structure and identify loci associated with aridity across the natural range of loblolly pine (Pinus taeda L.). Overall patterns of population structure, as inferred using principal components and Bayesian cluster analyses, were consistent with three genetic clusters likely resulting from expansions out of Pleistocene refugia located in Mexico and Florida. A novel application of association analysis, which removes the confounding effects of shared ancestry on correlations between genetic and environmental variation, identified five loci correlated with aridity. These loci were primarily involved with abiotic stress response to temperature and drought. A unique set of 24 loci was identified as FST outliers on the basis of the genetic clusters identified previously and after accounting for expansions out of Pleistocene refugia. These loci were involved with a diversity of physiological processes. Identification of nonoverlapping sets of loci highlights the fundamental differences implicit in the use of either method and suggests a pluralistic, yet complementary, approach to the identification of genes underlying ecologically relevant phenotypes.


##Overview of tasks

In general, you will work your way from loading and saving the data for this study, through corrections for population structure, to testing for associations between genotypes and phenotypes, genotypes and the environment (`Bayenv2`), and genotypes, phenotypes, and environment together (`SQUAT`).

## This notebook

This notebook walks you through testing SNP-phenotype associations while correcting for population structure.  You will:

1. Remove the effect of population structure on the phenotype
1. Remove the effect of population structure on the genotype
1. Perform genotype-phenotype correlations for each SNP
1. Correct for multiple tests
1. Explore the full Patterson method (leave-one-out)

As with the previous notebook, execute the cell with the imports and continue.

In [0]:
from __future__ import division  # __future__ imports belong before all other imports
import os, sys
from IPython.display import Image
import pandas as pd
import numpy as np
import rpy2
from rpy2 import robjects as ro
import pandas.rpy.common as com
import matplotlib.pyplot as plt
import seaborn as sns
import operator
import scipy as sp
import traceback
from sklearn import preprocessing
from IPython.parallel import Client
from subprocess import Popen, PIPE
import shutil
from IPython.display import FileLink, FileLinks
import psutil
import multiprocessing
from hdfstorehelper import HDFStoreHelper
import warnings
import pandas
import dill
import statsmodels as sm
import statsmodels.formula.api as smf
from scipy.stats.stats import pearsonr
warnings.simplefilter("ignore", pandas.io.pytables.PerformanceWarning)
%matplotlib inline

%load_ext rpy2.ipython
pd.set_option('display.width', 80)
pd.set_option('max.columns', 30)

%load_ext autoreload
%autoreload 2

sns.set_context("talk")

In [0]:
r = ro.r

In [0]:
%%R
library(qvalue)

In [0]:
hdf = HDFStoreHelper("data.hd5")

In [0]:
pca_drop_std = hdf.get("pca_drop_std")
pheno = hdf.get("pheno")
pca_cov = hdf.get("pca_cov")
trait_complete = hdf.get("trait_complete")

In [0]:
trait_complete_pca = trait_complete.join(pca_cov, how="inner").merge(pca_drop_std, 
                                                left_index=True, 
                                                right_index=True,
                                               suffixes = ('_hierf','_pca'))

In [0]:
trait_complete_pca

In [0]:
# Open the pickle in binary mode and close the file handle when done
with open("trait_name.dill", "rb") as fh:
    trait_name = str(dill.load(fh))

In [0]:
pca_cols = [x for x in trait_complete_pca if 'PC' in x]
pca_cols
pca_cov_string = "+".join(pca_cols)
pca_cov_string

####Did you know that you can do linear regression in Python?  

Well, there's that plus much, much more.  Ready to abandon `R` yet?

The cell below regresses the phenotype on the PCA axes, which serve as covariates for population structure.  After you run the fit, check out the summary.  Notice anything?

In [0]:
"%s~%s" % (trait_name, pca_cov_string)

In [0]:
p_fit = smf.ols(formula="%s~%s" % (trait_name, pca_cov_string), 
                data=trait_complete_pca).fit()

In [0]:
p_fit.summary()

In [0]:
p_fit.resid
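
For intuition, the residuals in `p_fit.resid` are what is left of the phenotype once the best-fitting linear combination of the PCA axes (plus an intercept) has been subtracted. Here is a minimal NumPy sketch of that idea on made-up data. The function `residualize` and the simulated arrays `pcs` and `pheno_sim` are hypothetical names for illustration only, not part of the original analysis.

```python
import numpy as np

def residualize(y, covariates):
    """Residuals of y after least-squares regression on covariates plus an intercept."""
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones (intercept) followed by the covariates
    X = np.column_stack([np.ones(len(y)), np.asarray(covariates, dtype=float)])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return y - X.dot(beta)

# Simulated stand-ins: 50 individuals, 3 "PCA axes", and a phenotype that
# depends on two of the axes plus noise (all values are invented)
rng = np.random.RandomState(0)
pcs = rng.normal(size=(50, 3))
pheno_sim = 2.0 * pcs[:, 0] - 1.0 * pcs[:, 2] + rng.normal(scale=0.5, size=50)

resid = residualize(pheno_sim, pcs)
```

By construction, these residuals are uncorrelated with every covariate, which is exactly what "removing the effect of population structure" means in this notebook.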

####Because we have to use this data in `R` later, let's make it friendly by fixing "-" and the fact that some SNPs start with numbers.

In [0]:
trait_complete_pca.columns = [x.replace("-", "_") if '_pca' in x else x for x in trait_complete_pca.columns]
trait_complete_pca.columns = ["X%s" % x if '_pca' in x else x for x in trait_complete_pca.columns]
trait_complete_pca.head()

In [0]:
hdf.put("trait_complete_pca", trait_complete_pca)

####Let's do those regressions

1. For each SNP, remove the effect of population structure
1. Correlate that SNP's residuals with the residuals of the phenotype
1. Rejoice.

This takes a minute or two on this data set.  If you have 100,000 SNPs in your data, go get a beer (or 5).

In [0]:
snp_num = 0
snp_cor = {}
for col in trait_complete_pca:
    if "_pca" in col:
        formula="%s ~ %s" % (str(col), pca_cov_string)
        g_fit = smf.ols(formula=formula, data=trait_complete_pca).fit()
        corrected = pearsonr(g_fit.resid, p_fit.resid)
        snp_cor[col] = corrected
        snp_num += 1
        if snp_num % 100 == 0:
            print "at %d" % snp_num
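
If the per-SNP loop gets too slow, the same residual-on-residual correlation can be computed for all SNPs at once with a little linear algebra. The sketch below is one way to do that in plain NumPy, under the assumption that the covariate matrix has full column rank. The function `partial_correlations` and the simulated inputs are hypothetical, and the sketch returns only the correlation coefficients, not the p-values that `pearsonr` also gives you.

```python
import numpy as np

def partial_correlations(genotypes, phenotype, covariates):
    """Pearson r between covariate-adjusted SNP columns and the adjusted phenotype."""
    G = np.asarray(genotypes, dtype=float)   # individuals x SNPs
    y = np.asarray(phenotype, dtype=float)   # one value per individual
    X = np.column_stack([np.ones(len(y)), np.asarray(covariates, dtype=float)])
    # The columns of Q form an orthonormal basis for the column space of X
    # (assuming full column rank), so subtracting the projection onto Q
    # gives the ordinary least-squares residuals.
    Q = np.linalg.qr(X)[0]
    G_res = G - Q.dot(Q.T.dot(G))
    y_res = y - Q.dot(Q.T.dot(y))
    # The intercept makes every residual vector mean-zero, so Pearson r
    # reduces to a normalized dot product.
    num = G_res.T.dot(y_res)
    den = np.sqrt((G_res ** 2).sum(axis=0) * (y_res ** 2).sum())
    return num / den

# Invented data: 60 individuals, 4 covariates, 25 SNPs coded 0/1/2
rng = np.random.RandomState(1)
pcs = rng.normal(size=(60, 4))
geno = rng.binomial(2, 0.3, size=(60, 25)).astype(float)
pheno_sim = rng.normal(size=60)

r_all = partial_correlations(geno, pheno_sim, pcs)
```

A SNP whose genotypes are fully explained by the covariates (a monomorphic SNP, for example) has zero residual variance and will come back as `nan`, so filter those out first.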

####Let's store these so we don't have to rerun them later.

In [0]:
snp_cor_df = pd.DataFrame(snp_cor, index=("r", "p")).T

In [0]:
snp_cor_df.head(20)

####We will use the Storey and Tibshirani (2003) method to correct for multiple tests, via their `R` package `qvalue`, available from `Bioconductor`

In [0]:
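def get_qvalues(pvalues):
    """Return Storey-Tibshirani q-values for a sequence of p-values."""
    qvalue = r("qvalue")
    # Hand R an explicit numeric vector instead of relying on implicit conversion
    qobj = qvalue(ro.FloatVector(list(pvalues)))
    qvalues = qobj.rx2("qvalues")
    return np.array(qvalues)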
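
To see roughly what a q-value is, here is a simplified pure-NumPy sketch of the Storey approach. It estimates the proportion of true null hypotheses from a single fixed `lam` cutoff, which is an assumption made for brevity, so its numbers will generally differ somewhat from what the `qvalue` package returns with its default settings. The function name `storey_qvalues` and the example p-values are hypothetical.

```python
import numpy as np

def storey_qvalues(pvalues, lam=0.5):
    """Simplified Storey q-values using one fixed lambda to estimate pi0."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    # p-values above lam are assumed to come mostly from true nulls
    pi0 = min(1.0, np.mean(p > lam) / (1.0 - lam))
    order = np.argsort(p)
    # Estimated false discovery rate when rejecting the i smallest p-values
    ranked = pi0 * m * p[order] / np.arange(1, m + 1)
    # Enforce monotonicity: q_(i) is the minimum over all ranks j >= i
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(ranked, 1.0)
    return q

# Invented p-values, deliberately unsorted
p_example = np.array([0.4, 0.001, 0.9, 0.02, 0.6, 0.01, 0.8, 0.2])
q_example = storey_qvalues(p_example)
```

With `pi0` fixed at 1 this reduces to the Benjamini-Hochberg adjustment, so you can think of q-values as a less conservative cousin of that correction.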

In [0]:
q = get_qvalues(snp_cor_df.p)

In [0]:
snp_cor_df['q'] = q

In [0]:
snp_cor_df.head()

####We can also get a description of a column in a `DataFrame`

In [0]:
snp_cor_df.q.describe()

####This next bit of code is a bit complicated, but luckily you don't have to run it.

Patterson et al. suggest recomputing the PCA for each SNP so that the SNP being tested for association is left out of the data used to build the axes.  Can you think about why this might be the case?  Can you imagine how long this will take for large data sets?  A data set that I ran this on for the talk I gave at Evolution 2014 had about 79,000 SNPs, and even across 700 CPUs it still took about 6 hours.
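
Stripped of the cluster machinery, the leave-one-out step for a single SNP is small. The sketch below uses a NumPy SVD in place of R's `prcomp` and assumes the genotype matrix has already been standardized, so no centering or scaling is applied. The function `leave_one_out_scores` and the simulated matrix are hypothetical names for illustration.

```python
import numpy as np

def leave_one_out_scores(genotypes, locus_index, n_components=14):
    """PC scores computed from every SNP column except the one being tested."""
    G = np.asarray(genotypes, dtype=float)   # individuals x SNPs
    reduced = np.delete(G, locus_index, axis=1)
    # For uncentered, unscaled data the PC scores are the left singular
    # vectors multiplied by the singular values.
    U, S, _ = np.linalg.svd(reduced, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

# Invented, already-standardized data: 30 individuals x 10 SNPs
rng = np.random.RandomState(2)
geno_std = rng.normal(size=(30, 10))

scores = leave_one_out_scores(geno_std, locus_index=3, n_components=5)
```

The signs of individual axes are arbitrary, which does not matter when the scores are only used as regression covariates. Doing this once per SNP means one full decomposition per SNP, which is where the run time goes.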

The reference code below relies on IPython's parallel machinery to:

1. Set up a client to connect to a cluster (after having started one)
1. Distribute the PCA function across the cluster
1. Run the jobs in parallel.  Caution here: if you have too many IPython engines running R processes, things sometimes crash.  No idea why.  Darn open source software.  I'm happy to talk about the code, but right now it's just here for reference.

```python
rc = Client(profile="gdcsrv2")
dview = rc[:]
lview = rc.load_balanced_view()
len(dview)
```

```python
def do_pca_in_parallel(locus):
    import os,sys
    sys.path.append("/gdc_home4/cfried/ipython/")
    os.environ['R_HOME'] = '/gdc_home4/cfried/R3/lib64/R'
    import rpy2.robjects as ro
    import pandas.rpy.common as com
    from pandas import DataFrame
    from hdfstorehelper import HDFStoreHelper
    hdf = HDFStoreHelper("/gdc_home5/groups/congenomics/day5/cfried/data.hd5")
    df = hdf.get("trait_complete_pca")
    r = ro.r
    prcomp = r('prcomp')
    df = df.ix[:,[x for x in df.columns if '_pca' in x]]
    df = df.drop(locus, axis=1)
    res = prcomp(com.convert_to_r_matrix(df), scale=False, center=False)
    x = com.convert_robj(res.rx2("x"))
    x.index = df.index
    return locus, x.ix[:,0:14].to_dict()

dview['do_pca_in_parallel'] = do_pca_in_parallel

pca_jobs = []
for col in trait_complete_pca:
    if "_pca" in col:
        job = lview.apply_async(do_pca_in_parallel, col)
        pca_jobs.append(job)
```

```python
ready = 0
pca_results = {}
for p in pca_jobs:
    if p.ready():
        res = p.r
        assert len(res) == 2
        pca_results[res[0]] = res[1]
        ready += 1
print ready, len(pca_jobs)

dill.dump(pca_results, open("pca_results.dill", "w"))

# `ols_results` is never defined in this notebook, so this line is left commented out
# dill.dump(ols_results, open("ols_results.dill", "w"))

pca_results = dill.load(open("pca_results.dill"))

snp_num = 0
snp_cor2 = {}
for col in trait_complete_pca:
    if "_pca" in col:
        pca_cov2 = pd.DataFrame(pca_results[col])
        df = trait_complete_pca.merge(pca_cov2, how="inner", left_index=True,
                                     right_index=True, suffixes=('_x','_y'))
        pca_cov_string = "+".join(["%s_y" % x for x in pca_cov2.columns])
        formula="%s ~ %s" % (str(col), pca_cov_string)
        
        g_fit = smf.ols(formula=formula, data=df).fit()
        if snp_num % 100 == 0:
            print "at %d" % snp_num
        corrected = pearsonr(g_fit.resid, p_fit.resid)
        snp_cor2[col] = corrected
        snp_num += 1
        

snp_cor2_df = pd.DataFrame(snp_cor2, index=['r', 'p']).T
snp_cor2_df.head()
cors = snp_cor_df.merge(snp_cor2_df, how="inner", suffixes=['_1', '_2'], left_index=True, right_index=True)
hdf.put("snp_cor_df", snp_cor_df)
hdf.put("snp_cor2_df", snp_cor2_df)
hdf.put("cors", cors)
        
```

In [0]:
hdf2_file = "/gdc_home5/groups/congenomics/day5/cors.hdf"
hdf2 = HDFStoreHelper(hdf2_file)

In [0]:
cors = hdf2.get("cors")  # read from the shared store opened in the previous cell
plt.scatter(cors.r_1, cors.r_2)
plt.xlabel("locus not removed")
plt.ylabel("locus removed")
plt.show()


In [0]:
cors.head()

In [0]:
q2 = get_qvalues(cors["p_2"])
cors["q2"] = q2

In [0]:
cors[cors.q < 0.05]

####How many SNPs are significantly correlated with your phenotype?

In [0]:
hdf.put("cors", cors)