#Exploring Patterns of Population Structure and Environmental Associations to Aridity Across the Range of Loblolly Pine

##Introduction

In this set of analyses, we will be making use of data from the Eckert et al. 2010 paper to explore patterns of phenotypic and environmental associations among populations of loblolly pine.


###Abstract

Natural populations of forest trees exhibit striking phenotypic adaptations to diverse environmental
gradients, thereby making them appealing subjects for the study of genes underlying ecologically relevant phenotypes. Here, we use a genome-wide data set of single nucleotide polymorphisms genotyped across 3059 functional genes to study patterns of population structure and identify loci associated with aridity across the natural range of loblolly pine (Pinus taeda L.). Overall patterns of population structure, as inferred using principal components and Bayesian cluster analyses, were consistent with three genetic clusters likely resulting from expansions out of Pleistocene refugia located in Mexico and Florida. A novel application of association analysis, which removes the confounding effects of shared ancestry on correlations between genetic and environmental variation, identified five loci correlated with aridity. These loci were primarily involved with abiotic stress response to temperature and drought. A unique set of 24 loci was identified as FST outliers on the basis of the genetic clusters identified previously and after accounting for expansions out of Pleistocene refugia. These loci were involved with a diversity of physiological processes. Identification of nonoverlapping sets of loci highlights the fundamental differences implicit in the use of either method and suggests a pluralistic, yet complementary, approach to the identification of genes underlying ecologically relevant phenotypes.


##Overview of tasks

In general, you will work your way from loading and saving the data for this study, through corrections for population structure, to testing for associations between genotypes and phenotypes, genotypes and the environment (`Bayenv2`), and genotypes + phenotypes + environment (`SQUAT`).

## This notebook

This notebook gets you working with Bayenv2 data.  I've done a lot of the upfront work for you, so just worry about interpretation rather than running things.  Code to run is included, of course.

As with the previous notebook, execute the cell with the imports and continue.

In [0]:
from __future__ import division  # __future__ imports belong at the top of the cell
import os, sys
from IPython.display import Image
import pandas as pd
import numpy as np
import rpy2
from rpy2 import robjects as ro
import pandas.rpy.common as com
import matplotlib.pyplot as plt
import seaborn as sns
import operator
import scipy as sp
import traceback
from sklearn import preprocessing
from IPython.parallel import Client
from subprocess import Popen, PIPE
import shutil
from IPython.display import FileLink, FileLinks, Image
import psutil
import multiprocessing
from hdfstorehelper import HDFStoreHelper
import warnings
import pandas
import dill
import statsmodels as sm
import statsmodels.formula.api as smf
from scipy.stats.stats import pearsonr
from mpl_toolkits.mplot3d import Axes3D
import matplotlib as mpl
warnings.simplefilter("ignore", pandas.io.pytables.PerformanceWarning)
%matplotlib inline

%load_ext rpy2.ipython
pd.set_option('display.width', 80)
pd.set_option('max.columns', 30)

%load_ext autoreload
%autoreload 2

sns.set_context("talk")

In [0]:
r = ro.r

In [0]:
hdf = HDFStoreHelper("data.hd5")

In [0]:
trait_name = str(dill.load(open("trait_name.dill")))

In [0]:
trait_name

####Let's load that Bayenv data from before

In [0]:
bayenv_df = hdf.get("bayenv_df")

In [0]:
bayenv_df.head()

####We need to add the county IDs back in, as well.

In [0]:
county_id = dill.load(open("county_id.dill"))

In [0]:
def add_county_id(row):
    key = "%s_%s" % (row.county,row.state)
    if key in county_id:
        return county_id[key]
    return np.nan
bayenv_df['countyid'] = bayenv_df.apply(add_county_id, axis=1)

In [0]:
bayenv_df.head(10)

In [0]:
bayenv_df = bayenv_df[bayenv_df.countyid > 0]

In [0]:
bayenv_df.head()

In [0]:
def convert_to_snpassoc(col):
    if "-" in col.name:
        freqs = af[col.name]
        trans = {11: "%s/%s" % (freqs["A"], freqs["A"]),
                12: "%s/%s" % (freqs["A"], freqs["a"]),
                22: "%s/%s" % (freqs["a"], freqs["a"]),
                "NA":"NA"}
        return col.apply(lambda x: trans[x])
    return col

def is_homozygous(gt):
    if len(set([x.strip() for x in gt.split("/")])) == 1:
        return True
    return False

def get_allele_counts(counts):
    a = {}
    het = 0
    for gt in counts.index:
        alleles = None
        if "/" in gt:
            alleles = [x.strip() for x in gt.split("/")]
        else:
            alleles = gt
        for allele in alleles:
            if not allele in a:
                a[allele] = 0
            a[allele] += counts[gt]
        if not is_homozygous(gt):
            het += counts[gt]
    return sorted(a.items(), key=lambda x: x[1], reverse=True), het


def get_correction(n):
    #for finite sample size
    return (2*n)/(2*n-1)

def get_allele_freqs(locus):
    locus = locus[locus != '?/?']
    locus = locus[locus != 'NA']
    c = locus.value_counts()
    c = c.sort(inplace=False, ascending=False)
    allele_counts = get_allele_counts(c)
    total_alleles = 2.0*sum(c)
    num_individuals = sum(c)
    A = ""
    a = ""
    P = 0
    Q = 0
    if len(allele_counts[0]) == 2:
        A = allele_counts[0][0][0]
        a = allele_counts[0][1][0]
        P = allele_counts[0][0][1]
        Q = allele_counts[0][1][1]
    else:
        A = allele_counts[0][0][0]
        P = allele_counts[0][0][1]
    PQ = allele_counts[-1]
    p = P/total_alleles
    q = Q/total_alleles
    assert abs(p + q - 1.0) < 1e-9  # tolerate floating-point rounding
    He = 2 * p * q * get_correction(num_individuals)
    Ho = PQ*1.0/num_individuals
    Fis = 1 - (Ho/He)
    #print p, q, He, Ho, Fis
    ret = pd.Series({"p":p, 
                      "q":q,
                      "P":P,
                      "Q":Q,
                      "He":He,
                      "Ho":Ho, 
                      "Fis":Fis,
                    "PQ": PQ,
                    "total_alleles":total_alleles,
                    "num_indiv":num_individuals,
                    "A":A,
                    "a":a})
    return ret
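
To make the quantities returned by `get_allele_freqs` concrete, here is a minimal standalone sketch that computes the same summaries for a single locus from plain genotype strings. The name `allele_stats` and the toy genotypes are made up for illustration, and the sketch assumes a biallelic locus; it is not a replacement for the functions above.

```python
from collections import Counter

def allele_stats(genotypes):
    """Summarise one biallelic locus from genotype strings such as 'A/G' (hypothetical helper)."""
    # Drop missing calls, using the same missing-data codes as the notebook.
    called = [g for g in genotypes if g not in ("NA", "?/?")]
    n = len(called)
    allele_counts = Counter()
    n_het = 0
    for g in called:
        a1, a2 = [x.strip() for x in g.split("/")]
        allele_counts[a1] += 1
        allele_counts[a2] += 1
        if a1 != a2:
            n_het += 1
    # Major allele first; a monomorphic locus has no minor allele.
    ranked = allele_counts.most_common()
    P = ranked[0][1]
    Q = ranked[1][1] if len(ranked) > 1 else 0
    p = P / (2.0 * n)
    q = Q / (2.0 * n)
    He = 2 * p * q * (2.0 * n) / (2.0 * n - 1)  # finite-sample correction
    Ho = n_het / float(n)
    Fis = 1 - Ho / He if He > 0 else float("nan")
    return {"P": P, "Q": Q, "p": p, "q": q, "He": He, "Ho": Ho, "Fis": Fis}

# Toy genotypes for five called individuals plus one missing call.
stats = allele_stats(["A/A", "A/G", "A/G", "G/G", "A/A", "NA"])
```

For the toy data the major allele is counted 6 times and the minor allele 4 times, so p = 0.6 and q = 0.4, and with two heterozygotes out of five individuals Fis works out to 0.25.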

In [0]:
af = hdf.get("af")
bayenv_df = bayenv_df.apply(convert_to_snpassoc)

####We need to get allele frequencies by population

This also takes a few minutes.

In [0]:
pop_allele_freqs = {}
for pop,data in bayenv_df.groupby("countyid"):
    print "getting allele freqs for pop % d" % pop
    pop_allele_freqs[pop] = data.ix[:,6:].apply(get_allele_freqs)

####The allele freqs data from above also returns counts of each allele, which is really what we need for Bayenv.  

Remember the data we called `af` before?  This is the same, but for each population.  I call the counts of the major and minor alleles P and Q, and their frequencies p and q, respectively.  Probably a bad habit, I admit.

Let's get those counts for the SNPs and put them into a file that Bayenv can understand.  The Bayenv file format also sucks, and it's worth saying so to all of your friends.  Who ends lines with a tab?
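
Judging from the `write_bayenv_snp` cell in this notebook, each SNP takes two lines in the Bayenv input, one per allele, with one tab-separated count per population and a trailing tab on every line. The sketch below builds those two lines in memory for made-up counts; `format_bayenv_snp` is a hypothetical helper, not part of Bayenv.

```python
def format_bayenv_snp(P, Q):
    """Return the two Bayenv lines for one SNP, or '' if the SNP is monomorphic."""
    if sum(Q) == 0:
        return ""
    # One count per population, tab-separated, with the trailing tab Bayenv expects.
    line_q = "\t".join(str(x) for x in Q) + "\t\n"
    line_p = "\t".join(str(x) for x in P) + "\t\n"
    return line_q + line_p

# Made-up counts for three populations; the second SNP is monomorphic and is skipped.
snp_text = (format_bayenv_snp(P=[10, 8, 12], Q=[2, 4, 0]) +
            format_bayenv_snp(P=[12, 12, 12], Q=[0, 0, 0]))
```

Skipping monomorphic SNPs up front means the names file and the SNP file stay in step, which matters later when results are matched back to loci.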

In [0]:
def get_bayenv_snp(snp_name, popids):
    # Collect major (P) and minor (Q) allele counts for one SNP, one entry per population.
    P = []
    Q = []
    for popid in popids:
        P.append(pop_allele_freqs[popid].ix["P",snp_name])
        Q.append(pop_allele_freqs[popid].ix["Q",snp_name])
    return P, Q

def write_bayenv_snp(fh_snp, fh_names, name, P, Q):
    if sum(Q) > 0: #exclude monomorphic loci
        if fh_names:
            fh_names.write("%s\n" % name)
        P = [str(x) for x in P]
        Q = [str(x) for x in Q]
        fh_snp.write("%s\t\n" % "\t".join(Q))
        fh_snp.write("%s\t\n" % "\t".join(P))

In [0]:
bayenv_dir = "bayenv"
snp_names = [x for x in bayenv_df.columns if "-" in x]
popids = sorted(bayenv_df.countyid.unique())

if not os.path.exists(bayenv_dir):
    os.mkdir(bayenv_dir)

with open("bayenv.txt", "w") as o:
    with open("bayenv_names.txt", "w") as n:
        for name in snp_names:
            P,Q = get_bayenv_snp(name, popids)
            write_bayenv_snp(o, n, name, P, Q)

####Let's look at the first two (wrapped) lines of the file.  

Look familiar?

In [0]:
!head -n2 bayenv.txt

####Copy that file into your bayenv directory for use later, as below.  

Have you noticed that IPython can also call shell commands on the server?

In [0]:
!cp bayenv.txt bayenv/

##Run Bayenv to create variance-covariance matrix

```bash
    cd bayenv && /gdc_home4/cfried/src/bayenv2/bayenv2 -i bayenv.txt -p 30 -k 100000 -r 63479 > matrix.out
```

* `-p`: number of populations (`len(popids)`)
* `-k`: number of MCMC generations
* `-r`: random seed

This has already been done for you since it takes several hours for 100,000 generations.  This same data takes about 15 hours for 1,000,000 generations.

####As with all things Bayesian, it's good to look at the data to make sure that the MCMC chain is mixing properly.

This is just one way: compute the Pearson correlation between the last matrix and every previous one.  You can imagine any number of other checks, I'm sure, but overall the chain seems to converge very quickly.
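
As a standalone illustration of that check, the sketch below flattens a few small matrices and correlates each earlier one with the last. The 2x2 matrices are invented and `pearson` is a plain-Python stand-in for `scipy.stats.pearsonr`, so treat it as a sketch of the idea rather than the notebook's actual code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = float(len(xs))
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def flatten(matrix):
    """Turn a list of rows into one flat list of values."""
    return [v for row in matrix for v in row]

# Hypothetical covariance draws, oldest first.
draws = [
    [[1.0, 0.9], [0.9, 1.0]],
    [[1.0, 0.3], [0.3, 1.2]],
    [[1.0, 0.2], [0.2, 1.1]],
]
last = flatten(draws[-1])
cors = [pearson(flatten(m), last) for m in draws[:-1]]
```

Keep in mind that Pearson correlation ignores overall shifts and rescaling, so a value near 1 says two matrices share the same pattern, not that their entries are equal.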

In [0]:
vcovs = []
current = None
for line in open("bayenv/matrix.out"):
    if "VAR-COVAR" in line:
        current = []
        vcovs.append(current)
    if isinstance(current, list):
        current.append(line.strip().split("\t"))
    

In [0]:
vcov_dfs = []
for elem in vcovs:
    # skip the "VAR-COVAR" header line and keep the matrix rows
    vcov_dfs.append(pd.DataFrame(elem[1:]).T)

In [0]:
matrix_correlation = np.zeros((len(vcovs), len(vcovs)))
for i in xrange(len(vcovs)):
    if i == len(vcovs)-1:
        for j in xrange(i+1):
            idf = vcov_dfs[i]
            jdf = vcov_dfs[j]
            idf = idf.ix[:,:len(idf)-1]
            jdf = jdf.ix[:,:len(jdf)-1]
            idf = [float(x) for x in idf.values.flatten()]
            jdf = [float(x) for x in jdf.values.flatten()]
            assert len(idf) == len(jdf)
            matrix_correlation[i, j] = sp.stats.pearsonr(idf, jdf)[0]
            matrix_correlation[j, i] = matrix_correlation[i, j]

In [0]:
ys = []
for i in xrange(len(matrix_correlation)):
    for j in xrange(i):
        if i == len(vcovs)-1:
            ys.append(matrix_correlation[i,j])
plt.plot(range(len(ys)),ys)
plt.title("Pearson correlation of the last VCOV matrix with each of the %d earlier ones" % len(ys))
plt.show()

####Does this tell you anything about the process?  Are you comfortable choosing the last matrix?

####One way to view the covariance structure of a matrix (a moderately sized one) is a heatmap. 

In Python, here's one way.

In [0]:
vcov = pd.read_csv("bayenv/matrix_last.out", sep="\t", header=None)
vcov = vcov.ix[:,:len(vcov.columns)-2]
sns.heatmap(vcov)
plt.show()

## Run Bayenv MCMC

The code below walks you through setting this all up, but it does not run anything. I've found that Bayenv is best run with GNU `parallel`, with each job pinned to a single CPU because of how the software is written.
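
The pinning idea can be sketched in a few lines: prefix each job with `taskset -c <cpu>` and cycle through the available CPU ids. The helper name `pin_jobs` and the command strings are placeholders for illustration; the notebook's own cells build the real Bayenv commands.

```python
def pin_jobs(commands, max_cpus):
    """Prefix each shell command with taskset, assigning CPU ids round-robin."""
    return ["taskset -c %d %s" % (i % max_cpus, cmd)
            for i, cmd in enumerate(commands)]

# Placeholder commands; with two CPUs the third job wraps back to CPU 0.
jobs = pin_jobs(["./run_snp_a", "./run_snp_b", "./run_snp_c"], max_cpus=2)
```

Round-robin assignment only balances the load if the jobs take roughly the same time, which seems a reasonable assumption when every job runs the same number of generations on one SNP.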

In [0]:
data_ai = hdf.get("data_ai")
data_ai['county_state'] = data_ai.apply(lambda row: "%s_%s" % (row.County, row.State), axis=1)
bayenv_df_ai = bayenv_df.merge(data_ai, on='county_state')
bayenv_df_ai.head()

In [0]:
bayenv_df_ai.shape

In [0]:
def get_bayenv_env(data):
    E = pd.Series()
    for col in data.columns[:-1]:
        E[col] = data[col].values[0]
    return E

ai_cols = [x for x in bayenv_df_ai if 'AI_' in x]
ai_cols.append('countyid')
bayenv_df_ai_groups = bayenv_df_ai.ix[:,ai_cols].groupby("countyid")
env_ai = []
for popid in popids:
    env_ai.append(get_bayenv_env(bayenv_df_ai_groups.get_group(popid))) 
env_ai_df = pd.DataFrame(env_ai).T
env_ai_df = env_ai_df.apply(preprocessing.scale, axis=1)
env_ai_df
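
The `preprocessing.scale` call standardizes each environmental variable across populations to mean 0 and unit variance. A minimal plain-Python sketch of that transformation follows; the aridity values are invented and `zscore` is a hypothetical helper that uses the population standard deviation, which is what `preprocessing.scale` uses.

```python
import math

def zscore(values):
    """Center to mean 0 and scale to unit variance (population SD, ddof=0)."""
    n = float(len(values))
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

# Invented aridity values, one per population.
aridity = [0.8, 1.0, 1.2, 1.6, 0.9]
scaled = zscore(aridity)
```

After scaling, the values are in units of standard deviations, so variables measured on different scales become comparable.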

In [0]:
def add_last_column(row):
    row[len(row)] = ""
    return row 

env_ai_df.apply(add_last_column, axis=1).to_csv("envmatrix.txt", 
                                               header=None,
                                                index=True,
                                               sep="\t")

In [0]:
!cp bayenv/matrix_last.out .

In [0]:
def setup_bayenv_cmd(snpfile, name):
    work_dir = os.path.join(os.path.abspath("."), "bayenv")
    bayenv = "/gdc_home4/cfried/src/bayenv2/bayenv2"
    bayenv_matrix = "matrix_last.out"
    bayenv_seed = -47372
    bayenv_pops = len(popids)  # must match the -p used to build the covariance matrix
    bayenv_runs = 100000
    bayenv_environs = 4
    bayenv_envmatrix = "envmatrix.txt"
    bayenv_cmd = "cd %s/%s && %s -i %s -m %s -e %s -p %d -k %d -n %d -t -c -f -X -o %s" % (work_dir, 
                                                                                        name,
                                                                                        bayenv,
                                                                         snpfile,
                                                                         bayenv_matrix,
                                                                         bayenv_envmatrix,
                                                                         bayenv_pops,
                                                                     bayenv_runs,
                                                                     bayenv_environs,
                                                                             snpfile)
    shutil.copy(bayenv_matrix, os.path.join(work_dir, name))
    shutil.copy(bayenv_envmatrix, os.path.join(work_dir, name))
    return bayenv_cmd


In [0]:
cmds = []
if not os.path.exists(bayenv_dir):
    os.mkdir(bayenv_dir)

for name in snp_names:
    P,Q = get_bayenv_snp(name,popids)
    if sum(Q) > 0:
        file_dir = os.path.join(bayenv_dir, name)        
        if not os.path.exists(file_dir):
            os.mkdir(file_dir)
        o = open(os.path.join(file_dir, "%s.txt" % name), "w")
        write_bayenv_snp(o, None, name, P, Q)
        o.close()
        cmd = setup_bayenv_cmd(os.path.basename(o.name), name)
        cmds.append(cmd)

In [0]:
with open("bayenv_jobs.txt", "w") as o:
    cpu = 0
    max_cpus = 20
    for cmd in cmds:
        c = cmd.split()
        c[2] = "&& taskset -c %d" % cpu
        o.write("%s\n" % " ".join(c))
        cpu += 1
        if cpu == max_cpus:
            cpu = 0
        

## Run all the commands

But don't, I already did it for you.

```bash
cat bayenv_jobs.txt | parallel -j 20 --eta --
```

####One of the commands looks like this, if you're curious

In [0]:
cmds[0]

####We can walk the bayenv directory and find all of the files that contain bf.  

Luckily, none of my files have bf in the name other than in the file extension. Be careful with regexes.
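
If a locus name did contain `bf`, a plain substring match would pick up the wrong files. One way around that, sketched here with throwaway files in a temporary directory, is to match on the file suffix instead. The helper `find_by_suffix` and the file names are made up for illustration.

```python
import os
import tempfile

def find_by_suffix(root, suffix):
    """Return sorted paths under root whose file name ends with suffix."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            if fname.endswith(suffix):
                hits.append(os.path.join(dirpath, fname))
    return sorted(hits)

# Throwaway layout in which 'bf' also appears in the directory and file stems.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "bfactor-1"))
for fname in ("bfactor-1.txt", "bfactor-1.txt.bf", "bfactor-1.txt.xtx"):
    open(os.path.join(root, "bfactor-1", fname), "w").close()

found = find_by_suffix(root, ".bf")
```

Here a substring search for `bf` would match all three files, while the suffix match returns only the one ending in `.bf`.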

In [0]:
bf_files = !find {bayenv_dir} | grep bf
len(bf_files)

In [0]:
bf_files[0:10]

In [0]:
bf_data = {}
for b in bf_files:
    d = open(b).readlines()
    d = d[-1].strip().split("\t")[1:]
    if len(d) == 12:
        bf_data[os.path.basename(b).replace(".txt.bf","")] = d

In [0]:
bf = pd.DataFrame(bf_data).T.astype(float)
bf.shape

In [0]:
bf.head()

In [0]:
plt.scatter(bf.ix[:,1], bf.ix[:,2])
plt.xlabel("Spearman")
plt.ylabel("Pearson")
plt.show()

plt.scatter(bf.ix[:,1], bf.ix[:,0])
plt.xlabel("Spearman")
plt.ylabel("Bayes factor")
plt.show()

##Outlier analysis

In [0]:
def get_outliers(df, key, num_std):
    if key == "bf":
        key = 0
    elif key == "rho":
        key = 1  
    outliers = {}   
    ai = 0
    for i in xrange(key, len(df.columns), 3):
        d = df.ix[:,i]
        d_std = np.std(d)
        d_mean = np.mean(d)
        cutoffs = [d_mean + (num_std*d_std), d_mean - (num_std*d_std)]
        env = ai_cols[ai]
        outliers[env] = d[(d >= cutoffs[0]) | (d <= cutoffs[1])]
        ai += 1
    return outliers
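
The cutoff used in `get_outliers` is simply the mean plus or minus a chosen number of standard deviations. The sketch below applies the same rule to a made-up dictionary of per-SNP values; `outliers_by_sd` and the SNP names are hypothetical, and the standard deviation is the population one, as with `np.std`.

```python
import math

def outliers_by_sd(values, num_std):
    """Keep entries at least num_std population SDs away from the mean."""
    n = float(len(values))
    mean = sum(values.values()) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values.values()) / n)
    hi = mean + num_std * sd
    lo = mean - num_std * sd
    return {k: v for k, v in values.items() if v >= hi or v <= lo}

# Twenty unremarkable made-up correlations plus one extreme value.
rho = {"snp%02d" % i: 0.01 * (i % 5 - 2) for i in range(20)}
rho["snp_x"] = 0.9
flagged = outliers_by_sd(rho, num_std=3)
```

Note that an extreme value inflates the standard deviation it is judged against, so with heavy-tailed statistics such as Bayes factors the choice of `num_std` is a judgment call.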

In [0]:
def plot_outliers(df, key, num_std):
    if key == "bf":
        key = 0
    elif key == "rho":
        key = 1   
    ai = 0
    for i in xrange(key, len(df.columns), 3):
        d = df.ix[:,i]
        d_std = np.std(d)
        d_mean = np.mean(d)
        env = ai_cols[ai]
        ax = plt.gca()
        if key == 0:
            ax.set_yscale('log')
        plt.hist(d, bins=100)
        plt.xlim(np.min(d), d_mean+(num_std*d_std))
        plt.title("%s $\mu = %.4f \pm %.4f [%.4f, %.4f]$" % (env,
                                                            d_mean,
                                                            d_std,
                                                            np.min(d),
                                                            np.max(d)))
        plt.show()
        ai += 1

In [0]:
plot_outliers(bf, "bf", 20)

In [0]:
plot_outliers(bf, "rho", 6)

In [0]:
bf_outliers = get_outliers(bf, "bf", 6)    
rho_outliers = get_outliers(bf, "rho", 3)

In [0]:
%%R
library(VennDiagram)

In [0]:
def draw_venn(outliers, title):
    keys = sorted(list(outliers.keys()))
    a1 = set(outliers[keys[0]].index)
    a2 = set(outliers[keys[1]].index)
    a3 = set(outliers[keys[2]].index)
    a4 = set(outliers[keys[3]].index)
    area1 = len(a1)
    area2 = len(a2)
    area3 = len(a3) 
    area4 = len(a4)
    n12 = len(a1.intersection(a2))
    n13 = len(a1.intersection(a3))
    n14 = len(a1.intersection(a4))
    n23 = len(a2.intersection(a3))
    n24 = len(a2.intersection(a4))
    n34 = len(a3.intersection(a4))
    n123 = len(set.intersection(a1, a2, a3))
    n124 = len(set.intersection(a1, a2, a4))
    n134 = len(set.intersection(a1, a3, a4))
    n234 = len(set.intersection(a2, a3, a4))
    n1234 = len(set.intersection(a1, a2, a3, a4))
    venn = "venn_%s.png" % title.replace(" ", "_")
    r("library(VennDiagram)")
    r("png('%s')" % venn)
    r('draw.quad.venn')(area1, 
                  area2,
                  area3,
                  area4,
                  n12,
                  n13,
                  n14,
                  n23,
                  n24,
                  n34,
                  n123,
                  n124,
                  n134,
                  n234,
                  n1234,
                       category=keys)
    r('dev.off()')
    return venn
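
The overlap counts passed to `draw.quad.venn` are written out by hand in `draw_venn`. As a cross-check, the same numbers can be generated for any number of sets with `itertools.combinations`. This is only a sketch: `overlap_sizes`, the set names, and the SNP labels are invented for illustration.

```python
from itertools import combinations

def overlap_sizes(named_sets):
    """Map each combination of set names to the size of their intersection."""
    names = sorted(named_sets)
    sizes = {}
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            shared = set.intersection(*[named_sets[n] for n in combo])
            sizes[combo] = len(shared)
    return sizes

# Invented outlier sets for four environmental variables.
hits = {
    "AI_1": {"s1", "s2", "s3"},
    "AI_2": {"s2", "s3"},
    "AI_3": {"s3", "s4"},
    "AI_4": {"s3"},
}
sizes = overlap_sizes(hits)
```

With four sets this yields 15 counts: the 4 areas, 6 pairwise overlaps, 4 three-way overlaps, and 1 four-way overlap, which are the 15 numeric arguments in the call above.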

In [0]:
Image(draw_venn(bf_outliers, "Bayes factor outliers"))

In [0]:
Image(draw_venn(rho_outliers, "Rho outliers"))

In [0]:
combined_outliers = {}
for key in bf_outliers:
    a = bf_outliers[key].index
    b = rho_outliers[key].index
    combined_outliers[key] = pd.Series(index=a.intersection(b))

In [0]:
Image(draw_venn(combined_outliers, "combined"))

In [0]:
boxplot_data = {}
for key, val in bf_outliers.items():
    val = val.sort(inplace=False, ascending=False)
    boxplot_data[key] = {val.index[0]: val[0]}

In [0]:
boxplot_data

In [0]:
#bayenv_df_ai_basegt = bayenv_df_ai.apply(convert_to_snpassoc)
for env in boxplot_data:
    for snp in boxplot_data[env]:
        vals = {}
        for gt, group in bayenv_df_ai.groupby(snp):
            if not gt == 'NA':
                vals[gt.replace("/", "")] = group[env]
        vals = pd.DataFrame(vals, dtype=float)
        vals.index.name = env

        sns.boxplot([vals[x].dropna() for x in vals], 
                    names=vals.columns)
        plt.title("%s/%s (%.4f)" % (snp, vals.index.name, boxplot_data[env][snp]))
        plt.show()

        sns.violinplot([vals[x].dropna() for x in vals], 
                    names=vals.columns)
        plt.title("%s/%s (%.4f)" % (snp, vals.index.name, boxplot_data[env][snp]))
        plt.show()

In [0]:
perloc = hdf.get("perloc").T

In [0]:
xtx_files = !find {bayenv_dir} | grep xtx

In [0]:
xtx_data = {}
for f in xtx_files:
    f = open(f).readlines()[-1].split()
    f[0] = f[0].replace(".txt", "")
    xtx_data[f[0]] = float(f[1])

In [0]:
xtx = pd.Series(xtx_data)

In [0]:
xtx_perloc = pd.concat((xtx, perloc), axis=1)

In [0]:
cols = ["xtx"]
cols.extend(xtx_perloc.columns[1:])
xtx_perloc.columns = cols
xtx_perloc = xtx_perloc.dropna()
xtx_plot_data = xtx_perloc[xtx_perloc.Fst < 0.2]

In [0]:
plt.scatter(xtx_plot_data.xtx, xtx_plot_data.Fst)
plt.title("$X^{T}\!X$ vs. $F_{ST}$ for $n = %d$ loci" % (len(xtx_plot_data)))
plt.xlabel("$X^T\!X$")
plt.ylabel("$F_{ST}$")
m, b, r, p, se = sp.stats.linregress(xtx_plot_data.xtx.values, xtx_plot_data.Fst.values)
plt.plot(xtx_plot_data.xtx, (xtx_plot_data.xtx*m + b), c="r", lw=1)
plt.text(11, -0.075, "$y = %.4fx %+.4f \ (r = %.4f, p = %.4f)$" % (m, b, r, p), fontsize=16)
plt.show()