Author: Dan Shea  
Date: 2019.08.15

# Inter-chromosomal Linkage Disequilibrium testing of common SNPs

Next, we will compare the set of SNPs common to all 20 founders (3,745 SNPs) by partitioning the genome (by chromosome) into $\mathscr{K}$ partitions. In rice this yields $| \mathscr{K} | = 12$. All of the SNPs $\mathscr{s}_{i} \in \mathscr{k}_{1}$ are then compared to all SNPs $\mathscr{s}_{j} \in {\mathscr{K} - \mathscr{k}_{1}}$. This yields all $\langle \mathscr{s}_{i}, \mathscr{s}_{j} \rangle$ pairs of SNPs involving the first chromosome. Since order is not important, we can proceed to the next chromosome $\mathscr{k}_{2}$ and compare SNPs $\mathscr{s}_{i} \in \mathscr{k}_{2}$ with $\mathscr{s}_{j} \in {\mathscr{K} - \mathscr{k}_{1} - \mathscr{k}_{2}}$. Once we have done this for the first 11 chromosomes, we have generated all possible SNP combinations between SNPs on different chromosomes.
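
As a quick illustration of the pairing scheme described above, the sketch below enumerates each unordered pair of distinct chromosomes exactly once. It only builds chromosome labels in the `chrNN` style; the use of `itertools.combinations` is an illustrative choice and not necessarily how the analysis cells themselves iterate.

```python
from itertools import combinations

# Chromosome labels chr01 .. chr12 (rice has 12 chromosomes)
chroms = ['chr{:02d}'.format(n) for n in range(1, 13)]

# Every unordered pair of distinct chromosomes appears exactly once,
# which mirrors "compare k1 with the rest, then k2 with the rest minus k1, ..."
chrom_pairs = list(combinations(chroms, 2))

print(len(chrom_pairs))   # number of chromosome pairs
print(chrom_pairs[0])     # first pair
print(chrom_pairs[-1])    # last pair
```

With 12 chromosomes this gives 66 chromosome pairs, and the last chromosome never needs to start a comparison of its own, which is why stopping after the first 11 is enough.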

In the absence of _Linkage Disequilibrium_ (LD), we expect each of the two genotypes at a single SNP to be observed with probability $P(x) = 0.5$. Since each position in our dimer (i.e., the $\langle \mathscr{s}_{i}, \mathscr{s}_{j} \rangle$ pair) is an _independent_ event, the _expected frequency of observation_ ($E(f_{obs})$) for any given dimer is simply $P(x,y) = P(x) \cdot P(y) = 0.5 \cdot 0.5 = 0.25$.

We have an expected $f_{obs}$ and our actual $f^{\prime}_{obs}$, so we may perform a $\chi^{2}$ goodness-of-fit test to examine whether the observed dimer genotypes are in line with the expectation that there is no LD between the two loci. (This expectation of no LD between loci on two different chromosomes is our $H_{0}$ for the $\chi^{2}$ test.)
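
To make the test concrete, here is a minimal standard-library sketch of the goodness-of-fit calculation for four dimer classes with equal expected counts. The counts are made up for illustration, and the helper names `chisq_gof_uniform` and `chi2_sf_3df` are hypothetical. The p-value uses the closed form of the $\chi^{2}$ survival function for 3 degrees of freedom (four classes minus one).

```python
import math

def chisq_gof_uniform(f_obs):
    """Chi-square statistic for observed counts against equal expected counts."""
    total = sum(f_obs)
    expected = total / len(f_obs)
    return sum((o - expected) ** 2 / expected for o in f_obs)

def chi2_sf_3df(x):
    """P(X >= x) for a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Hypothetical counts for the AA, AB, BA and BB dimers (not taken from the data)
f_obs = [30, 20, 20, 30]
stat = chisq_gof_uniform(f_obs)
pvalue = chi2_sf_3df(stat)
print(stat, pvalue)
```

Here each class has an expected count of 25, so the statistic is $4 \cdot \frac{5^{2}}{25} = 4$, which is not significant at $\alpha = 0.05$.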

Given that our SNP set comprises 3,745 loci and we want to examine pairs of SNPs, comparing every SNP to every other SNP would combinatorially require $C_{k}(n) = \frac{n!}{k!\cdot(n-k)!}$ comparisons, i.e. $C_{2}(3,745) = \frac{3,745 \cdot 3,744}{2} = 7,010,640$. However, our actual number of comparisons will be smaller than this, because we further restrict pairs to be on different chromosomes.
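
The reduction can be counted directly: take all pairs and subtract the pairs that fall within a single chromosome. The sketch below checks the figure above and then applies the subtraction to a toy set of per-chromosome SNP counts. The helper `n_interchromosomal_pairs` and the toy counts are hypothetical, since the real per-chromosome counts are not listed here.

```python
import math

def n_interchromosomal_pairs(snps_per_chrom):
    """All SNP pairs minus the pairs that lie on the same chromosome."""
    total = sum(snps_per_chrom)
    all_pairs = math.comb(total, 2)
    within_pairs = sum(math.comb(n, 2) for n in snps_per_chrom)
    return all_pairs - within_pairs

# Upper bound quoted in the text: every SNP against every other SNP
print(math.comb(3745, 2))

# Toy example: three chromosomes carrying 3, 2 and 1 SNPs
toy_counts = [3, 2, 1]
print(n_interchromosomal_pairs(toy_counts))  # 3*2 + 3*1 + 2*1 pairs
```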

After we complete our tests and receive p-values, we FDR-correct them using the `fdrcorrection()` function from `statsmodels.stats.multitest`. The resulting FDR-corrected p-values (i.e., q-values) are then compared to $\alpha = 0.05$ and deemed significant if $q<\alpha$.
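
For intuition, the Benjamini-Hochberg procedure (which I take to be the default method of `fdrcorrection()`) can be sketched in a few lines of plain Python. The function `benjamini_hochberg` and the p-values below are hypothetical and only illustrate how q-values are derived. The analysis itself relies on the statsmodels implementation.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return (significant flags, q-values) in the original order of pvalues."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each q-value is the minimum
    # of p * m / rank over its own rank and all larger ranks
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    significant = [q <= alpha for q in qvalues]
    return significant, qvalues

# Made-up p-values for four tests
significant, qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(significant)
print(qvalues)
```

Because the q-values are forced to be monotone in the sorted p-values, a test can never end up with a smaller q-value than a test with a smaller p-value.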

Inter-chromosomal Linkage Disequilibrium can provide evidence of epistatic interaction. _Epistasis_ occurs when the genotype at one gene influences the phenotypic effect of another gene.

OK, enough of me blathering, let's get started!

In [6]:
import pandas as pd
from scipy import stats
import numpy as np
from statsmodels.stats import multitest
import os
import os.path
from collections import OrderedDict, namedtuple

In [2]:
samples = ['N01','N03','N04','N05','N06','N07','N08','N09','N10','N11',
           'N12','N13','N14','N16','N17','N18','N19','N20','N21','N22',]
founders = ['KASALATH','KEIBOBA','SHONI','TUPA_121-3','SURJAMUKHI','RATUL','BADARI_DHAN','KALUHEENATI','JAGUARY','REXMONT',
            'URASAN','TUPA_729','DEE_JIAO_HUA_LUO','NERICA_1','TAKANARI','C8005','MOUKOTOU','NORTAI','SESIA','HAYAYUKI',]
datadirs = ['_'.join([x, y]) for x, y in zip(samples, founders)]

In [3]:
allele_frequency_files = [os.path.join('beagle_output', x, y+'_allele_frequencies.tsv') for x, y in zip(datadirs, samples)]
genotype_files = [os.path.join('beagle_output', x, y+'_genotypes.tsv') for x, y in zip(datadirs, samples)]

In [4]:
frequency_dfs = OrderedDict()
for key, value in zip(samples, allele_frequency_files):
    frequency_dfs[key] = pd.read_csv(value, sep='\t', index_col=0)
    
genotype_dfs = OrderedDict()
for key, value in zip(samples, genotype_files):
    genotype_dfs[key] = pd.read_csv(value, sep='\t', index_col=0)


In [5]:
data_dfs = OrderedDict()
for key in samples:
    data_dfs[key] = pd.concat([genotype_dfs[key], frequency_dfs[key]], axis=1)

In [6]:
# We no longer require these dfs, so let's delete them to free up some memory
del frequency_dfs
del genotype_dfs

In [7]:
# Read in the merged data that forms the 3,745 common SNPs
merged_data = pd.read_csv('beagle_output/Common_SNPS_all_founders.tsv', sep='\t', index_col=0)

In [8]:
# We only need the common loci, we extract the other information from our previously constructed data_dfs
merged_data = merged_data.loc[:, ['CHROM', 'POS']]

In [20]:
filtered_data_dfs = OrderedDict()
for key in samples:
    filtered_data_dfs[key] = pd.merge(merged_data, data_dfs[key], how='inner', on=['CHROM', 'POS'])

In [21]:
# We quickly sanity check this by ensuring the shapes are all 3,745 rows long
for key in filtered_data_dfs.keys():
    print('{} is {}'.format(key, filtered_data_dfs[key].shape))

N01 is (3745, 188)
N03 is (3745, 94)
N04 is (3745, 142)
N05 is (3745, 219)
N06 is (3745, 95)
N07 is (3745, 81)
N08 is (3745, 96)
N09 is (3745, 264)
N10 is (3745, 163)
N11 is (3745, 204)
N12 is (3745, 158)
N13 is (3745, 104)
N14 is (3745, 44)
N16 is (3745, 55)
N17 is (3745, 156)
N18 is (3745, 267)
N19 is (3745, 266)
N20 is (3745, 261)
N21 is (3745, 267)
N22 is (3745, 260)


In [22]:
# Define list of strings for all chromosome ids
chroms = [''.join(['chr','{0:02d}'.format(n)]) for n in range(1, 13)]

In [23]:
for key in samples:
    s_df = filtered_data_dfs[key].copy()
    filtered_data_dfs[key] = OrderedDict()
    for c in chroms:
        mask = s_df.CHROM == c
        filtered_data_dfs[key][c] = s_df.loc[mask, :]

In [24]:
for key in samples:
    for c in chroms:
        filtered_data_dfs[key][c] = pd.concat([filtered_data_dfs[key][c].loc[:,['CHROM', 'POS']],
                                               filtered_data_dfs[key][c].iloc[:, 11:-6]], axis=1)

In [26]:
# We no longer need the data_dfs, so let's free up some memory
del data_dfs

In [36]:
results = OrderedDict()
for key in samples:
    results[key] = list()
    c = chroms[:]
    while c:
        current_c = c.pop(0)
        for other_c in c:
            for row_i in range(0, filtered_data_dfs[key][current_c].shape[0]):
                for row_j in range(0, filtered_data_dfs[key][other_c].shape[0]):
                    pos_i = filtered_data_dfs[key][current_c].iloc[row_i ,1]
                    pos_j = filtered_data_dfs[key][other_c].iloc[row_j ,1]
                    dimers = filtered_data_dfs[key][current_c].iloc[row_i, 2:].to_numpy() + \
                             filtered_data_dfs[key][other_c].iloc[row_j, 2:].to_numpy()
                    f_obs = [0, 0, 0, 0]  # index 0=AA, 1=AB, 2=BA, 3=BB
                    for gt in dimers:
                        if gt == 'AA':
                            f_obs[0] += 1
                        elif gt == 'AB':
                            f_obs[1] += 1
                        elif gt == 'BA':
                            f_obs[2] += 1
                        elif gt == 'BB':
                            f_obs[3] += 1
                    f_exp = np.array([0.25, 0.25, 0.25, 0.25]) * len(dimers)
                    chisquare_val, pvalue = stats.chisquare(f_obs, f_exp)
                    results[key].append([current_c, pos_i, other_c, pos_j,
                                                    f_obs[0], f_obs[1], f_obs[2], f_obs[3],
                                                   chisquare_val, pvalue])
    # Now we've done all comparisons for a given founder
    # We can construct the dataframe for the results and perform FDR correction on the pvalues
    results[key] = pd.DataFrame(data=results[key], columns=['CHROM_a', 'POS_a', 'CHROM_b', 'POS_b',
                                                            'AA_obs', 'AB_obs', 'BA_obs', 'BB_obs',
                                                            'chisquare', 'pvalue'])
    significant, qvalue = multitest.fdrcorrection(results[key].loc[:, 'pvalue'].to_numpy())
    results[key]['qvalue'] = qvalue
    results[key]['significant'] = significant
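
One detail of the cell above is worth spelling out: `f_exp` is scaled by `len(dimers)`, while `f_obs` only tallies the four classes AA, AB, BA and BB. If any genotype call is something other than `A` or `B`, the observed and expected totals differ. For example, the first `N01` row shown further down has counts summing to 149, yet its $\chi^{2}$ value of 2.894737 corresponds to an expected count of 171/4 per class. As far as I am aware, newer SciPy releases raise a `ValueError` from `stats.chisquare` when the two totals disagree. The sketch below is an alternative, not what the cell above does: it derives the expected counts from the number of classified dimers and reports how many were skipped. The helper `dimer_counts` and the genotype lists are hypothetical.

```python
from collections import Counter

DIMER_CLASSES = ('AA', 'AB', 'BA', 'BB')

def dimer_counts(genotypes_i, genotypes_j):
    """Count AA/AB/BA/BB dimers and return (f_obs, f_exp, n_skipped)."""
    # Concatenate the two genotype calls of each line into a dimer string
    counts = Counter(a + b for a, b in zip(genotypes_i, genotypes_j))
    f_obs = [counts[d] for d in DIMER_CLASSES]
    n_classified = sum(f_obs)
    # Expected counts use only the dimers that fell into one of the four classes
    f_exp = [n_classified / 4] * 4
    n_skipped = sum(counts.values()) - n_classified
    return f_obs, f_exp, n_skipped

# Made-up calls for two loci across eight lines; 'H' stands for any call
# that is neither A nor B
locus_i = ['A', 'A', 'B', 'B', 'A', 'H', 'B', 'A']
locus_j = ['A', 'B', 'A', 'B', 'A', 'A', 'B', 'H']
f_obs, f_exp, n_skipped = dimer_counts(locus_i, locus_j)
print(f_obs, f_exp, n_skipped)
```

With this variant `sum(f_obs) == sum(f_exp)` always holds, so the test asks only whether the classified dimers are evenly distributed. Whether skipped calls should instead count against the fit is a modelling decision that this sketch does not settle.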

In [37]:
# Now that we have the statistical analysis completed, dump the results to a tsv file so we can re-load later if necessary
for key, f in zip(samples, founders):
    results[key].to_csv('interchromosomal_linkage_analysis/{}_{}_interchromosomal_ld.tsv'.format(key, f),
                        sep='\t')

##### Below this point we can run the next cell to re-load the data from file as needed.
This cell is a placeholder and a logical point to re-start the analysis from if the Jupyter kernel is halted for any reason.
If you're going through this cell by cell, the cell below *does not need to be run*; all it does is re-load the data calculated above from the files it was written out to upon completion of the $\chi^{2}$ testing.

In [3]:
# Re-load the data from file into results
samples = ['N01','N03','N04','N05','N06','N07','N08','N09','N10','N11',
           'N12','N13','N14','N16','N17','N18','N19','N20','N21','N22',]
founders = ['KASALATH','KEIBOBA','SHONI','TUPA_121-3','SURJAMUKHI','RATUL','BADARI_DHAN','KALUHEENATI','JAGUARY','REXMONT',
            'URASAN','TUPA_729','DEE_JIAO_HUA_LUO','NERICA_1','TAKANARI','C8005','MOUKOTOU','NORTAI','SESIA','HAYAYUKI',]
datadirs = ['_'.join([x, y]) for x, y in zip(samples, founders)]
results = OrderedDict()
for key, f in zip(samples, founders):
    results[key] = pd.read_csv('interchromosomal_linkage_analysis/{}_{}_interchromosomal_ld.tsv'.format(key, f),
                               sep='\t', index_col=0)


In [4]:
# Did things get loaded?
results['N01'][0:10]

Unnamed: 0,CHROM_a,POS_a,CHROM_b,POS_b,AA_obs,AB_obs,BA_obs,BB_obs,chisquare,pvalue,qvalue,significant
0,chr01,6110684,chr02,891366,37,38,36,38,2.894737,0.408141,0.424731,False
1,chr01,6110684,chr02,1920981,35,40,40,35,3.163743,0.367062,0.387602,False
2,chr01,6110684,chr02,2404890,31,45,40,33,5.748538,0.124506,0.161255,False
3,chr01,6110684,chr02,2672932,32,43,40,33,5.105263,0.16425,0.199813,False
4,chr01,6110684,chr02,2687729,32,43,40,33,5.105263,0.16425,0.199813,False
5,chr01,6110684,chr02,2741043,31,45,40,33,5.748538,0.124506,0.161255,False
6,chr01,6110684,chr02,2827256,30,46,40,33,6.450292,0.091644,0.129735,False
7,chr01,6110684,chr02,3286095,33,44,40,33,4.660819,0.198386,0.232054,False
8,chr01,6110684,chr02,3349933,33,44,39,34,4.380117,0.223236,0.25565,False
9,chr01,6110684,chr02,3463117,33,44,40,32,5.140351,0.161804,0.197259,False


In [5]:
# Only retrieve significant LD results
# We only want results where the q-value is < 0.001 (stricter than the default alpha of 0.05 used by fdrcorrection)
significant_results = OrderedDict()
for key in samples:
    significant_results[key] = results[key].loc[(results[key].significant == True) & (results[key].qvalue < 0.001), :].copy()

In [7]:
# Do we have data?
significant_results['N01'][0:10]

Unnamed: 0,CHROM_a,POS_a,CHROM_b,POS_b,AA_obs,AB_obs,BA_obs,BB_obs,chisquare,pvalue,qvalue,significant
18387,chr01,20053392,chr02,31482718,36,24,33,64,22.076023,6.3e-05,0.000523,True
18388,chr01,20053392,chr02,32661204,34,27,32,64,20.859649,0.000113,0.000865,True
18389,chr01,20053392,chr02,32676129,34,27,32,64,20.859649,0.000113,0.000865,True
18623,chr01,20103165,chr02,31482718,36,24,33,64,22.076023,6.3e-05,0.000523,True
18624,chr01,20103165,chr02,32661204,34,27,32,64,20.859649,0.000113,0.000865,True
18625,chr01,20103165,chr02,32676129,34,27,32,64,20.859649,0.000113,0.000865,True
18859,chr01,20144421,chr02,31482718,36,24,33,64,22.076023,6.3e-05,0.000523,True
18860,chr01,20144421,chr02,32661204,34,27,32,64,20.859649,0.000113,0.000865,True
18861,chr01,20144421,chr02,32676129,34,27,32,64,20.859649,0.000113,0.000865,True
19095,chr01,20154857,chr02,31482718,36,24,33,64,22.076023,6.3e-05,0.000523,True


In [18]:
# Note: Originally, I was calculating the mean (centroid), but circos wants start/end positions, so I changed it. - djs
# We want to obtain coords for all loci showing interchromosomal LD
max_distance = 5000000
centroids = OrderedDict()
for key in samples:
    rows = significant_results[key].itertuples(index=False)
    # Prime zee pump!
    try:
        prev_row = next(rows)
    except StopIteration:
        # Iterator was empty so we call continue to move to the next sample
        continue
    # OK if we got here this sample should get a key in the OrderedDict()
    centroids[key] = list()
    # Set the initial locus coords
    block = {'CHROM_a': prev_row.CHROM_a, 'POS_a_start': prev_row.POS_a, 'POS_a_end': prev_row.POS_a,
             'CHROM_b': prev_row.CHROM_b, 'POS_b_start': prev_row.POS_b, 'POS_b_end': prev_row.POS_b,}
    for curr_row in rows:
        if (curr_row.CHROM_a == prev_row.CHROM_a) and (curr_row.CHROM_b == prev_row.CHROM_b):
            if (curr_row.POS_a - prev_row.POS_a <= max_distance) and (np.abs(curr_row.POS_b - prev_row.POS_b) <= max_distance):
                # Expand your locus to include this marker
                # Note: Due to the way the data is sorted the "a" is monotonically increasing
                block['POS_a_end'] = curr_row.POS_a
                # Note: But the "b" data increases and then "resets" at each new "a"
                #       This is due to the nature of how the combinatoric comparisons were performed.
                #       Thus we need to ensure that we should actually be setting new start or end coordinates
                #       for the "b" side of this block.
                if curr_row.POS_b > block['POS_b_end']:
                    block['POS_b_end'] = curr_row.POS_b
                if curr_row.POS_b < block['POS_b_start']:
                    block['POS_b_start'] = curr_row.POS_b
            else:
                # We just done entered a new locus dudebro!
                # Add the coord-pair for the locus to the list of coord-pairs for the sample
                #centroids[key].append([start_pos.CHROM_a, np.mean([start_pos.POS_a, end_pos.POS_a]),
                #                       start_pos.CHROM_b, np.mean([start_pos.POS_b, end_pos.POS_b])])
                centroids[key].append([block['CHROM_a'], block['POS_a_start'], block['POS_a_end'],
                                       block['CHROM_b'], block['POS_b_start'], block['POS_b_end']])
                # Now reset the start_pos & end_pos, because we are at a new locus
                block['CHROM_a'] = curr_row.CHROM_a
                block['POS_a_start'] = curr_row.POS_a
                block['POS_a_end'] = curr_row.POS_a
                block['CHROM_b'] = curr_row.CHROM_b
                block['POS_b_start'] = curr_row.POS_b
                block['POS_b_end'] = curr_row.POS_b
        else:
            # Walked into a new chromosome, add the coord-pair for the locus to the list of coord-pairs for the sample
            #centroids[key].append([start_pos.CHROM_a, np.mean([start_pos.POS_a, end_pos.POS_a]),
            #                       start_pos.CHROM_b, np.mean([start_pos.POS_b, end_pos.POS_b])])
            centroids[key].append([block['CHROM_a'], block['POS_a_start'], block['POS_a_end'],
                                   block['CHROM_b'], block['POS_b_start'], block['POS_b_end']])
            # Now reset the start & end_pos, because we are at a new locus
            block['CHROM_a'] = curr_row.CHROM_a
            block['POS_a_start'] = curr_row.POS_a
            block['POS_a_end'] = curr_row.POS_a
            block['CHROM_b'] = curr_row.CHROM_b
            block['POS_b_start'] = curr_row.POS_b
            block['POS_b_end'] = curr_row.POS_b
        # No matter what else happened up above, the curr_row is now the prev_row
        prev_row = curr_row
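    # The block that is still open when we run out of rows has not been stored yet, so add it now
    centroids[key].append([block['CHROM_a'], block['POS_a_start'], block['POS_a_end'],
                           block['CHROM_b'], block['POS_b_start'], block['POS_b_end']])
    # Now we're done with this sample, so let's make this list of lists into a DataFrame
    centroids[key] = pd.DataFrame(data=centroids[key], columns=['CHROM_a', 'POS_a_start', 'POS_a_end',
                                                                'CHROM_b', 'POS_b_start', 'POS_b_end'])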
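
The block-merging logic above can also be condensed into a small stand-alone helper, which makes it easier to test on a handful of rows. This is a sketch under the assumption that rows arrive as `(CHROM_a, POS_a, CHROM_b, POS_b)` tuples in the same order as the results table. The name `merge_blocks` is hypothetical. The first four rows below are copied from the `N01` table above and the last one is made up to force a chromosome change.

```python
def merge_blocks(rows, max_distance=5000000):
    """Merge consecutive (chrom_a, pos_a, chrom_b, pos_b) rows into blocks.

    Each block is [chrom_a, a_start, a_end, chrom_b, b_start, b_end].
    """
    blocks = []
    block = None
    prev = None
    for chrom_a, pos_a, chrom_b, pos_b in rows:
        same_block = (
            prev is not None
            and (chrom_a, chrom_b) == (prev[0], prev[2])
            and pos_a - prev[1] <= max_distance
            and abs(pos_b - prev[3]) <= max_distance
        )
        if same_block:
            # "a" positions only increase; "b" positions can reset, so track min and max
            block[2] = pos_a
            block[4] = min(block[4], pos_b)
            block[5] = max(block[5], pos_b)
        else:
            if block is not None:
                blocks.append(block)
            block = [chrom_a, pos_a, pos_a, chrom_b, pos_b, pos_b]
        prev = (chrom_a, pos_a, chrom_b, pos_b)
    # Store the block that is still open once the rows run out
    if block is not None:
        blocks.append(block)
    return blocks

rows = [
    ('chr01', 20053392, 'chr02', 31482718),
    ('chr01', 20053392, 'chr02', 32661204),
    ('chr01', 20053392, 'chr02', 32676129),
    ('chr01', 20103165, 'chr02', 31482718),
    ('chr03', 100, 'chr05', 200),  # made-up row on a different chromosome pair
]
blocks = merge_blocks(rows)
for b in blocks:
    print(b)
```

Keeping the "start a new block" step in one place removes the duplicated reset code, and the final `append` ensures the block still open at the end of the rows is stored.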

In [15]:
centroids.keys()

odict_keys(['N01', 'N03', 'N04', 'N05', 'N06', 'N07', 'N08', 'N09', 'N10', 'N11', 'N14', 'N16', 'N17', 'N18', 'N20', 'N21', 'N22'])

In [19]:
centroids['N01'][0:10]

Unnamed: 0,CHROM_a,POS_a_start,POS_a_end,CHROM_b,POS_b_start,POS_b_end
0,chr01,20053392,22113938,chr02,31482718,34030257
1,chr01,22148052,22148052,chr02,18646610,18686916
2,chr01,22148052,22148052,chr02,29563080,34030257
3,chr01,22148902,22148902,chr02,18646610,18686916
4,chr01,22148902,22277753,chr02,29563080,34030257
5,chr01,22313907,22313907,chr02,18646610,18686916
6,chr01,22313907,22313907,chr02,29370785,34030257
7,chr01,22319807,22319807,chr02,18646610,18686916
8,chr01,22319807,22336584,chr02,29370785,34030257
9,chr01,22352813,22352813,chr02,18646610,18686916


In [22]:
for key in centroids.keys():
    centroids[key].to_csv('interchromosomal_linkage_analysis/{}_ICLD_5M_window_centroids.tsv'.format(key), sep='\t')

In [21]:
centroids['N01']

Unnamed: 0,CHROM_a,POS_a_start,POS_a_end,CHROM_b,POS_b_start,POS_b_end
0,chr01,20053392,22113938,chr02,31482718,34030257
1,chr01,22148052,22148052,chr02,18646610,18686916
2,chr01,22148052,22148052,chr02,29563080,34030257
3,chr01,22148902,22148902,chr02,18646610,18686916
4,chr01,22148902,22277753,chr02,29563080,34030257
5,chr01,22313907,22313907,chr02,18646610,18686916
6,chr01,22313907,22313907,chr02,29370785,34030257
7,chr01,22319807,22319807,chr02,18646610,18686916
8,chr01,22319807,22336584,chr02,29370785,34030257
9,chr01,22352813,22352813,chr02,18646610,18686916
