Author: Dan Shea  
Date: 2019.09.11  
### Repulsion vs. Coupling of genotypes at loci showing interchromosomal LD
We compared the set of SNPs common to all 20 founders (3,745 SNPs) by partitioning the genome by chromosome into a set of partitions $\mathscr{K}$. In rice this yields $| \mathscr{K} | = 12$. All of the SNPs $\mathscr{s}_{i} \in \mathscr{k}_{1}$ were first compared to all SNPs $\mathscr{s}_{j} \in {\mathscr{K} - \mathscr{k}_{1}}$, yielding every $\langle \mathscr{s}_{i}, \mathscr{s}_{j} \rangle$ pair that involves chromosome 1. Since order is not important, we can then proceed to the next chromosome $\mathscr{k}_{2}$ and compare its SNPs $\mathscr{s}_{i} \in \mathscr{k}_{2}$ with $\mathscr{s}_{j} \in {\mathscr{K} - \mathscr{k}_{1} - \mathscr{k}_{2}}$. We did this for the first 11 chromosomes, generating all possible combinations of SNPs on different chromosomes.
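
The pairing scheme described above can be sketched with `itertools`. This is a minimal illustration and not the code that produced the analysis; the `snps_by_chrom` dictionary and its positions are hypothetical toy values.

```python
from itertools import combinations, product

# Hypothetical toy input: chromosome name -> SNP positions on that chromosome.
snps_by_chrom = {
    "chr01": [100, 200],
    "chr02": [150],
    "chr03": [300, 400],
}


def interchromosomal_pairs(snps_by_chrom):
    """Yield every unordered pair of SNPs located on different chromosomes."""
    # combinations() visits each pair of chromosomes exactly once, so
    # (chr01, chr02) is produced but (chr02, chr01) is not. This mirrors
    # "order is not important" in the text.
    for chrom_a, chrom_b in combinations(snps_by_chrom, 2):
        for pos_a, pos_b in product(snps_by_chrom[chrom_a], snps_by_chrom[chrom_b]):
            yield (chrom_a, pos_a), (chrom_b, pos_b)


pairs = list(interchromosomal_pairs(snps_by_chrom))
print(len(pairs))  # 2*1 + 2*2 + 1*2 = 8
```

Because the last chromosome has already been paired with every earlier one, the outer loop never needs to start from it, which is why only the first 11 of 12 chromosomes act as the "a" side.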

In the absence of _Linkage Disequilibrium_ (LD), we expect either genotype at a single SNP to be observed with probability $P(x) = 0.5$. Since each position in our dimer (i.e., the $\langle \mathscr{s}_{i}, \mathscr{s}_{j} \rangle$ pair) is then an _independent_ event, the _expected frequency of observation_ ($E(f_{obs})$) for any given dimer is simply $P(x,y) = P(x) \cdot P(y) = 0.5 \cdot 0.5 = 0.25$.

We have an expected $f_{obs}$ and our actual $f^{\prime}_{obs}$, so we performed a $\chi^{2}$ goodness-of-fit test to check whether the observed dimer genotypes were in line with the expectation that there is no LD between the two loci. (This expectation of no LD between loci on two different chromosomes is our $H_{0}$ for the $\chi^{2}$ test.)
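
As a worked instance of this test, the sketch below computes the $\chi^{2}$ statistic by hand against an expected count of $0.25 \cdot N$ per dimer class. The counts are the `AA`/`AB`/`BA`/`BB` values of the first significant N01 row shown later in this notebook. The population size `n_total = 171` is an assumption: the source does not state it, and it is used here only because it reproduces the statistic recorded for that row. The function name is hypothetical.

```python
import math


def chisq_gof_equal_classes(observed, n_total):
    """Chi-square goodness-of-fit of four dimer counts against 0.25 * n_total each."""
    expected = 0.25 * n_total
    stat = sum((o - expected) ** 2 / expected for o in observed)
    # Survival function of a chi-square distribution with 3 degrees of freedom
    # (4 classes - 1), written in closed form. scipy.stats.chi2.sf(stat, 3)
    # returns the same value.
    pvalue = (math.erfc(math.sqrt(stat / 2))
              + math.sqrt(2 * stat / math.pi) * math.exp(-stat / 2))
    return stat, pvalue


# Observed (AA, AB, BA, BB) counts; n_total = 171 is an assumed population size.
stat, pvalue = chisq_gof_equal_classes([36, 24, 33, 64], n_total=171)
print(round(stat, 3), pvalue)
```

Under that assumption the statistic comes out at about 22.076, in line with the `chisquare` column for the same row.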

Given that our SNP set comprises 3,745 loci and we want to examine pairs of SNPs, comparing every SNP to every other SNP would combinatorially require $C_{k}(n) = \frac{n!}{k!\cdot(n-k)!}$ comparisons, which for $k = 2$ and $n = 3,745$ gives $C_{2}(3,745) = \frac{3,745 \cdot 3,744}{2} = 7,010,640$. However, our actual number of comparisons was smaller than this, because we further restricted pairs to be on different chromosomes.
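
The arithmetic above, and the reduction that follows from excluding same-chromosome pairs, can be checked in a few lines. This is only a sketch: the per-chromosome counts in `toy_counts` are hypothetical, since the real distribution of the 3,745 SNPs across chromosomes is not given here.

```python
from math import comb


def count_interchromosomal_pairs(snps_per_chrom):
    """Number of unordered SNP pairs whose members lie on different chromosomes."""
    total = sum(snps_per_chrom)
    all_pairs = comb(total, 2)
    # Pairs within a single chromosome are the ones we exclude.
    same_chrom_pairs = sum(comb(n, 2) for n in snps_per_chrom)
    return all_pairs - same_chrom_pairs


print(comb(3745, 2))  # 7010640, the upper bound quoted in the text

# Hypothetical per-chromosome SNP counts for three chromosomes.
toy_counts = [3, 2, 4]
print(count_interchromosomal_pairs(toy_counts))  # 36 - (3 + 1 + 6) = 26
```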

After completing the tests and obtaining p-values, we performed an FDR correction on them using the `fdrcorrection()` function from `statsmodels.stats.multitest`. The resulting FDR-corrected p-values (i.e., q-values) were then compared to $\alpha = 0.001$ and deemed significant if $q<\alpha$.
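
To make the correction step concrete, here is a hand-rolled sketch of the Benjamini-Hochberg procedure, which is what `fdrcorrection()` applies with its default `method='indep'`. It is not the library's code, the function name is hypothetical, and the p-values are toy numbers chosen for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0  # q-values are capped at 1
    # Walk from the largest p-value down, so each q-value is the minimum of
    # p * m / rank over its own rank and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


toy_pvalues = [0.01, 0.04, 0.03, 0.005]
qvalues = benjamini_hochberg(toy_pvalues)

alpha = 0.001
significant = [q < alpha for q in qvalues]
print(qvalues, significant)
```

With the strict threshold used in this notebook, none of the toy values would be called significant; the real tests need far smaller p-values to pass.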

Inter-chromosomal Linkage Disequilibrium can provide evidence of epistatic interaction. _Epistasis_ is the effect of one gene's genotype influencing the expression of another gene. However, another interesting thing to examine is whether the genotypes of the two positions are in __repulsion__ or __coupling__. That is to say, do the two loci tend to be homozygous for the same parent genotype (i.e., in coupling), or are they heterozygous, having the genotype of one parent at one SNP and the genotype of the other parent at the other SNP (i.e., in repulsion)? This is important because it can reveal epistatic interactions between genes where offspring viability is determined by genotype.

One example of this is seen in plant qualitative resistance to biotrophic pathogens. To defend themselves from pathogens, plants frequently feature a rapid hypersensitive response (HR). This is characterized by rapid and localized cell death (i.e., apoptosis) to restrict the pathogen's replication during the early stages of the infection (Pontier et al., 1998). HR-associated cell death is a kind of programmed cell death (PCD), and plants with lethal combinations of R genes can trigger HR in the absence of a pathogen infection, rendering the genotype non-viable.

By checking to see if particular regions between chromosomes are in _coupling_ or _repulsion_, we can help identify potential areas of interest to examine in more detail to see if genes within the regions are exhibiting genotypic skew.

Let's go!

In [1]:
import pandas as pd
from scipy import stats
import numpy as np
from statsmodels.stats import multitest
import os
import os.path
from collections import OrderedDict, namedtuple

In [2]:
samples = ['N01','N03','N04','N05','N06','N07','N08','N09','N10','N11',
           'N12','N13','N14','N16','N17','N18','N19','N20','N21','N22',]
founders = ['KASALATH','KEIBOBA','SHONI','TUPA_121-3','SURJAMUKHI','RATUL','BADARI_DHAN','KALUHEENATI','JAGUARY','REXMONT',
            'URASAN','TUPA_729','DEE_JIAO_HUA_LUO','NERICA_1','TAKANARI','C8005','MOUKOTOU','NORTAI','SESIA','HAYAYUKI',]
datadirs = ['_'.join([x, y]) for x, y in zip(samples, founders)]

In [3]:
# Re-load the data from the chi-square testing
results = OrderedDict()
for key, f in zip(samples, founders):
    results[key] = pd.read_csv('interchromosomal_linkage_analysis/{}_{}_interchromosomal_ld.tsv'.format(key, f),
                               sep='\t', index_col=0)



In [4]:
# Did things get loaded?
results['N01'][0:10]

Unnamed: 0,CHROM_a,POS_a,CHROM_b,POS_b,AA_obs,AB_obs,BA_obs,BB_obs,chisquare,pvalue,qvalue,significant
0,chr01,6110684,chr02,891366,37,38,36,38,2.894737,0.408141,0.424731,False
1,chr01,6110684,chr02,1920981,35,40,40,35,3.163743,0.367062,0.387602,False
2,chr01,6110684,chr02,2404890,31,45,40,33,5.748538,0.124506,0.161255,False
3,chr01,6110684,chr02,2672932,32,43,40,33,5.105263,0.16425,0.199813,False
4,chr01,6110684,chr02,2687729,32,43,40,33,5.105263,0.16425,0.199813,False
5,chr01,6110684,chr02,2741043,31,45,40,33,5.748538,0.124506,0.161255,False
6,chr01,6110684,chr02,2827256,30,46,40,33,6.450292,0.091644,0.129735,False
7,chr01,6110684,chr02,3286095,33,44,40,33,4.660819,0.198386,0.232054,False
8,chr01,6110684,chr02,3349933,33,44,39,34,4.380117,0.223236,0.25565,False
9,chr01,6110684,chr02,3463117,33,44,40,32,5.140351,0.161804,0.197259,False


In [5]:
# Great! Now we must filter for significance at q < alpha = 0.001
significant_results = OrderedDict()
for key in samples:
    significant_results[key] = results[key].loc[(results[key].significant == True) & (results[key].qvalue < 0.001), :].copy()

In [6]:
significant_results['N01'][0:10]

Unnamed: 0,CHROM_a,POS_a,CHROM_b,POS_b,AA_obs,AB_obs,BA_obs,BB_obs,chisquare,pvalue,qvalue,significant
18387,chr01,20053392,chr02,31482718,36,24,33,64,22.076023,6.3e-05,0.000523,True
18388,chr01,20053392,chr02,32661204,34,27,32,64,20.859649,0.000113,0.000865,True
18389,chr01,20053392,chr02,32676129,34,27,32,64,20.859649,0.000113,0.000865,True
18623,chr01,20103165,chr02,31482718,36,24,33,64,22.076023,6.3e-05,0.000523,True
18624,chr01,20103165,chr02,32661204,34,27,32,64,20.859649,0.000113,0.000865,True
18625,chr01,20103165,chr02,32676129,34,27,32,64,20.859649,0.000113,0.000865,True
18859,chr01,20144421,chr02,31482718,36,24,33,64,22.076023,6.3e-05,0.000523,True
18860,chr01,20144421,chr02,32661204,34,27,32,64,20.859649,0.000113,0.000865,True
18861,chr01,20144421,chr02,32676129,34,27,32,64,20.859649,0.000113,0.000865,True
19095,chr01,20154857,chr02,31482718,36,24,33,64,22.076023,6.3e-05,0.000523,True


In [10]:
# Now, we can record which genotype class each pair is skewed toward by taking the
# largest of the four observed-count columns (positions 4:8 are AA_obs, AB_obs, BA_obs, BB_obs)
for key in samples:
    significant_results[key]['skew'] = significant_results[key].iloc[:, 4:8].idxmax(axis=1)

In [11]:
significant_results['N01'][0:10]

Unnamed: 0,CHROM_a,POS_a,CHROM_b,POS_b,AA_obs,AB_obs,BA_obs,BB_obs,chisquare,pvalue,qvalue,significant,skew
18387,chr01,20053392,chr02,31482718,36,24,33,64,22.076023,6.3e-05,0.000523,True,BB_obs
18388,chr01,20053392,chr02,32661204,34,27,32,64,20.859649,0.000113,0.000865,True,BB_obs
18389,chr01,20053392,chr02,32676129,34,27,32,64,20.859649,0.000113,0.000865,True,BB_obs
18623,chr01,20103165,chr02,31482718,36,24,33,64,22.076023,6.3e-05,0.000523,True,BB_obs
18624,chr01,20103165,chr02,32661204,34,27,32,64,20.859649,0.000113,0.000865,True,BB_obs
18625,chr01,20103165,chr02,32676129,34,27,32,64,20.859649,0.000113,0.000865,True,BB_obs
18859,chr01,20144421,chr02,31482718,36,24,33,64,22.076023,6.3e-05,0.000523,True,BB_obs
18860,chr01,20144421,chr02,32661204,34,27,32,64,20.859649,0.000113,0.000865,True,BB_obs
18861,chr01,20144421,chr02,32676129,34,27,32,64,20.859649,0.000113,0.000865,True,BB_obs
19095,chr01,20154857,chr02,31482718,36,24,33,64,22.076023,6.3e-05,0.000523,True,BB_obs


#### Next, we have to re-write the Interchromosomal LD binning to also account for coupling AA, coupling BB, or repulsion
Since coupling can be skewed either `AA` or `BB` and heterozygous skew is not directional (`AB` or `BA`), we have 3 possible colorings for the links. So when extending bins, we must not only account for distance, but also type of skew.
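
The three link categories can be written down as a small lookup. This is a sketch only: the category labels (`coupling_AA`, `coupling_BB`, `repulsion`) and the function name are hypothetical and are not used elsewhere in this notebook.

```python
# Hypothetical labels for the three link colorings described above.
LINK_CATEGORY = {
    "AA_obs": "coupling_AA",
    "BB_obs": "coupling_BB",
    "AB_obs": "repulsion",  # heterozygous skew is not directional,
    "BA_obs": "repulsion",  # so AB and BA share one category
}


def classify_dimer(counts):
    """Return (skew column, link category) for a dict of observed dimer counts."""
    # max() returns the first key holding the largest count, the same
    # tie-breaking as DataFrame.idxmax(axis=1) over columns in this order.
    skew = max(counts, key=counts.get)
    return skew, LINK_CATEGORY[skew]


example = {"AA_obs": 36, "AB_obs": 24, "BA_obs": 33, "BB_obs": 64}
print(classify_dimer(example))  # ('BB_obs', 'coupling_BB')
```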

In [20]:
# We want to obtain coords for all loci showing interchromosomal LD and account for coupling vs repulsion
max_distance = 5000000
loci = OrderedDict()
for key in samples:
    rows = significant_results[key].itertuples(index=False)
    # Prime zee pump!
    try:
        prev_row = next(rows)
    except StopIteration:
        # Iterator was empty so we call continue to move to the next sample
        continue
    # OK if we got here this sample should get a key in the OrderedDict()
    loci[key] = list()
    # Set the initial locus coords
    block = {'CHROM_a': prev_row.CHROM_a, 'POS_a_start': prev_row.POS_a, 'POS_a_end': prev_row.POS_a,
             'CHROM_b': prev_row.CHROM_b, 'POS_b_start': prev_row.POS_b, 'POS_b_end': prev_row.POS_b,
             'skew': prev_row.skew,}
    for curr_row in rows:
        if (curr_row.CHROM_a == prev_row.CHROM_a) and (curr_row.CHROM_b == prev_row.CHROM_b):
            if (curr_row.POS_a - prev_row.POS_a <= max_distance) and (np.abs(curr_row.POS_b - prev_row.POS_b) <= max_distance) and (curr_row.skew == prev_row.skew):
                # Expand your locus to include this marker
                # Note: Due to the way the data is sorted the "a" is monotonically increasing
                block['POS_a_end'] = curr_row.POS_a
                # Note: But the "b" data increases and then "resets" at each new "a"
                #       This is due to the nature of how the combinatoric comparisons were performed.
                #       Thus we need to ensure that we should actually be setting new start or end coordinates
                #       for the "b" side of this block.
                if curr_row.POS_b > block['POS_b_end']:
                    block['POS_b_end'] = curr_row.POS_b
                if curr_row.POS_b < block['POS_b_start']:
                    block['POS_b_start'] = curr_row.POS_b
            else:
                # We just done entered a new locus dudebro!
                # Add the coord-pair for the locus to the list of coord-pairs for the sample
                #loci[key].append([start_pos.CHROM_a, np.mean([start_pos.POS_a, end_pos.POS_a]),
                #                       start_pos.CHROM_b, np.mean([start_pos.POS_b, end_pos.POS_b])])
                loci[key].append([block['CHROM_a'], block['POS_a_start'], block['POS_a_end'],
                                  block['CHROM_b'], block['POS_b_start'], block['POS_b_end'],
                                  block['skew']])
                # Now reset the start_pos & end_pos, because we are at a new locus
                block['CHROM_a'] = curr_row.CHROM_a
                block['POS_a_start'] = curr_row.POS_a
                block['POS_a_end'] = curr_row.POS_a
                block['CHROM_b'] = curr_row.CHROM_b
                block['POS_b_start'] = curr_row.POS_b
                block['POS_b_end'] = curr_row.POS_b
                block['skew'] = curr_row.skew
        else:
            # Walked into a new chromosome, add the coord-pair for the locus to the list of coord-pairs for the sample
            #loci[key].append([start_pos.CHROM_a, np.mean([start_pos.POS_a, end_pos.POS_a]),
            #                       start_pos.CHROM_b, np.mean([start_pos.POS_b, end_pos.POS_b])])
            loci[key].append([block['CHROM_a'], block['POS_a_start'], block['POS_a_end'],
                              block['CHROM_b'], block['POS_b_start'], block['POS_b_end'],
                              block['skew']])
            # Now reset the start & end_pos, because we are at a new locus
            block['CHROM_a'] = curr_row.CHROM_a
            block['POS_a_start'] = curr_row.POS_a
            block['POS_a_end'] = curr_row.POS_a
            block['CHROM_b'] = curr_row.CHROM_b
            block['POS_b_start'] = curr_row.POS_b
            block['POS_b_end'] = curr_row.POS_b
            block['skew'] = curr_row.skew
        # No matter what else happened up above, the curr_row is now the prev_row
        prev_row = curr_row
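    # Flush the block that is still open once the rows run out; otherwise the sample's
    # last locus is never recorded (the outputs shown below predate this step).
    loci[key].append([block['CHROM_a'], block['POS_a_start'], block['POS_a_end'],
                      block['CHROM_b'], block['POS_b_start'], block['POS_b_end'],
                      block['skew']])
    # Now we're done with this sample, so let's make this list of lists into a DataFrame
    loci[key] = pd.DataFrame(data=loci[key], columns=['CHROM_a', 'POS_a_start', 'POS_a_end',
                                                      'CHROM_b', 'POS_b_start', 'POS_b_end', 'skew'])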
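
The binning rules in the cell above can be condensed into one function. The sketch below applies the same extension test (same chromosome pair, same skew, both positions within `max_distance` of the previous hit) and explicitly appends the final open block after the loop. The `Hit` tuple, the function name, and the toy positions are hypothetical.

```python
from collections import namedtuple

# Hypothetical record type mirroring the columns used by the binning loop.
Hit = namedtuple("Hit", "CHROM_a POS_a CHROM_b POS_b skew")


def merge_hits(hits, max_distance=5_000_000):
    """Merge consecutive significant hits into blocks with a shared skew."""
    blocks, block, prev = [], None, None
    for hit in hits:
        extends = (
            prev is not None
            and hit.CHROM_a == prev.CHROM_a
            and hit.CHROM_b == prev.CHROM_b
            and hit.skew == prev.skew
            and hit.POS_a - prev.POS_a <= max_distance
            and abs(hit.POS_b - prev.POS_b) <= max_distance
        )
        if extends:
            # "a" only grows; "b" can move either way, so widen both ends.
            block["POS_a_end"] = hit.POS_a
            block["POS_b_start"] = min(block["POS_b_start"], hit.POS_b)
            block["POS_b_end"] = max(block["POS_b_end"], hit.POS_b)
        else:
            if block is not None:
                blocks.append(block)
            block = {"CHROM_a": hit.CHROM_a, "POS_a_start": hit.POS_a, "POS_a_end": hit.POS_a,
                     "CHROM_b": hit.CHROM_b, "POS_b_start": hit.POS_b, "POS_b_end": hit.POS_b,
                     "skew": hit.skew}
        prev = hit
    # Flush the block that is still open when the input is exhausted.
    if block is not None:
        blocks.append(block)
    return blocks


toy_hits = [
    Hit("chr01", 100, "chr02", 500, "BB_obs"),
    Hit("chr01", 200, "chr02", 400, "BB_obs"),
    Hit("chr01", 300, "chr02", 450, "AA_obs"),  # skew changes, so a new block starts
]
blocks = merge_hits(toy_hits)
print(len(blocks))  # 2
```

Keeping the reset logic in one place avoids repeating the same seven assignments in both `else` branches.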

In [21]:
loci['N01'][0:10]

Unnamed: 0,CHROM_a,POS_a_start,POS_a_end,CHROM_b,POS_b_start,POS_b_end,skew
0,chr01,20053392,22113938,chr02,31482718,34030257,BB_obs
1,chr01,22148052,22148052,chr02,18646610,18686916,BB_obs
2,chr01,22148052,22148052,chr02,29563080,34030257,BB_obs
3,chr01,22148902,22148902,chr02,18646610,18686916,BB_obs
4,chr01,22148902,22277753,chr02,29563080,34030257,BB_obs
5,chr01,22313907,22313907,chr02,18646610,18686916,BB_obs
6,chr01,22313907,22313907,chr02,29370785,34030257,BB_obs
7,chr01,22319807,22319807,chr02,18646610,18686916,BB_obs
8,chr01,22319807,22336584,chr02,29370785,34030257,BB_obs
9,chr01,22352813,22352813,chr02,18646610,18686916,BB_obs


In [22]:
loci['N01'].shape

(15989, 7)

In [23]:
for key in loci.keys():
    loci[key].to_csv('interchromosomal_linkage_analysis/{}_ICLD_gtskew_5M_windows.tsv'.format(key), sep='\t', index=False)

#### Searching for lethal combinations
To identify lethal gene combinations, we can examine instances where homozygosity for either parent's markers is not seen within a population. This is _prima facie_ evidence that such a combination is non-viable (i.e., lethal), because under Mendelian segregation we expect each pair of non-linked homozygous loci to appear in roughly equal proportion.
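
To see why a completely missing class is striking, consider how unlikely it is under the notebook's null expectation of 0.25 per dimer class. The sketch below assumes lines are independent draws; the function name and the population sizes are hypothetical.

```python
def prob_class_absent(n_lines, class_prob=0.25):
    """Probability that a dimer class is never observed among n_lines independent lines."""
    # Each line misses the class with probability (1 - class_prob).
    return (1.0 - class_prob) ** n_lines


# Hypothetical population sizes, for illustration only.
for n in (10, 50, 150):
    print(n, prob_class_absent(n))
```

Even at 50 lines the chance of seeing no `AA` (or no `BB`) dimers by accident is below one in a million, so a zero count in a population of realistic size is hard to attribute to sampling alone.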

In [45]:
loci_of_interest = OrderedDict()
for key in samples:
    loci_of_interest[key] = significant_results[key].loc[(significant_results[key].AA_obs == 0) | (significant_results[key].BB_obs == 0)]

In [56]:
# Let's remove keys from the dictionary where the shape of the dataframe is 0 rows, 
# meaning there were no instances of potentially lethal genotypes seen between loci.
for key in samples:
    if loci_of_interest[key].shape[0] == 0:
        loci_of_interest.pop(key, None)

In [57]:
# See which samples remain from the original set
loci_of_interest.keys()

odict_keys(['N01', 'N07', 'N08', 'N09', 'N14', 'N16', 'N17'])

In [59]:
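# Tell me which direction the genome skews for each remaining sample
# Note: index by `key`. With 'N01' hard-coded here, every line reports N01's skew set,
# which is why the recorded output below is identical for all samples.
for key in loci_of_interest:
    print('{} skew values present {}.'.format(key, set(loci_of_interest[key]['skew'])))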

N01 skew values present {'AA_obs', 'AB_obs'}.
N07 skew values present {'AA_obs', 'AB_obs'}.
N08 skew values present {'AA_obs', 'AB_obs'}.
N09 skew values present {'AA_obs', 'AB_obs'}.
N14 skew values present {'AA_obs', 'AB_obs'}.
N16 skew values present {'AA_obs', 'AB_obs'}.
N17 skew values present {'AA_obs', 'AB_obs'}.
