# Analysis

This notebook contains the code necessary to conduct all data analyses for the Pan 3D Genome project. I have organized this largely by order of appearance in the manuscript; however, a few sections may be out of order. Use the Table of Contents below to navigate to specific analyses.

Note that the genomic windows used to predict the 3D genome are in 0-based coordinates.
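
As a minimal sketch of what this implies, the snippet below parses a window name such as `chr1_1048576` into a BED-style interval. The 1,048,576 bp window length matches the value added to window starts later in this notebook. The helper name `window_to_interval` is hypothetical, and treating the interval as half-open is an assumption based on BED conventions.

```python
WINDOW_SIZE = 1_048_576  # 2**20 bp, the window length used later in this notebook

def window_to_interval(window, size=WINDOW_SIZE):
    """Parse 'chr_start' into a 0-based, half-open (chrom, start, end) tuple."""
    # rsplit guards against chromosome names that themselves contain '_'
    chrom, start = window.rsplit('_', 1)
    start = int(start)
    return chrom, start, start + size

chrom, start, end = window_to_interval('chr1_1048576')
print(chrom, start, end)
```

Under these assumptions, the end coordinate is the first base after the window, so consecutive windows can be compared without off-by-one adjustments.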

## Table of Contents

- [Load Packages and Main Dataframe](#loaddataframe)
- [Data Description](#datadescription)
    - [N Windows and Comparisons](#nwindowscomparisons)
    - [Divergence Score Distribution](#divergencescoredistribution)
- [Pan-Homo Divergence Score Distribution Comparison](#panhomodivergencescores)
- [Minimally Divergent Windows and Ultraconserved TAD Boundaries](#okhovatetal2023)
- [Lineage Comparison](#lineagecomparison)
- [Hierarchical Clustering](#hierachicalclustering)
- [Bonobo-Chimpanzee Windows](#bonobochimpanzeewindows)

## Load Packages and Main Dataframe <a class = 'anchor' id = 'loaddataframe'></a>

Load all needed packages, change directories, and load the main dataframe (HFF) that we previously generated.

In [1]:
import json
import numpy as np
import pandas as pd
import pybedtools

from scipy.stats import fisher_exact
from scipy.stats import kstest
from scipy.stats import kruskal
from scipy.stats import mannwhitneyu
from scipy.stats import spearmanr

pd.options.display.max_columns = 100
pd.options.display.max_rows = 500

In [2]:
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

In [3]:
import matplotlib.font_manager as font_manager
arial_path = '/wynton/home/capra/cbrand/miniconda3/envs/jupyter/fonts/Arial.ttf'
arial = font_manager.FontProperties(fname = arial_path)

In [4]:
cd ../../data

/wynton/group/capra/projects/pan_3d_genome/data


In [5]:
comparisons = pd.read_csv('dataframes/HFF_comparisons.txt', sep = '\t', header = 0)
comparisons.head(5)

Unnamed: 0,ind1,ind2,dyad_type,chr,window_start,window,mse,spearman,divergence,seq_diff
0,Akwaya-Jean,Alfred,pte-ptt,chr10,1572864,chr10_1572864,0.000168,0.987655,0.012345,2803
1,Akwaya-Jean,Alfred,pte-ptt,chr10,2097152,chr10_2097152,0.000481,0.969809,0.030191,2715
2,Akwaya-Jean,Alfred,pte-ptt,chr10,2621440,chr10_2621440,0.001675,0.996398,0.003602,2849
3,Akwaya-Jean,Alfred,pte-ptt,chr10,3145728,chr10_3145728,0.000323,0.997899,0.002101,2606
4,Akwaya-Jean,Alfred,pte-ptt,chr10,3670016,chr10_3670016,0.000143,0.996732,0.003268,2594


Double check that the dataframe has the expected number of rows.

In [6]:
len(comparisons)

6669390

## Data Description <a class = 'anchor' id = 'datadescription'></a>

### N Windows and Comparisons <a class = 'anchor' id = 'nwindowscomparisons'></a>

Let's generate some basic statistics for this dataset before diving into specific analyses. First, let's confirm the total number of windows and the number of windows per chromosome.

In [7]:
len(comparisons['window'].unique())

4420

In [8]:
windows_df = pd.DataFrame(comparisons['window'].unique(), columns = ['window'])
windows_df = windows_df['window'].str.split('_', expand = True).rename(columns = {0:'chr', 1:'window_start'})
windows_df.groupby(['chr']).size().to_frame('N')

chr,N
chr1,341
chr10,208
chr11,226
chr12,210
chr13,166
chr14,151
chr15,120
chr16,102
chr17,99
chr18,126


How many comparisons are there per autosomal window, and how many per X chromosome window?

In [9]:
len(comparisons[comparisons['window'] == 'chr1_1048576'])

1540

In [10]:
len(comparisons[comparisons['window'] == 'chrX_4718592'])

630
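
These two counts are what you would expect if every pair of individuals is compared once per window. The quick check below assumes 56 individuals for autosomal windows and 36 for chrX windows, which are the numbers of names in the tree headers defined later in this notebook.

```python
from math import comb

n_autosomal = 56  # individuals listed in the autosomal tree header
n_chrX = 36       # individuals listed in the chrX tree header

autosomal_pairs = comb(n_autosomal, 2)  # unordered pairs of individuals
chrX_pairs = comb(n_chrX, 2)
print(autosomal_pairs, chrX_pairs)  # 1540 630
```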

How many comparisons are there per chromosome?

In [11]:
comparisons.groupby('chr').size().to_frame('N')

chr,N
chr1,525140
chr10,320320
chr11,348040
chr12,323400
chr13,255640
chr14,232540
chr15,184800
chr16,157080
chr17,152460
chr18,194040


How many dyad comparisons are there for a single autosomal window?

In [12]:
comparisons[comparisons['window'] == 'chr1_1048576'].groupby('dyad_type').size().to_frame('N').sort_values(by = 'N', ascending = False)

dyad_type,N
ppn-pt,423
pts-ptt,272
pts-ptv,153
ptt-ptv,144
pts,136
ptt,120
pte-pts,85
pte-ptt,80
pte-ptv,45
ppn,36
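
The counts above are consistent with simple pair arithmetic: within-group dyads number n choose 2, and between-group dyads number the product of the two group sizes. The sketch below uses hypothetical group sizes that I inferred from the counts; they are not stated anywhere in this notebook. It also assumes that the within-group `pte` and `ptv` rows exist but are not among the ten rows shown.

```python
from itertools import combinations
from math import comb

# Hypothetical group sizes, inferred from the counts above rather than stated in the notebook
group_sizes = {'ppn': 9, 'pte': 5, 'pts': 17, 'ptt': 16, 'ptv': 9}

# Within-group dyads: every unordered pair inside a group
counts = {group: comb(n, 2) for group, n in group_sizes.items()}

# Between-subspecies chimpanzee dyads: product of the two group sizes
chimps = sorted(group for group in group_sizes if group != 'ppn')
for a, b in combinations(chimps, 2):
    counts[f'{a}-{b}'] = group_sizes[a] * group_sizes[b]

# Bonobo-chimpanzee dyads: bonobos paired with all chimpanzees
counts['ppn-pt'] = group_sizes['ppn'] * sum(group_sizes[g] for g in chimps)

print(counts['ppn-pt'], counts['pts-ptt'], sum(counts.values()))
```

With these sizes the per-type counts reproduce the table and sum to 1540, the number of comparisons per autosomal window.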


How many dyad comparisons are there for a single chrX window?

In [13]:
comparisons[comparisons['window'] == 'chrX_4718592'].groupby('dyad_type').size().to_frame('N').sort_values(by = 'N', ascending = False)

dyad_type,N
ppn-pt,203
pts-ptt,121
pts,55
pts-ptv,55
ptt,55
ptt-ptv,55
pte-pts,22
pte-ptt,22
ppn,21
pte-ptv,10


### Divergence Score Distribution <a class = 'anchor' id = 'divergencescoredistribution'></a>

What does the distribution of divergence scores look like?

In [14]:
comparisons['divergence'].min()

2.492599999737166e-07

In [15]:
comparisons['divergence'].max()

0.873604563894

How many values are less than 0.01?

In [16]:
len(comparisons[comparisons['divergence'] < 0.01])

5539567

In [17]:
len(comparisons[comparisons['divergence'] < 0.01]) / len(comparisons)

0.830595751635457

What is the maximum divergence score per window?

In [18]:
maxes = comparisons.groupby(['window'])['divergence'].max().to_frame('max').reset_index()
maxes

Unnamed: 0,window,max
0,chr10_100139008,0.128959
1,chr10_100663296,0.101497
2,chr10_101187584,0.033608
3,chr10_101711872,0.138848
4,chr10_102236160,0.031867
...,...,...
4415,chrX_94371840,0.076931
4416,chrX_94896128,0.044540
4417,chrX_95420416,0.002984
4418,chrX_99090432,0.030925


How many windows have a maximum divergence below 0.005, and how many below 0.01? What is the maximum divergence at various quantiles?

In [19]:
len(maxes[maxes['max'] < 0.005])

245

In [20]:
len(maxes[maxes['max'] < 0.01])

874

In [21]:
maxes['max'].quantile(q = 0.1)

0.006717990286899901

In [22]:
maxes['max'].quantile(q = 0.25)

0.0119926032

In [23]:
maxes['max'].quantile(q = 0.5)

0.02476536453449995

In [24]:
maxes['max'].quantile(q = 0.9)

0.11610571464919983
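
For reference, here is a small sketch of quantile estimation by linear interpolation between order statistics, which I understand to be the default behavior of the pandas `quantile` method. The function name `linear_quantile` and the toy values are hypothetical and are not drawn from the dataset.

```python
import math

def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)  # fractional index into the sorted values
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

toy_maxes = [0.002, 0.01, 0.02, 0.05, 0.12]  # hypothetical values for illustration
for q in (0.1, 0.25, 0.5, 0.9):
    print(q, linear_quantile(toy_maxes, q))
```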

Get divergence scores for Figure 1C examples.

In [25]:
comparisons[(comparisons['window'] == 'chr4_77594624') & (comparisons['ind1'] == 'Maya') & (comparisons['ind2'] == 'Washu')]

Unnamed: 0,ind1,ind2,dyad_type,chr,window_start,window,mse,spearman,divergence,seq_diff
6211267,Maya,Washu,pts,chr4,77594624,chr4_77594624,1.090777e-07,1.0,4.52621e-07,11


In [26]:
comparisons[(comparisons['window'] == 'chr18_46137344') & (comparisons['ind1'] == 'Julie-A959') & (comparisons['ind2'] == 'Vincent')]

Unnamed: 0,ind1,ind2,dyad_type,chr,window_start,window,mse,spearman,divergence,seq_diff
5136420,Julie-A959,Vincent,pts-ptt,chr18,46137344,chr18_46137344,0.020585,0.350611,0.649389,3049


## Pan-Homo Divergence Score Distribution Comparison <a class = 'anchor' id = 'panhomodivergencescores'></a>

Compare the Pan divergence score distribution to the distribution calculated from 130 modern human genomes.

In [27]:
human_comparisons = pd.read_csv('comparisons/thousand_genomes_subset_HFF/melted_130v130_1KG_subsample.csv', sep = ',', header = 0)
human_comparisons.head(5)

Unnamed: 0,ind1,ind2,chrm,start_pos,3d_divergence
0,AFR_ASW_female_NA19917,AFR_ASW_female_NA19901,chr1,1048576,0.002916
1,AFR_ASW_female_NA19917,AFR_ASW_female_NA20314,chr1,1048576,0.003368
2,AFR_ASW_female_NA19917,AFR_ASW_female_NA20317,chr1,1048576,0.005616
3,AFR_ASW_female_NA19917,AFR_ASW_female_NA19625,chr1,1048576,0.001083
4,AFR_ASW_female_NA19917,AFR_ACB_female_HG02337,chr1,1048576,0.004772


In [28]:
len(human_comparisons)

40860105

In [29]:
kstest(comparisons['divergence'], human_comparisons['3d_divergence'])

KstestResult(statistic=0.32773314181724106, pvalue=0.0)
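
The statistic reported above is the largest vertical gap between the two empirical cumulative distribution functions. With samples this large, even small differences yield vanishingly small p-values, so the statistic itself is the more informative number. The pure-Python sketch below computes that gap on hypothetical toy samples; it is for illustration only and is not a replacement for `kstest`.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so check each one
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])   # partially overlapping samples
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])        # fully separated samples
print(shifted, disjoint)
```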

## Minimally Divergent Windows and Ultraconserved TAD Boundaries <a class = 'anchor' id = 'okhovatetal2023'></a>

Let's consider the overlap between minimally divergent windows and ultra- and primate-conserved boundaries from Okhovat et al. 2023. First, let's read in the ultraconserved and primate-conserved files.

In [30]:
ultraconserved_pbtBED = pybedtools.BedTool('Okhovat_et_al_2023/Okhovat_et_al_2023_UltraConserved_Boundaries_panTro6.bed')
ultraconserved_pbtBED.head(5)

chr1	1098611	1204332
 chr1	2029035	2038921
 chr1	5402415	5412399
 chr1	7054986	7064987
 chr1	7946002	7956016
 

In [31]:
len(ultraconserved_pbtBED)

1012

In [32]:
primate_conserved_pbtBED = pybedtools.BedTool('Okhovat_et_al_2023/Okhovat_et_al_2023_Primate_Conserved_Boundaries_panTro6.bed')
primate_conserved_pbtBED.head(5)

chr1	13254423	13264412
 chr1	13745195	13755219
 chr1	14428008	14437197
 chr1	17375951	17385984
 chr1	24694359	24704375
 

In [33]:
len(primate_conserved_pbtBED)

486

Do these sets overlap at all?

In [34]:
ultraconserved_primate_conserved_intersect = ultraconserved_pbtBED.intersect(primate_conserved_pbtBED)

In [35]:
len(ultraconserved_primate_conserved_intersect)

0

Intersect the TAD boundaries with the available panTro6 windows to determine the maximum amount of overlap.

In [36]:
windows_pbtBED = pybedtools.BedTool('metadata/panTro6_windows_with_full_coverage.bed')
windows_pbtBED.head(5)

chr1	1048576	2097152
 chr1	1572864	2621440
 chr1	2097152	3145728
 chr1	2621440	3670016
 chr1	3145728	4194304
 

In [37]:
len(ultraconserved_pbtBED.intersect(windows_pbtBED, u = True))

951

In [38]:
len(primate_conserved_pbtBED.intersect(windows_pbtBED, u = True))

438
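
The `u = True` option reports each boundary once if it overlaps any window, no matter how many windows it touches. A minimal sketch of that counting logic follows, assuming 0-based, half-open intervals. The first two boundaries and both windows are taken from the outputs above; the third boundary is hypothetical and added only to show a non-overlapping case.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two 0-based, half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end

def count_features_with_overlap(features, windows):
    """Count features that overlap at least one window on the same chromosome."""
    return sum(
        any(f_chr == w_chr and overlaps(f_start, f_end, w_start, w_end)
            for w_chr, w_start, w_end in windows)
        for f_chr, f_start, f_end in features
    )

boundaries = [
    ('chr1', 1098611, 1204332),
    ('chr1', 2029035, 2038921),  # overlaps both windows but is counted once
    ('chr1', 500000, 510000),    # hypothetical boundary outside every window
]
windows = [('chr1', 1048576, 2097152), ('chr1', 1572864, 2621440)]
n_overlapping = count_features_with_overlap(boundaries, windows)
print(n_overlapping)  # 2
```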

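Write a function to quantify the number and proportion of ultraconserved and primate-conserved boundaries that overlap the minimally divergent windows (MDWs) at a given MDW threshold. The proportions use the 951 ultraconserved and 438 primate-conserved boundaries that overlap at least one window with full coverage as denominators.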

In [39]:
def conserved_MDW_intersect(cutoff):
    # Minimally divergent windows (MDWs): windows whose maximum divergence is below the cutoff
    MDWs = maxes[maxes['max'] < cutoff]
    MDWs = MDWs['window'].str.split('_', expand = True).rename(columns = {0:'chr', 1:'start'})
    MDWs['start'] = MDWs['start'].astype(int)
    MDWs['end'] = MDWs['start'] + 1048576  # windows are 1,048,576 bp (2^20) long
    MDWs_pbtBED = pybedtools.BedTool().from_dataframe(MDWs).sort()
    
    # c = True appends the number of overlapping MDWs to each boundary
    ultraconserved_MDWs_intersect = ultraconserved_pbtBED.intersect(MDWs_pbtBED, c = True).to_dataframe(names=['chr','start','end','count'])
    primate_conserved_MDWs_intersect = primate_conserved_pbtBED.intersect(MDWs_pbtBED, c = True).to_dataframe(names=['chr','start','end','count'])
    
    # Keep the boundaries that overlap at least one MDW
    n_ultraconserved = len(ultraconserved_MDWs_intersect[ultraconserved_MDWs_intersect['count'] > 0])
    n_primate_conserved = len(primate_conserved_MDWs_intersect[primate_conserved_MDWs_intersect['count'] > 0])
    
    # Denominators: boundaries overlapping any window with full coverage (cells 37 and 38)
    n_ultraconserved_prop = n_ultraconserved/951
    n_primate_conserved_prop = n_primate_conserved/438
    
    return n_ultraconserved, n_ultraconserved_prop, n_primate_conserved, n_primate_conserved_prop

In [40]:
conserved_MDW_intersect(0.01)

(387, 0.4069400630914827, 164, 0.3744292237442922)

In [41]:
conserved_MDW_intersect(0.02)

(687, 0.722397476340694, 299, 0.682648401826484)

In [42]:
conserved_MDW_intersect(0.03)

(809, 0.85068349106204, 352, 0.8036529680365296)

In [43]:
conserved_MDW_intersect(0.04)

(866, 0.9106203995793901, 385, 0.8789954337899544)

In [44]:
conserved_MDW_intersect(0.05)

(900, 0.9463722397476341, 400, 0.91324200913242)

What is the Pan divergence distribution for both sets of regions? Start by creating a pybedtools object for the maxes dataframe.

In [45]:
window_max_divergence = maxes['max']
maxes_BED = maxes['window'].str.split('_', expand = True).rename(columns = {0:'chr', 1:'start'})
maxes_BED['start'] = maxes_BED['start'].astype(int)
maxes_BED['end'] = maxes_BED['start'] + 1048576
maxes_BED['max'] = window_max_divergence
maxes_BED.head(5)

Unnamed: 0,chr,start,end,max
0,chr10,100139008,101187584,0.128959
1,chr10,100663296,101711872,0.101497
2,chr10,101187584,102236160,0.033608
3,chr10,101711872,102760448,0.138848
4,chr10,102236160,103284736,0.031867


In [46]:
len(maxes_BED)

4420

In [47]:
maxes_pbtBED = pybedtools.BedTool().from_dataframe(maxes_BED).sort()
maxes_pbtBED.head(5)

chr1	1048576	2097152	0.036526326318
 chr1	1572864	2621440	0.006578744897
 chr1	2097152	3145728	0.015877001035
 chr1	2621440	3670016	0.079123467404
 chr1	3145728	4194304	0.192662189319
 

Now intersect with the ultraconserved regions. Some TAD boundaries may overlie two overlapping windows. We will compute a centrality score and use the maximum divergence from the window in which the TAD boundary is most central.

The centrality score below is computed as:

$$ \mathrm{Centrality\:score} = |0.5 - \frac{(\frac{\mathrm{TAD\:Boundary\:End\:-\:TAD\:Boundary\:Start}}{2} + \mathrm{TAD\:Boundary\:Start}) - \mathrm{Window\:Start}}{1048576}|$$ 

Scores at or near 0 indicate that the TAD boundary is central to that window, whereas scores approaching 0.5 indicate that the boundary lies near the window's edge.
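
As a worked example, the sketch below applies the formula to the ultraconserved boundary at chr1:2029035-2038921, which overlaps the windows starting at 1048576 and 1572864. The function name `centrality_score` is a hypothetical helper that mirrors the formula above; the coordinates come from the BED outputs in this notebook.

```python
WINDOW_SIZE = 1_048_576  # window length in bp

def centrality_score(tb_start, tb_end, window_start, window_size=WINDOW_SIZE):
    """Distance of the boundary midpoint from the window center, as a fraction of window length."""
    midpoint = (tb_end - tb_start) / 2 + tb_start
    return abs(0.5 - (midpoint - window_start) / window_size)

# One boundary, two overlapping candidate windows
candidate_window_starts = [1048576, 1572864]
scores = {ws: centrality_score(2029035, 2038921, ws) for ws in candidate_window_starts}

# The window with the smallest score is the one in which the boundary is most central
most_central = min(scores, key=scores.get)
print(scores, most_central)
```

The boundary scores roughly 0.44 in the first window and 0.06 in the second, so the second window's maximum divergence is the one assigned to this boundary.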

Run the intersection for the ultraconserved boundaries.

In [48]:
ultraconserved_maxes_intersect = ultraconserved_pbtBED.intersect(maxes_pbtBED, wa = True, wb = True).to_dataframe(names=['tb_chr','tb_start','tb_end','window_chr','window_start','window_end','max'])
ultraconserved_maxes_intersect['centrality_score'] = abs(0.5 - (((((ultraconserved_maxes_intersect['tb_end'] - ultraconserved_maxes_intersect['tb_start']) / 2) + ultraconserved_maxes_intersect['tb_start']) - ultraconserved_maxes_intersect['window_start']) / 1048576))  # add a center score to assess centrality of TAD boundary when one overlaps multiple 3D windows
ultraconserved_maxes_intersect = ultraconserved_maxes_intersect.sort_values(by=['tb_chr','tb_start','centrality_score'], ascending = True)
ultraconserved_maxes_intersect.head(5)

Unnamed: 0,tb_chr,tb_start,tb_end,window_chr,window_start,window_end,max,centrality_score
0,chr1,1098611,1204332,chr1,1048576,2097152,0.036526,0.401871
2,chr1,2029035,2038921,chr1,1572864,2621440,0.006579,0.060247
1,chr1,2029035,2038921,chr1,1048576,2097152,0.036526,0.439753
3,chr1,5402415,5412399,chr1,5242880,6291456,0.017641,0.343095
4,chr1,7054986,7064987,chr1,6291456,7340032,0.002978,0.232928


In [49]:
ultraconserved_maxes_intersect = ultraconserved_maxes_intersect.drop_duplicates(subset = ['tb_chr','tb_start','tb_end'], keep = 'first')
ultraconserved_maxes_intersect.head(5)

Unnamed: 0,tb_chr,tb_start,tb_end,window_chr,window_start,window_end,max,centrality_score
0,chr1,1098611,1204332,chr1,1048576,2097152,0.036526,0.401871
2,chr1,2029035,2038921,chr1,1572864,2621440,0.006579,0.060247
3,chr1,5402415,5412399,chr1,5242880,6291456,0.017641,0.343095
4,chr1,7054986,7064987,chr1,6291456,7340032,0.002978,0.232928
6,chr1,7946002,7956016,chr1,7340032,8388608,0.021275,0.082673


In [50]:
len(ultraconserved_maxes_intersect)

951

In [51]:
ultraconserved_maxes_intersect['max'].to_csv('Okhovat_et_al_2023/ultraconserved_window_maxes.txt', sep = '\t', header = False, index = False)

Now the primate-conserved.

In [52]:
primate_conserved_maxes_intersect = primate_conserved_pbtBED.intersect(maxes_pbtBED, wa = True, wb = True).to_dataframe(names=['tb_chr','tb_start','tb_end','window_chr','window_start','window_end','max'])
primate_conserved_maxes_intersect['centrality_score'] = abs(0.5 - (((((primate_conserved_maxes_intersect['tb_end'] - primate_conserved_maxes_intersect['tb_start']) / 2) + primate_conserved_maxes_intersect['tb_start']) - primate_conserved_maxes_intersect['window_start']) / 1048576))  # add a center score to assess centrality of TAD boundary when one overlaps multiple 3D windows
primate_conserved_maxes_intersect = primate_conserved_maxes_intersect.sort_values(by=['tb_chr','tb_start','centrality_score'], ascending = True)
primate_conserved_maxes_intersect = primate_conserved_maxes_intersect.drop_duplicates(subset = ['tb_chr','tb_start','tb_end'], keep = 'first')
primate_conserved_maxes_intersect.head(5)

Unnamed: 0,tb_chr,tb_start,tb_end,window_chr,window_start,window_end,max,centrality_score
0,chr1,13254423,13264412,chr1,12582912,13631488,0.140253,0.145166
3,chr1,13745195,13755219,chr1,13107200,14155776,0.039724,0.113219
4,chr1,14428008,14437197,chr1,13631488,14680064,0.009116,0.264002
5,chr1,17375951,17385984,chr1,16777216,17825792,0.028705,0.075782
7,chr1,24694359,24704375,chr1,24117248,25165824,0.015984,0.055152


In [53]:
len(primate_conserved_maxes_intersect)

438

In [54]:
primate_conserved_maxes_intersect['max'].to_csv('Okhovat_et_al_2023/primate_conserved_window_maxes.txt', sep = '\t', header = False, index = False)

Are these distributions different? Run a KS test. First the ultraconserved.

In [55]:
kstest(maxes['max'], ultraconserved_maxes_intersect['max'])

KstestResult(statistic=0.18061959071422795, pvalue=8.525650513335447e-23)

In [56]:
maxes['max'].mean()

0.05016095426516081

In [57]:
ultraconserved_maxes_intersect['max'].mean()

0.024565984185433182

Now the primate-conserved.

In [58]:
kstest(maxes['max'], primate_conserved_maxes_intersect['max'])

KstestResult(statistic=0.1275119320647121, pvalue=4.1411107067604456e-06)

In [59]:
primate_conserved_maxes_intersect['max'].mean()

0.03270146757034698

Export the window maxes as a genome-wide background for visualization.

In [60]:
maxes['max'].to_csv('Okhovat_et_al_2023/window_maxes.txt', sep = '\t', header = False, index = False)

## Lineage Comparison <a class = 'anchor' id = 'lineagecomparison'></a>

Let's look at the median 3D divergence by dyad type. We will simplify things by collapsing all comparisons between chimpanzees of different subspecies into a single 'pt-pt' category.

In [61]:
simple_dyad_comparisons = comparisons[['ind1','ind2','dyad_type','divergence']].copy()
simple_dyad_comparisons['dyad_type'] = simple_dyad_comparisons['dyad_type'].replace({'pte-pts':'pt-pt', 'pte-ptt':'pt-pt', 'pte-ptv':'pt-pt', 'pts-ptt':'pt-pt', 'pts-ptv':'pt-pt', 'ptt-ptv':'pt-pt'})
simple_dyad_comparisons.head(5)

Unnamed: 0,ind1,ind2,dyad_type,divergence
0,Akwaya-Jean,Alfred,pt-pt,0.012345
1,Akwaya-Jean,Alfred,pt-pt,0.030191
2,Akwaya-Jean,Alfred,pt-pt,0.003602
3,Akwaya-Jean,Alfred,pt-pt,0.002101
4,Akwaya-Jean,Alfred,pt-pt,0.003268
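
The explicit dictionary passed to `replace` above can also be expressed as a rule: any label made of two different chimpanzee subspecies codes becomes 'pt-pt', and everything else is left alone. The sketch below shows that rule with a hypothetical helper `collapse_dyad` and hypothetical toy records; it is an alternative formulation, not the code used for the results.

```python
from statistics import median

CHIMP_SUBSPECIES = {'pte', 'pts', 'ptt', 'ptv'}

def collapse_dyad(dyad_type):
    """Map comparisons between two chimpanzee subspecies to 'pt-pt'."""
    parts = dyad_type.split('-')
    if len(parts) == 2 and all(part in CHIMP_SUBSPECIES for part in parts):
        return 'pt-pt'
    return dyad_type  # within-group labels and 'ppn-pt' are unchanged

# Hypothetical (dyad_type, divergence) records for illustration only
records = [('pte-ptt', 0.012), ('pts-ptv', 0.004), ('ppn-pt', 0.03),
           ('pts', 0.001), ('ptt-ptv', 0.002)]

grouped = {}
for dyad_type, divergence in records:
    grouped.setdefault(collapse_dyad(dyad_type), []).append(divergence)

medians = {dyad_type: median(values) for dyad_type, values in grouped.items()}
print(medians)
```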


In [62]:
simple_dyad_comparisons.groupby('dyad_type')['divergence'].median().to_frame('median')

dyad_type,median
ppn,0.000829
ppn-pt,0.00413
pt-pt,0.002331
pte,0.001282
pts,0.001629
ptt,0.002165
ptv,0.00059


## Hierarchical Clustering <a class = 'anchor' id = 'hierachicalclustering'></a>

Run a script to generate trees per window, record the cluster identity of each individual, and capture the y-coordinates of the top four nodes in the tree. The script's output is loaded below.
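
The clustering script itself is not part of this notebook. To illustrate the idea, here is a naive pure-Python sketch of complete-linkage agglomerative clustering on a small distance matrix. The labels and distances are hypothetical, and whether the original script works this way is an assumption; the `scipy.cluster.hierarchy` module imported above provides an optimized implementation. The merge heights returned here are what a dendrogram plots on its y-axis.

```python
def complete_linkage(distances, labels):
    """Naive agglomerative clustering; returns (merge height, merged members) per step."""
    clusters = [[i] for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance
                d = max(distances[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        height, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((height, sorted(labels[k] for k in merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical pairwise divergences among three individuals
labels = ['A', 'B', 'C']
distances = [[0.0, 0.1, 0.5],
             [0.1, 0.0, 0.4],
             [0.5, 0.4, 0.0]]
merges = complete_linkage(distances, labels)
print(merges)
```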

In [63]:
complete_linkage_trees = pd.read_csv('dataframes/complete_linkage_clustering_per_window.txt', sep = '\t', header = None)

In [64]:
autosomal_complete_linkage_trees = complete_linkage_trees[~complete_linkage_trees[0].str.startswith('chrX_')]
chrX_complete_linkage_trees = complete_linkage_trees[complete_linkage_trees[0].str.startswith('chrX_')]

In [65]:
autosomal_trees_header = ['window','Akwaya-Jean','Alfred','Alice','Andromeda','Athanga','Berta','Bihati','Blanquita','Bono','Bosco','Brigitta','Bwamble','Cindy-schwein','Cindy-troglodytes','Cindy-verus','Cleo','Clint','Coco-chimp','Damian','Desmond','Doris','Dzeeta','Frederike','Gamin','Hermien','Hortense','Ikuru','Jimmie','Julie-A959','Julie-LWC21','Kidongo','Koby','Kombote','Kosana','Koto','Kumbuka','Lara','Linda','Luky','Marlin','Maya','Mgbadolite','Mirinda','Nakuu','Natalie','Negrita','SeppToni','Taweh','Tibe','Tongo','Trixie','Ula','Vaillant','Vincent','Washu','Yogui','node_0','node_1','node_2','node_3']
chrX_trees_header = ['window','Alice','Andromeda','Berta','Bihati','Blanquita','Cindy-schwein','Cindy-troglodytes','Cindy-verus','Cleo','Coco-chimp','Doris','Dzeeta','Frederike','Hermien','Hortense','Ikuru','Jimmie','Julie-A959','Julie-LWC21','Kidongo','Kombote','Kosana','Kumbuka','Lara','Linda','Luky','Marlin','Maya','Mirinda','Nakuu','Natalie','Negrita','Taweh','Tibe','Trixie','Ula','node_0','node_1','node_2','node_3']

autosomal_complete_linkage_trees.columns = autosomal_trees_header
chrX_complete_linkage_trees = chrX_complete_linkage_trees.dropna(axis = 1)
chrX_complete_linkage_trees.columns = chrX_trees_header

In [66]:
autosomal_complete_linkage_trees.head(5)

Unnamed: 0,window,Akwaya-Jean,Alfred,Alice,Andromeda,Athanga,Berta,Bihati,Blanquita,Bono,Bosco,Brigitta,Bwamble,Cindy-schwein,Cindy-troglodytes,Cindy-verus,Cleo,Clint,Coco-chimp,Damian,Desmond,Doris,Dzeeta,Frederike,Gamin,Hermien,Hortense,Ikuru,Jimmie,Julie-A959,Julie-LWC21,Kidongo,Koby,Kombote,Kosana,Koto,Kumbuka,Lara,Linda,Luky,Marlin,Maya,Mgbadolite,Mirinda,Nakuu,Natalie,Negrita,SeppToni,Taweh,Tibe,Tongo,Trixie,Ula,Vaillant,Vincent,Washu,Yogui,node_0,node_1,node_2,node_3
0,chr10_1572864,C3,C3,C3,C2,C2,C3,C3,C2,C1,C3,C3,C2,C2,C3,C3,C2,C3,C2,C3,C1,C2,C1,C3,C3,C1,C1,C3,C3,C3,C3,C2,C3,C1,C1,C3,C1,C2,C3,C3,C3,C2,C2,C2,C2,C1,C2,C3,C3,C3,C2,C2,C3,C3,C2,C2,C2,0.014522,0.091748,0.091748,0.067514
1,chr10_2097152,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.038838,0.269334,0.269334,0.115208
2,chr10_2621440,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.005674,0.012824,0.012824,0.006737
3,chr10_3145728,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.002904,0.010145,0.010145,0.006418
4,chr10_3670016,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C1,C2,C2,C1,C1,C2,C2,C2,C2,C2,C2,C1,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.00514,0.096248,0.096248,0.04808


In [67]:
chrX_complete_linkage_trees.head(5)

Unnamed: 0,window,Alice,Andromeda,Berta,Bihati,Blanquita,Cindy-schwein,Cindy-troglodytes,Cindy-verus,Cleo,Coco-chimp,Doris,Dzeeta,Frederike,Hermien,Hortense,Ikuru,Jimmie,Julie-A959,Julie-LWC21,Kidongo,Kombote,Kosana,Kumbuka,Lara,Linda,Luky,Marlin,Maya,Mirinda,Nakuu,Natalie,Negrita,Taweh,Tibe,Trixie,Ula,node_0,node_1,node_2,node_3
4269,chrX_4718592,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.000386870517,0.0942329238279999,0.0942329238279999,0.0244069347169999
4270,chrX_5242880,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C0,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.0,0.018755481251,0.018755481251,0.0144252191229999
4271,chrX_6815744,C1,C3,C3,C3,C3,C3,C3,C3,C3,C3,C3,C3,C3,C3,C3,C3,C1,C3,C3,C3,C3,C2,C3,C3,C1,C3,C3,C3,C3,C3,C2,C3,C3,C3,C3,C3,0.0004348450219999,0.0176649495409999,0.0176649495409999,0.014919062537
4272,chrX_9437184,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C0,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,0.0,0.0653681878299999,0.0653681878299999,0.020195340661
4273,chrX_11010048,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,7.667215399997929e-05,0.063108737967,0.063108737967,0.0414393167169999


Write a function that creates a cluster composition string. Clusters are delimited by a '/' and ordered by the length of their space-joined member string, which approximates ordering by increasing cluster size.

In [68]:
def generate_autosomal_clusters_lists(dataframe, window):
    row_index = dataframe[dataframe['window'] == window].index
    row = dataframe.loc[row_index,['Akwaya-Jean','Alfred','Alice','Andromeda','Athanga','Berta','Bihati','Blanquita','Bono','Bosco','Brigitta','Bwamble','Cindy-schwein','Cindy-troglodytes','Cindy-verus','Cleo','Clint','Coco-chimp','Damian','Desmond','Doris','Dzeeta','Frederike','Gamin','Hermien','Hortense','Ikuru','Jimmie','Julie-A959','Julie-LWC21','Kidongo','Koby','Kombote','Kosana','Koto','Kumbuka','Lara','Linda','Luky','Marlin','Maya','Mgbadolite','Mirinda','Nakuu','Natalie','Negrita','SeppToni','Taweh','Tibe','Tongo','Trixie','Ula','Vaillant','Vincent','Washu','Yogui']].values[0].tolist()
    cluster_values = sorted(list(set(row)))
    clusters_list = []
    single_clusters = 0
    
    for c in cluster_values:
        cluster_samples = dataframe.loc[row_index].apply(lambda row: row[row == c].index, axis = 1).values[0].tolist()
        if len(cluster_samples) == 1:
            single_clusters += 1
        cluster_samples = ' '.join(cluster_samples)
        clusters_list.append(cluster_samples)
    
    clusters_list.sort(key = len)
    cluster_composition = '/'.join(clusters_list)
    return len(clusters_list), cluster_composition, single_clusters

In [69]:
def generate_chrX_clusters_lists(dataframe, window):
    row_index = dataframe[dataframe['window'] == window].index
    row = dataframe.loc[row_index,['Alice','Andromeda','Berta','Bihati','Blanquita','Cindy-schwein','Cindy-troglodytes','Cindy-verus','Cleo','Coco-chimp','Doris','Dzeeta','Frederike','Hermien','Hortense','Ikuru','Jimmie','Julie-A959','Julie-LWC21','Kidongo','Kombote','Kosana','Kumbuka','Lara','Linda','Luky','Marlin','Maya','Mirinda','Nakuu','Natalie','Negrita','Taweh','Tibe','Trixie','Ula']].values[0].tolist()
    cluster_values = sorted(list(set(row)))
    clusters_list = []
    single_clusters = 0
    
    for c in cluster_values:
        cluster_samples = dataframe.loc[row_index].apply(lambda row: row[row == c].index, axis = 1).values[0].tolist()
        if len(cluster_samples) == 1:
            single_clusters += 1
        cluster_samples = ' '.join(cluster_samples)
        clusters_list.append(cluster_samples)
    
    clusters_list.sort(key = len)
    cluster_composition = '/'.join(clusters_list)
    return len(clusters_list), cluster_composition, single_clusters
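
The two functions above differ only in their sample lists, so the same logic can be written once. The sketch below is a hypothetical generalization, not the code used to produce the results. It takes a mapping from individual to cluster label and, unlike the functions above (which sort the joined strings with `key = len`), orders clusters by member count with ties broken alphabetically. The example assignments are made up for illustration.

```python
from collections import defaultdict

def cluster_composition(assignments):
    """Return (n_clusters, composition string, n_singleton_clusters)."""
    members = defaultdict(list)
    for individual, cluster in assignments.items():
        members[cluster].append(individual)

    # Order clusters by size, then alphabetically, so the string is deterministic
    groups = sorted((sorted(group) for group in members.values()),
                    key=lambda group: (len(group), group))
    composition = '/'.join(' '.join(group) for group in groups)
    n_singletons = sum(1 for group in groups if len(group) == 1)
    return len(groups), composition, n_singletons

# Hypothetical cluster assignments for one window
example = {'Alice': 'C1', 'Andromeda': 'C2', 'Berta': 'C1', 'Bihati': 'C2', 'Luky': 'C0'}
result = cluster_composition(example)
print(result)
```

Because the input is a plain mapping, the same function would serve both autosomal and chrX windows once a row is converted to a dictionary of sample columns.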

In [70]:
autosomal_complete_linkage_trees = autosomal_complete_linkage_trees.copy()  # explicit copy of the slice so the assignment below does not trigger a SettingWithCopyWarning
autosomal_complete_linkage_trees[['n_clusters','cluster_composition','single_clusters']] = pd.DataFrame(autosomal_complete_linkage_trees.apply(lambda row: generate_autosomal_clusters_lists(autosomal_complete_linkage_trees, row['window']), axis=1).tolist(), index=autosomal_complete_linkage_trees.index)

In [71]:
autosomal_complete_linkage_trees.head(5)

Unnamed: 0,window,Akwaya-Jean,Alfred,Alice,Andromeda,Athanga,Berta,Bihati,Blanquita,Bono,Bosco,Brigitta,Bwamble,Cindy-schwein,Cindy-troglodytes,Cindy-verus,Cleo,Clint,Coco-chimp,Damian,Desmond,Doris,Dzeeta,Frederike,Gamin,Hermien,Hortense,Ikuru,Jimmie,Julie-A959,Julie-LWC21,Kidongo,Koby,Kombote,Kosana,Koto,Kumbuka,Lara,Linda,Luky,Marlin,Maya,Mgbadolite,Mirinda,Nakuu,Natalie,Negrita,SeppToni,Taweh,Tibe,Tongo,Trixie,Ula,Vaillant,Vincent,Washu,Yogui,node_0,node_1,node_2,node_3,n_clusters,cluster_composition,single_clusters
0,chr10_1572864,C3,C3,C3,C2,C2,C3,C3,C2,C1,C3,C3,C2,C2,C3,C3,C2,C3,C2,C3,C1,C2,C1,C3,C3,C1,C1,C3,C3,C3,C3,C2,C3,C1,C1,C3,C1,C2,C3,C3,C3,C2,C2,C2,C2,C1,C2,C3,C3,C3,C2,C2,C3,C3,C2,C2,C2,0.014522,0.091748,0.091748,0.067514,3,Bono Desmond Dzeeta Hermien Hortense Kombote K...,0
1,chr10_2097152,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.038838,0.269334,0.269334,0.115208,2,Cindy-troglodytes Doris Luky Marlin/Akwaya-Jea...,0
2,chr10_2621440,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.005674,0.012824,0.012824,0.006737,2,Cindy-troglodytes Doris Luky Marlin/Akwaya-Jea...,0
3,chr10_3145728,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.002904,0.010145,0.010145,0.006418,2,Alice Cindy-verus Luky/Akwaya-Jean Alfred Andr...,0
4,chr10_3670016,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C1,C2,C2,C1,C1,C2,C2,C2,C2,C2,C2,C1,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.00514,0.096248,0.096248,0.04808,2,Bono Desmond Dzeeta Hermien Hortense Kombote K...,0


In [72]:
chrX_complete_linkage_trees[['n_clusters','cluster_composition','single_clusters']]= pd.DataFrame(chrX_complete_linkage_trees.apply(lambda row: generate_chrX_clusters_lists(chrX_complete_linkage_trees, row['window']), axis=1).tolist(), index=chrX_complete_linkage_trees.index)

In [73]:
chrX_complete_linkage_trees.head(5)

Unnamed: 0,window,Alice,Andromeda,Berta,Bihati,Blanquita,Cindy-schwein,Cindy-troglodytes,Cindy-verus,Cleo,Coco-chimp,Doris,Dzeeta,Frederike,Hermien,Hortense,Ikuru,Jimmie,Julie-A959,Julie-LWC21,Kidongo,Kombote,Kosana,Kumbuka,Lara,Linda,Luky,Marlin,Maya,Mirinda,Nakuu,Natalie,Negrita,Taweh,Tibe,Trixie,Ula,node_0,node_1,node_2,node_3,n_clusters,cluster_composition,single_clusters
4269,chrX_4718592,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.000386870517,0.0942329238279999,0.0942329238279999,0.0244069347169999,2,Alice Berta/Andromeda Bihati Blanquita Cindy-s...,0
4270,chrX_5242880,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C0,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.0,0.018755481251,0.018755481251,0.0144252191229999,3,Julie-A959/Doris Luky/Alice Andromeda Berta Bi...,1
4271,chrX_6815744,C1,C3,C3,C3,C3,C3,C3,C3,C3,C3,C3,C3,C3,C3,C3,C3,C1,C3,C3,C3,C3,C2,C3,C3,C1,C3,C3,C3,C3,C3,C2,C3,C3,C3,C3,C3,0.0004348450219999,0.0176649495409999,0.0176649495409999,0.014919062537,3,Kosana Natalie/Alice Jimmie Linda/Andromeda Be...,0
4272,chrX_9437184,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C0,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,0.0,0.0653681878299999,0.0653681878299999,0.020195340661,2,Luky/Alice Andromeda Berta Bihati Blanquita Ci...,1
4273,chrX_11010048,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,7.667215399997929e-05,0.063108737967,0.063108737967,0.0414393167169999,2,Andromeda Maya/Alice Berta Bihati Blanquita Ci...,0


Concatenate the autosomal and chrX dataframes.

In [74]:
complete_linkage_trees = pd.concat([autosomal_complete_linkage_trees, chrX_complete_linkage_trees])

In [75]:
len(complete_linkage_trees)

4420

Take a look at the number of clusters.

In [76]:
complete_linkage_trees.groupby(['n_clusters'])['window'].count().to_frame('N')

n_clusters,N
2,3622
3,748
4,46
5,4


Let's get a table of unique cluster compositions for each set.

In [77]:
autosomal_complete_summary = autosomal_complete_linkage_trees.groupby(['cluster_composition'])['window'].count().to_frame('N')
chrX_complete_summary = chrX_complete_linkage_trees.groupby(['cluster_composition'])['window'].count().to_frame('N')

In [78]:
len(autosomal_complete_summary)

2744

In [79]:
len(chrX_complete_summary)

104

In [80]:
autosomal_complete_summary.sort_values(by = 'N', ascending = False).head(20)

cluster_composition,N
Bono Desmond Dzeeta Hermien Hortense Kombote Kosana Kumbuka Natalie/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bosco Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Clint Coco-chimp Damian Doris Frederike Gamin Ikuru Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Koto Lara Linda Luky Marlin Maya Mgbadolite Mirinda Nakuu Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui,315
Luky/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bono Bosco Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Clint Coco-chimp Damian Desmond Doris Dzeeta Frederike Gamin Hermien Hortense Ikuru Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Kombote Kosana Koto Kumbuka Lara Linda Marlin Maya Mgbadolite Mirinda Nakuu Natalie Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui,33
Desmond/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bono Bosco Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Clint Coco-chimp Damian Doris Dzeeta Frederike Gamin Hermien Hortense Ikuru Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Kombote Kosana Koto Kumbuka Lara Linda Luky Marlin Maya Mgbadolite Mirinda Nakuu Natalie Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui,30
Doris/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bono Bosco Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Clint Coco-chimp Damian Desmond Dzeeta Frederike Gamin Hermien Hortense Ikuru Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Kombote Kosana Koto Kumbuka Lara Linda Luky Marlin Maya Mgbadolite Mirinda Nakuu Natalie Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui,29
Marlin/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bono Bosco Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Clint Coco-chimp Damian Desmond Doris Dzeeta Frederike Gamin Hermien Hortense Ikuru Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Kombote Kosana Koto Kumbuka Lara Linda Luky Maya Mgbadolite Mirinda Nakuu Natalie Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui,29
Koto/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bono Bosco Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Clint Coco-chimp Damian Desmond Doris Dzeeta Frederike Gamin Hermien Hortense Ikuru Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Kombote Kosana Kumbuka Lara Linda Luky Marlin Maya Mgbadolite Mirinda Nakuu Natalie Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui,21
Alfred/Akwaya-Jean Alice Andromeda Athanga Berta Bihati Blanquita Bono Bosco Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Clint Coco-chimp Damian Desmond Doris Dzeeta Frederike Gamin Hermien Hortense Ikuru Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Kombote Kosana Koto Kumbuka Lara Linda Luky Marlin Maya Mgbadolite Mirinda Nakuu Natalie Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui,21
Hortense/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bono Bosco Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Clint Coco-chimp Damian Desmond Doris Dzeeta Frederike Gamin Hermien Ikuru Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Kombote Kosana Koto Kumbuka Lara Linda Luky Marlin Maya Mgbadolite Mirinda Nakuu Natalie Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui,20
Cindy-troglodytes/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bono Bosco Brigitta Bwamble Cindy-schwein Cindy-verus Cleo Clint Coco-chimp Damian Desmond Doris Dzeeta Frederike Gamin Hermien Hortense Ikuru Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Kombote Kosana Koto Kumbuka Lara Linda Luky Marlin Maya Mgbadolite Mirinda Nakuu Natalie Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui,20
Lara/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bono Bosco Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Clint Coco-chimp Damian Desmond Doris Dzeeta Frederike Gamin Hermien Hortense Ikuru Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Kombote Kosana Koto Kumbuka Linda Luky Marlin Maya Mgbadolite Mirinda Nakuu Natalie Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui,19


Let's look for different topologies, starting with individually driven windows (IDWs), where a single individual is divergent from all others.

In [81]:
IDW_windows = complete_linkage_trees[(complete_linkage_trees['n_clusters'] == 2) & (complete_linkage_trees['single_clusters'] == 1)]

In [82]:
len(IDW_windows)

800

In [83]:
IDW_windows.head(5)

Unnamed: 0,window,Akwaya-Jean,Alfred,Alice,Andromeda,Athanga,Berta,Bihati,Blanquita,Bono,Bosco,Brigitta,Bwamble,Cindy-schwein,Cindy-troglodytes,Cindy-verus,Cleo,Clint,Coco-chimp,Damian,Desmond,Doris,Dzeeta,Frederike,Gamin,Hermien,Hortense,Ikuru,Jimmie,Julie-A959,Julie-LWC21,Kidongo,Koby,Kombote,Kosana,Koto,Kumbuka,Lara,Linda,Luky,Marlin,Maya,Mgbadolite,Mirinda,Nakuu,Natalie,Negrita,SeppToni,Taweh,Tibe,Tongo,Trixie,Ula,Vaillant,Vincent,Washu,Yogui,node_0,node_1,node_2,node_3,n_clusters,cluster_composition,single_clusters
11,chr10_8388608,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C0,C1,C1,C1,C1,C1,C1,C1,C1,C1,0.0,0.144765,0.144765,0.093802,2,SeppToni/Akwaya-Jean Alfred Alice Andromeda At...,1
15,chr10_10485760,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C0,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,0.0,0.056945,0.056945,0.01485,2,Desmond/Akwaya-Jean Alfred Alice Andromeda Ath...,1
16,chr10_11010048,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C0,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,0.0,0.019138,0.019138,0.011459,2,Desmond/Akwaya-Jean Alfred Alice Andromeda Ath...,1
29,chr10_20971520,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C0,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,0.0,0.024704,0.024704,0.01044,2,Kidongo/Akwaya-Jean Alfred Alice Andromeda Ath...,1
30,chr10_21495808,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C0,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,0.0,0.017877,0.017877,0.005218,2,Kidongo/Akwaya-Jean Alfred Alice Andromeda Ath...,1


What is the distribution of individuals across the single-individual clusters?

In [84]:
IDW_individuals = IDW_windows['window'].copy().to_frame('window')
IDW_individuals['IDW_ind'] = IDW_windows['cluster_composition'].str.split('/').str[0]
IDW_individuals.head(5)

Unnamed: 0,window,IDW_ind
11,chr10_8388608,SeppToni
15,chr10_10485760,Desmond
16,chr10_11010048,Desmond
29,chr10_20971520,Kidongo
30,chr10_21495808,Kidongo
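
As a minimal sketch of the two steps above, here is the same filter and string split on a toy dataframe. The window names, individuals, and values are made up for illustration. It assumes, as the tables above suggest, that `single_clusters` counts singleton clusters and that the smallest cluster is listed first in `cluster_composition`.

```python
import pandas as pd

# Toy stand-in for complete_linkage_trees; all values are illustrative only.
toy_trees = pd.DataFrame({
    'window': ['chrA_0', 'chrA_1', 'chrA_2', 'chrA_3'],
    'n_clusters': [2, 2, 3, 2],
    'cluster_composition': ['X/W Y Z', 'X Y/W Z', 'X/Y/W Z', 'Z/W X Y'],
    'single_clusters': [1, 0, 1, 1],
})

# An IDW has exactly two clusters, one of which is a singleton.
is_idw = (toy_trees['n_clusters'] == 2) & (toy_trees['single_clusters'] == 1)
toy_idw = toy_trees[is_idw]

# The singleton is listed first, so the text before '/' names the individual.
toy_idw_ind = toy_idw['cluster_composition'].str.split('/').str[0]
print(toy_idw_ind.tolist())
```

The three-cluster window `chrA_2` is excluded even though it contains a singleton, because the filter requires exactly two clusters.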


In [85]:
IDW_individuals_summary = IDW_individuals.groupby(['IDW_ind']).size().to_frame('N')
IDW_individuals_summary.head(5)

Unnamed: 0_level_0,N
IDW_ind,Unnamed: 1_level_1
Akwaya-Jean,16
Alfred,21
Alice,9
Andromeda,12
Athanga,13


In [86]:
len(IDW_individuals_summary)

56

In [87]:
IDW_individuals_summary[IDW_individuals_summary['N'] == IDW_individuals_summary['N'].min()]

Unnamed: 0_level_0,N
IDW_ind,Unnamed: 1_level_1
Mgbadolite,1


In [88]:
IDW_individuals_summary[IDW_individuals_summary['N'] == IDW_individuals_summary['N'].max()]

Unnamed: 0_level_0,N
IDW_ind,Unnamed: 1_level_1
Luky,34


In [89]:
IDW_individuals_summary.mean()

N    14.285714
dtype: float64

There is some inter-individual variation in the number of windows where an individual is highly divergent. These windows may also vary considerably in their maximum divergence, so let's assess the range of maximum divergence per individual. We will also confirm that the IDW individual is a member of the most divergent pair in each window.

In [90]:
maxes_with_pair_IDs = comparisons.loc[comparisons.groupby('window')['divergence'].idxmax(), ['window', 'ind1', 'ind2', 'divergence']]
IDW_maxes_with_pair_IDs = maxes_with_pair_IDs[maxes_with_pair_IDs['window'].isin(IDW_windows['window'])]
IDW_maxes_with_pair_IDs = IDW_maxes_with_pair_IDs.merge(IDW_individuals, on = 'window')
IDW_maxes_with_pair_IDs.head(5)

Unnamed: 0,window,ind1,ind2,divergence,IDW_ind
0,chr10_100663296,Alfred,Lara,0.101497,Alfred
1,chr10_102236160,Frederike,Lara,0.031867,Lara
2,chr10_102760448,Bono,Bwamble,0.082444,Bwamble
3,chr10_10485760,Desmond,Julie-A959,0.056945,Desmond
4,chr10_106954752,Kumbuka,Washu,0.340326,Kumbuka
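
The `idxmax` pattern above can be sketched on a toy table of pairwise comparisons. The individuals and scores here are hypothetical. Note that when two pairs tie for the maximum, `idxmax` returns the first occurrence.

```python
import pandas as pd

# Toy pairwise comparisons; individuals and scores are made up.
toy_comparisons = pd.DataFrame({
    'window': ['w1', 'w1', 'w1', 'w2', 'w2', 'w2'],
    'ind1': ['A', 'A', 'B', 'A', 'A', 'B'],
    'ind2': ['B', 'C', 'C', 'B', 'C', 'C'],
    'divergence': [0.01, 0.20, 0.19, 0.05, 0.02, 0.04],
})

# idxmax gives the row label of the largest divergence within each window;
# .loc then pulls those rows so the pair identities travel with the score.
top_rows = toy_comparisons.groupby('window')['divergence'].idxmax()
toy_maxes = toy_comparisons.loc[top_rows, ['window', 'ind1', 'ind2', 'divergence']]
print(toy_maxes)
```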


Let's check if there are any rows where the IDW individual does not match the most divergent comparison.

In [91]:
IDW_maxes_with_pair_IDs = IDW_maxes_with_pair_IDs[(IDW_maxes_with_pair_IDs['IDW_ind'] == IDW_maxes_with_pair_IDs['ind1']) | (IDW_maxes_with_pair_IDs['IDW_ind'] == IDW_maxes_with_pair_IDs['ind2'])]

In [92]:
len(IDW_maxes_with_pair_IDs)

800
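
An alternative to filtering and comparing lengths is to keep the boolean mask and inspect any failing rows directly. This is only a sketch on a toy dataframe with hypothetical individuals, not the check used above.

```python
import pandas as pd

# Toy table with the same columns as IDW_maxes_with_pair_IDs (values made up).
toy = pd.DataFrame({
    'window': ['w1', 'w2', 'w3'],
    'ind1': ['A', 'B', 'C'],
    'ind2': ['D', 'E', 'F'],
    'IDW_ind': ['A', 'E', 'Z'],
})

# True where the IDW individual is one member of the most divergent pair.
in_top_pair = (toy['IDW_ind'] == toy['ind1']) | (toy['IDW_ind'] == toy['ind2'])

# Rows that fail the check, kept for inspection rather than silently dropped.
mismatches = toy[~in_top_pair]
print(mismatches)
```

An empty `mismatches` dataframe would correspond to the result above, where all 800 rows pass.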

Everything looks good: all 800 windows are retained. Let's identify the minimum, mean, and maximum divergence score per individual in windows where they are highly divergent.

In [93]:
IDW_maxes_with_pair_IDs.head(5)

Unnamed: 0,window,ind1,ind2,divergence,IDW_ind
0,chr10_100663296,Alfred,Lara,0.101497,Alfred
1,chr10_102236160,Frederike,Lara,0.031867,Lara
2,chr10_102760448,Bono,Bwamble,0.082444,Bwamble
3,chr10_10485760,Desmond,Julie-A959,0.056945,Desmond
4,chr10_106954752,Kumbuka,Washu,0.340326,Kumbuka


In [94]:
IDW_n_per_ind = IDW_maxes_with_pair_IDs.groupby(['IDW_ind'])['divergence'].count()
IDW_min_per_ind = IDW_maxes_with_pair_IDs.groupby(['IDW_ind'])['divergence'].min()
IDW_mean_per_ind = IDW_maxes_with_pair_IDs.groupby(['IDW_ind'])['divergence'].mean()
IDW_max_per_ind = IDW_maxes_with_pair_IDs.groupby(['IDW_ind'])['divergence'].max()

In [95]:
IDW_stats_per_ind = pd.DataFrame({'N':IDW_n_per_ind, 'min':IDW_min_per_ind, 'mean':IDW_mean_per_ind, 'max':IDW_max_per_ind}).reset_index()
IDW_stats_per_ind.head(5)

Unnamed: 0,IDW_ind,N,min,mean,max
0,Akwaya-Jean,16,0.011237,0.049478,0.20438
1,Alfred,21,0.004539,0.055706,0.201247
2,Alice,9,0.008469,0.118712,0.502525
3,Andromeda,12,0.013257,0.050905,0.107627
4,Athanga,13,0.006031,0.126002,0.463006
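
The four separate `groupby` calls above could also be collapsed into one call using pandas named aggregation. This is a sketch on toy data with made-up values. It should give the same columns, but I have not run it on the real dataframe.

```python
import pandas as pd

# Toy divergence maxima for two hypothetical individuals.
toy = pd.DataFrame({
    'IDW_ind': ['A', 'A', 'B', 'B', 'B'],
    'divergence': [0.10, 0.30, 0.02, 0.04, 0.06],
})

# One pass over the groups; each keyword becomes an output column.
toy_stats = (
    toy.groupby('IDW_ind')['divergence']
       .agg(N='count', min='min', mean='mean', max='max')
       .reset_index()
)
print(toy_stats)
```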


Map each individual to its lineage for plotting.

In [96]:
lineage_dict = {'Bono':'ppn','Desmond':'ppn','Dzeeta':'ppn','Hermien':'ppn','Hortense':'ppn','Kombote':'ppn','Kosana':'ppn','Kumbuka':'ppn','Natalie':'ppn',
                'Akwaya-Jean':'pte', 'Damian':'pte', 'Julie-LWC21':'pte', 'Koto':'pte', 'Taweh':'pte',
                'Andromeda':'pts','Athanga':'pts','Bihati':'pts','Bwamble':'pts','Cindy-schwein':'pts','Cleo':'pts','Coco-chimp':'pts','Frederike':'pts','Ikuru':'pts','Kidongo':'pts','Maya':'pts','Mgbadolite':'pts','Nakuu':'pts','Tongo':'pts','Trixie':'pts','Vincent':'pts','Washu':'pts',
                'Alfred':'ptt','Blanquita':'ptt','Brigitta':'ptt','Cindy-troglodytes':'ptt','Doris':'ptt','Gamin':'ptt','Julie-A959':'ptt','Lara':'ptt','Luky':'ptt','Marlin':'ptt','Mirinda':'ptt','Negrita':'ptt','Tibe':'ptt','Ula':'ptt','Vaillant':'ptt','Yogui':'ptt',
                'Alice':'ptv','Berta':'ptv','Bosco':'ptv','Cindy-verus':'ptv','Clint':'ptv','Jimmie':'ptv','Koby':'ptv','Linda':'ptv','SeppToni':'ptv'}

In [97]:
IDW_stats_per_ind['lineage'] = IDW_stats_per_ind['IDW_ind'].map(lineage_dict)
IDW_stats_per_ind = IDW_stats_per_ind[['IDW_ind','lineage','N','min','mean','max']]
IDW_stats_per_ind.head(5)

Unnamed: 0,IDW_ind,lineage,N,min,mean,max
0,Akwaya-Jean,pte,16,0.011237,0.049478,0.20438
1,Alfred,ptt,21,0.004539,0.055706,0.201247
2,Alice,ptv,9,0.008469,0.118712,0.502525
3,Andromeda,pts,12,0.013257,0.050905,0.107627
4,Athanga,pts,13,0.006031,0.126002,0.463006


In [98]:
IDW_stats_per_ind['lineage'].unique()

array(['pte', 'ptt', 'ptv', 'pts', 'ppn'], dtype=object)
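
The `unique()` call above doubles as a sanity check, because `Series.map` returns `NaN` for any name missing from the dictionary. A more explicit version of that check is sketched below on a toy mapping with hypothetical names.

```python
import pandas as pd

# Toy mapping that deliberately omits individual 'C'.
toy_lineages = {'A': 'ppn', 'B': 'ptt'}
toy = pd.DataFrame({'IDW_ind': ['A', 'B', 'C']})

toy['lineage'] = toy['IDW_ind'].map(toy_lineages)

# Names that did not receive a lineage label.
unmapped = toy.loc[toy['lineage'].isna(), 'IDW_ind'].tolist()
print(unmapped)
```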

In [99]:
len(IDW_stats_per_ind[IDW_stats_per_ind['max'] > 0.05])

53

In [100]:
len(IDW_stats_per_ind[IDW_stats_per_ind['max'] > 0.25])

28

In [101]:
28/56

0.5

Save the dataframe for plotting and inclusion in supplemental data.

In [102]:
IDW_stats_per_ind.to_csv('window_topologies/IDW_stats_per_ind.txt', sep = '\t', header = True, index = False)

What about windows with clustering by species?

In [103]:
ppn_pt_windows = complete_linkage_trees[(complete_linkage_trees['cluster_composition'] == 'Bono Desmond Dzeeta Hermien Hortense Kombote Kosana Kumbuka Natalie/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bosco Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Clint Coco-chimp Damian Doris Frederike Gamin Ikuru Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Koto Lara Linda Luky Marlin Maya Mgbadolite Mirinda Nakuu Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui') | (complete_linkage_trees['cluster_composition'] == 'Dzeeta Hermien Hortense Kombote Kosana Kumbuka Natalie/Alice Andromeda Berta Bihati Blanquita Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Coco-chimp Doris Frederike Ikuru Jimmie Julie-A959 Julie-LWC21 Kidongo Lara Linda Luky Marlin Maya Mirinda Nakuu Negrita Taweh Tibe Trixie Ula')]

In [104]:
len(ppn_pt_windows)

339

What about windows where one chimpanzee subspecies clusters separately?

Central chimpanzees?

In [105]:
len(complete_linkage_trees[(complete_linkage_trees['cluster_composition'] == 'Alfred Blanquita Brigitta Cindy-troglodytes Doris Gamin Julie_A959 Lara Luky Marlin Mirinda Negrita Tibe Ula Vaillant Yogui/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bono Bosco Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Clint Coco-chimp Damian Desmond Doris Dzeeta Frederike Gamin Hermien Hortense Ikuru Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Kombote Kosana Koto Kumbuka Lara Linda Luky Marlin Maya Mgbadolite Mirinda Nakuu Natalie Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui') | (complete_linkage_trees['cluster_composition'] == 'Blanquita Cindy-troglodytes Doris Julie_A959 Lara Luky Marlin Mirinda Negrita Tibe Ula/Alice Andromeda Berta Bihati Cindy-schwein Cindy-verus Cleo Coco-chimp Dzeeta Frederike Hermien Hortense Jimmie Julie-LWC21 Kidongo Kombote Kosana Kumbuka Linda Maya Nakuu Natalie Taweh Trixie')])

0

Eastern chimpanzees?

In [106]:
len(complete_linkage_trees[(complete_linkage_trees['cluster_composition'] == 'Andromeda Athanga Bihati Cindy-schwein Cleo Coco-chimp Frederike Ikuru Kidongo Maya Mgbadolite Nakuu Tongo Trixie Vincent Washu/Akwaya-Jean Alfred Alice Berta Blanquita Bono Bosco Brigitta Cindy-troglodytes Cindy-verus Clint Damian Desmond Doris Dzeeta Gamin Hermien Hortense Jimmie Julie-A959 Julie-LWC21 Koby Kombote Kosana Koto Kumbuka Lara Linda Luky Marlin Mirinda Natalie Negrita SeppToni Taweh Tibe Ula Vaillant Yogui') | (complete_linkage_trees['cluster_composition'] == 'Andromeda Bihati Cindy_schwein Cleo Coco-chimp Frederike Ikuru Kidongo Maya Nakuu Trixie/Alice Berta Blanquita Cindy-troglodytes Cindy-verus Doris Dzeeta Hermien Hortense Jimmie Julie-A959 Julie-LWC21 Kombote Kosana Kumbuka Lara Linda Luky Marlin Mirinda Natalie Negrita Taweh Tibe Ula')])

0

Nigeria-Cameroon chimpanzees?

In [107]:
len(complete_linkage_trees[(complete_linkage_trees['cluster_composition'] == 'Akwaya-Jean Damian Julie-LWC21 Koto Taweh/Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bono Bosco Brigitta Bwamble Cindy_schwein Cindy_troglodytes Cindy_verus Cleo Clint Coco_chimp Desmond Doris Dzeeta Frederike Gamin Hermien Hortense Ikuru Jimmie Julie-A959 Kidongo Koby Kombote Kosana Kumbuka Lara Linda Luky Marlin Maya Mgbadolite Mirinda Nakuu Natalie Negrita SeppToni Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui') | (complete_linkage_trees['cluster_composition'] == 'Julie-LWC21/Alice Andromeda Berta Bihati Blanquita Bono Cindy_schwein Cindy_troglodytes Cindy_verus Cleo Coco_chimp Doris Dzeeta Frederike Hermien Hortense Ikuru Jimmie Kidongo Kombote Kosana Kumbuka Lara Linda Luky Marlin Maya Mirinda Nakuu Natalie Negrita Tibe Trixie Ula')])

0

Western chimpanzees?

In [108]:
len(complete_linkage_trees[(complete_linkage_trees['cluster_composition'] == 'Alice Berta Bosco Cindy-verus Clint Jimmie Koby Linda SeppToni/Akwaya-Jean Alfred Andromeda Athanga Bihati Blanquita Bono Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cleo Coco-chimp Damian Desmond Doris Dzeeta Frederike Gamin Hermien Hortense Ikuru Julie-A959 Julie-LWC21 Kidongo Kombote Kosana Koto Kumbuka Lara Luky Marlin Maya Mgbadolite Mirinda Nakuu Natalie Negrita Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui') | (complete_linkage_trees['cluster_composition'] == 'Alice Berta Cindy-verus Jimmie Linda/Andromeda Bihati Blanquita Cindy-schwein Cindy-troglodytes Cleo Coco-chimp Doris Dzeeta Frederike Hermien Hortense Ikuru Julie-A959 Julie-LWC21 Kidongo Kombote Kosana Kumbuka Lara Luky Marlin Maya Mirinda Nakuu Natalie Negrita Taweh Tibe Trixie Ula')])

8
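
Exact string matching on `cluster_composition` is sensitive to spelling (for example, hyphens versus underscores in names) and to exactly which individuals are listed on each side of the `/`. One possible alternative is to compare clusters as sets built from a lineage mapping. The helper names `parse_composition` and `separates_group` are hypothetical, and the toy individuals are made up. `everyone` should hold the individuals present in the dataframe being queried, which may differ between the autosomes and the X chromosome.

```python
def parse_composition(composition):
    """Split 'A B/C D' into a set of frozensets, one per cluster."""
    return {frozenset(cluster.split()) for cluster in composition.split('/')}


def separates_group(composition, group, everyone):
    """True if the window has exactly two clusters: `group` and everyone else."""
    group = frozenset(group)
    rest = frozenset(everyone) - group
    return parse_composition(composition) == {group, rest}


# Toy lineage mapping standing in for lineage_dict.
toy_lineages = {'A': 'ptv', 'B': 'ptv', 'C': 'ptt', 'D': 'ptt', 'E': 'ppn'}
everyone = set(toy_lineages)
ptv = {name for name, lineage in toy_lineages.items() if lineage == 'ptv'}

hit = separates_group('A B/C D E', ptv, everyone)   # ptv vs. all others
miss = separates_group('A/B C D E', ptv, everyone)  # only one ptv individual split off
print(hit, miss)
```

With the real data, this could be applied with something like `complete_linkage_trees['cluster_composition'].apply(separates_group, args=(group, everyone))`.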

Let's look at these eight Western chimpanzee windows.

In [109]:
western_chimpanzee_divergent_windows = complete_linkage_trees[(complete_linkage_trees['cluster_composition'] == 'Alice Berta Bosco Cindy-verus Clint Jimmie Koby Linda SeppToni/Akwaya-Jean Alfred Andromeda Athanga Bihati Blanquita Bono Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cleo Coco-chimp Damian Desmond Doris Dzeeta Frederike Gamin Hermien Hortense Ikuru Julie-A959 Julie-LWC21 Kidongo Kombote Kosana Koto Kumbuka Lara Luky Marlin Maya Mgbadolite Mirinda Nakuu Natalie Negrita Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui') | (complete_linkage_trees['cluster_composition'] == 'Alice Berta Cindy-verus Jimmie Linda/Andromeda Bihati Blanquita Cindy-schwein Cindy-troglodytes Cleo Coco-chimp Doris Dzeeta Frederike Hermien Hortense Ikuru Julie-A959 Julie-LWC21 Kidongo Kombote Kosana Kumbuka Lara Luky Marlin Maya Mirinda Nakuu Natalie Negrita Taweh Tibe Trixie Ula')]
western_chimpanzee_divergent_windows

Unnamed: 0,window,Akwaya-Jean,Alfred,Alice,Andromeda,Athanga,Berta,Bihati,Blanquita,Bono,Bosco,Brigitta,Bwamble,Cindy-schwein,Cindy-troglodytes,Cindy-verus,Cleo,Clint,Coco-chimp,Damian,Desmond,Doris,Dzeeta,Frederike,Gamin,Hermien,Hortense,Ikuru,Jimmie,Julie-A959,Julie-LWC21,Kidongo,Koby,Kombote,Kosana,Koto,Kumbuka,Lara,Linda,Luky,Marlin,Maya,Mgbadolite,Mirinda,Nakuu,Natalie,Negrita,SeppToni,Taweh,Tibe,Tongo,Trixie,Ula,Vaillant,Vincent,Washu,Yogui,node_0,node_1,node_2,node_3,n_clusters,cluster_composition,single_clusters
998,chr15_28835840,C2,C2,C1,C2,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.000993,0.007565,0.007565,0.004479,2,Alice Berta Bosco Cindy-verus Clint Jimmie Kob...,0
1020,chr15_40370176,C2,C2,C1,C2,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.002281,0.009507,0.009507,0.006044,2,Alice Berta Bosco Cindy-verus Clint Jimmie Kob...,0
1798,chr1_210239488,C2,C2,C1,C2,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.00071,0.019751,0.019751,0.013096,2,Alice Berta Bosco Cindy-verus Clint Jimmie Kob...,0
2708,chr3_169345024,C2,C2,C1,C2,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.002744,0.014599,0.014599,0.007954,2,Alice Berta Bosco Cindy-verus Clint Jimmie Kob...,0
3153,chr5_44564480,C2,C2,C1,C2,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.000428,0.011706,0.011706,0.004777,2,Alice Berta Bosco Cindy-verus Clint Jimmie Kob...,0
3154,chr5_45088768,C2,C2,C1,C2,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.000929,0.023077,0.023077,0.014528,2,Alice Berta Bosco Cindy-verus Clint Jimmie Kob...,0
3425,chr6_52428800,C2,C2,C1,C2,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.001113,0.009093,0.009093,0.003563,2,Alice Berta Bosco Cindy-verus Clint Jimmie Kob...,0
3426,chr6_52953088,C2,C2,C1,C2,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.006607,0.030093,0.030093,0.016993,2,Alice Berta Bosco Cindy-verus Clint Jimmie Kob...,0


Thus far, we have identified 800 IDWs, 339 species-divergent windows, and 8 Western chimpanzee-divergent windows. However, this still leaves 2475 two-cluster windows: 3622 - (800 + 339 + 8). Based on the autosomal and X chromosome dataframes above, the remaining two-cluster windows appear to be topologies where a smaller cluster of individuals from one or more lineages groups together.

In [110]:
MDW_windows = complete_linkage_trees[complete_linkage_trees['n_clusters'] == 2]
windows_to_exclude = pd.concat([IDW_windows, ppn_pt_windows, western_chimpanzee_divergent_windows])['window']
MDW_windows = MDW_windows[~MDW_windows['window'].isin(windows_to_exclude)]

In [111]:
len(MDW_windows)

2475
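
The arithmetic behind that count can be written out explicitly using the numbers reported in the cells above. This assumes the three categories do not overlap, which holds here because each is defined by a distinct cluster composition.

```python
n_two_cluster = 3622   # two-cluster windows from the n_clusters table
n_idw = 800            # individually driven windows
n_species = 339        # bonobo vs. chimpanzee windows
n_western = 8          # Western chimpanzee windows

# Two-cluster windows not explained by any of the three topologies.
n_remaining = n_two_cluster - (n_idw + n_species + n_western)
print(n_remaining)
```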

What is the range of size and lineage composition of the smaller clusters in the MDWs?

In [112]:
MDW_windows_small_cluster = MDW_windows['cluster_composition'].str.split('/').apply(lambda x: x[0].split())
MDW_windows_small_cluster

1                [Cindy-troglodytes, Doris, Luky, Marlin]
2                [Cindy-troglodytes, Doris, Luky, Marlin]
3                              [Alice, Cindy-verus, Luky]
6       [Alfred, Andromeda, Bono, Coco-chimp, Doris, D...
8       [Bono, Desmond, Dzeeta, Hermien, Hortense, Kos...
                              ...                        
4411                                    [Dzeeta, Kumbuka]
4414                              [Bihati, Cindy-schwein]
4415    [Cindy-troglodytes, Luky, Marlin, Negrita, Tib...
4418    [Blanquita, Dzeeta, Hermien, Hortense, Kombote...
4419    [Andromeda, Cindy-schwein, Cindy-troglodytes, ...
Name: cluster_composition, Length: 2475, dtype: object

In [113]:
MDW_windows_small_cluster_counts = MDW_windows_small_cluster.apply(lambda x: len(x)).to_frame('small_cluster_N')
MDW_windows_small_cluster_counts_summary = MDW_windows_small_cluster_counts['small_cluster_N'].value_counts().to_frame('N').reset_index().rename(columns = {'index':'small_cluster_N'}).sort_values(by=['small_cluster_N'])
MDW_windows_small_cluster_counts_summary

Unnamed: 0,small_cluster_N,N
0,2,436
1,3,264
2,4,166
3,5,154
4,6,122
8,7,98
7,8,101
11,9,68
5,10,116
6,11,102
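
A slightly more compact way to build this kind of summary is sketched below on toy cluster sizes. It names the index before resetting it, so it does not rely on the default `index` column name, and the final `reset_index(drop=True)` gives a clean row index after sorting.

```python
import pandas as pd

# Toy cluster sizes; values are made up.
sizes = pd.Series([2, 2, 3, 5, 3, 2], name='small_cluster_N')

summary = (
    sizes.value_counts()
         .rename_axis('small_cluster_N')   # label the index explicitly
         .reset_index(name='N')            # counts become column 'N'
         .sort_values('small_cluster_N')
         .reset_index(drop=True)
)
print(summary)
```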


In [114]:
MDW_windows_small_cluster_counts.median()

small_cluster_N    7.0
dtype: float64

In [115]:
MDW_windows_small_cluster_counts_summary.to_csv('window_topologies/MDW_windows_small_cluster_counts_summary.txt', sep = '\t', header = True, index = False)

In [116]:
MDW_windows_small_cluster_composition = MDW_windows_small_cluster.map(lambda x: [lineage_dict.get(item, item) for item in x]).to_frame('composition')
MDW_windows_small_cluster_composition.head(5)

Unnamed: 0,composition
1,"[ptt, ptt, ptt, ptt]"
2,"[ptt, ptt, ptt, ptt]"
3,"[ptv, ptv, ptt]"
6,"[ptt, pts, ppn, pts, ptt, ppn, ptt, ppn, ppn, ..."
8,"[ppn, ppn, ppn, ppn, ppn, ppn, ppn, ppn]"


In [117]:
pan_lineages = ['ppn','pte','pts','ptt','ptv']
for lineage in pan_lineages:
    MDW_windows_small_cluster_composition[lineage + '_present'] = MDW_windows_small_cluster_composition['composition'].apply(lambda x: lineage in x)
MDW_windows_small_cluster_composition.head(5)

Unnamed: 0,composition,ppn_present,pte_present,pts_present,ptt_present,ptv_present
1,"[ptt, ptt, ptt, ptt]",False,False,False,True,False
2,"[ptt, ptt, ptt, ptt]",False,False,False,True,False
3,"[ptv, ptv, ptt]",False,False,False,True,True
6,"[ptt, pts, ppn, pts, ptt, ppn, ptt, ppn, ppn, ...",True,True,True,True,False
8,"[ppn, ppn, ppn, ppn, ppn, ppn, ppn, ppn]",True,False,False,False,False
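
Here is a toy version of the lineage presence flags, with an extra column counting how many lineages appear in each small cluster. The names and mapping are hypothetical, and `n_lineages` is an illustrative addition that is not part of the saved output.

```python
import pandas as pd

# Toy mapping and small clusters; names are made up.
toy_lineages = {'A': 'ppn', 'B': 'ptt', 'C': 'ptt', 'D': 'ptv'}
clusters = pd.Series([['B', 'C'], ['A', 'D'], ['D']])

# Replace each name with its lineage label (unknown names pass through unchanged).
composition = clusters.map(lambda names: [toy_lineages.get(n, n) for n in names])

# One boolean column per lineage.
flags = pd.DataFrame({
    lineage + '_present': composition.map(lambda labels, lineage=lineage: lineage in labels)
    for lineage in ['ppn', 'ptt', 'ptv']
})

# Number of distinct lineages represented in each small cluster.
n_lineages = flags.sum(axis=1)
print(flags.assign(n_lineages=n_lineages))
```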



In [118]:
MDW_windows_small_cluster_composition[['ppn_present','pte_present','pts_present','ptt_present','ptv_present']].to_csv('window_topologies/MDW_windows_small_cluster_composition.txt', sep = '\t', header = True, index = False)

Now let's look at the topologies along an entire chromosome for Figure 2.

In [119]:
autosomal_complete_linkage_trees[autosomal_complete_linkage_trees['window'].str.startswith('chr21_')]

Unnamed: 0,window,Akwaya-Jean,Alfred,Alice,Andromeda,Athanga,Berta,Bihati,Blanquita,Bono,Bosco,Brigitta,Bwamble,Cindy-schwein,Cindy-troglodytes,Cindy-verus,Cleo,Clint,Coco-chimp,Damian,Desmond,Doris,Dzeeta,Frederike,Gamin,Hermien,Hortense,Ikuru,Jimmie,Julie-A959,Julie-LWC21,Kidongo,Koby,Kombote,Kosana,Koto,Kumbuka,Lara,Linda,Luky,Marlin,Maya,Mgbadolite,Mirinda,Nakuu,Natalie,Negrita,SeppToni,Taweh,Tibe,Tongo,Trixie,Ula,Vaillant,Vincent,Washu,Yogui,node_0,node_1,node_2,node_3,n_clusters,cluster_composition,single_clusters
1915,chr21_1048576,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.007273,0.101743,0.101743,0.018713,2,Cindy-troglodytes Kosana/Akwaya-Jean Alfred Al...,0
1916,chr21_1572864,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.009908,0.092433,0.092433,0.030027,2,Cindy-troglodytes Kosana/Akwaya-Jean Alfred Al...,0
1917,chr21_2097152,C3,C1,C1,C2,C3,C1,C3,C3,C3,C1,C1,C1,C3,C3,C3,C2,C1,C2,C3,C3,C3,C3,C2,C3,C3,C3,C2,C1,C2,C2,C2,C1,C3,C3,C3,C3,C2,C1,C3,C3,C1,C2,C3,C3,C3,C1,C1,C3,C1,C1,C2,C1,C3,C2,C3,C1,0.008533,0.016795,0.016795,0.013133,3,Andromeda Cleo Coco-chimp Frederike Ikuru Juli...,0
1918,chr21_2621440,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C1,C2,C2,C1,C1,C2,C2,C2,C2,C2,C2,C1,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.001896,0.012374,0.012374,0.008402,2,Bono Desmond Dzeeta Hermien Hortense Kombote K...,0
1919,chr21_3145728,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C0,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,0.0,0.05812,0.05812,0.038627,2,Hortense/Akwaya-Jean Alfred Alice Andromeda At...,1
1920,chr21_3670016,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C1,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C1,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.005336,0.025069,0.025069,0.00785,2,Bono Desmond Dzeeta Hortense Kombote Kosana Ku...,0
1921,chr21_4194304,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,0.001153,0.017037,0.017037,0.009157,2,Blanquita Cindy-troglodytes Ula/Akwaya-Jean Al...,0
1922,chr21_4718592,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C0,C1,C1,C1,C1,C1,C1,C1,C1,C1,0.0,0.056423,0.056423,0.021865,2,SeppToni/Akwaya-Jean Alfred Alice Andromeda At...,1
1923,chr21_5242880,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C0,C1,C1,C1,C1,C1,C1,C1,C1,C1,0.0,0.119308,0.119308,0.041856,2,SeppToni/Akwaya-Jean Alfred Alice Andromeda At...,1
1924,chr21_5767168,C1,C1,C1,C1,C1,C1,C2,C1,C1,C1,C1,C2,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C2,C1,C1,C1,C1,C1,C2,C1,C1,C1,C1,C1,C1,C1,C1,C1,C0,C1,C1,C2,C1,C1,C1,C1,C1,C1,C1,C2,C1,C1,C1,C2,C2,C1,0.049416,0.123899,0.123899,0.108459,3,Luky/Bihati Bwamble Frederike Julie-A959 Mgbad...,1


And here are the species-divergent windows for this chromosome.

In [120]:
autosomal_complete_linkage_trees[(autosomal_complete_linkage_trees['window'].str.startswith('chr21_')) & (autosomal_complete_linkage_trees['cluster_composition'] == 'Bono Desmond Dzeeta Hermien Hortense Kombote Kosana Kumbuka Natalie/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bosco Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Clint Coco-chimp Damian Doris Frederike Gamin Ikuru Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Koto Lara Linda Luky Marlin Maya Mgbadolite Mirinda Nakuu Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui')]

Unnamed: 0,window,Akwaya-Jean,Alfred,Alice,Andromeda,Athanga,Berta,Bihati,Blanquita,Bono,Bosco,Brigitta,Bwamble,Cindy-schwein,Cindy-troglodytes,Cindy-verus,Cleo,Clint,Coco-chimp,Damian,Desmond,Doris,Dzeeta,Frederike,Gamin,Hermien,Hortense,Ikuru,Jimmie,Julie-A959,Julie-LWC21,Kidongo,Koby,Kombote,Kosana,Koto,Kumbuka,Lara,Linda,Luky,Marlin,Maya,Mgbadolite,Mirinda,Nakuu,Natalie,Negrita,SeppToni,Taweh,Tibe,Tongo,Trixie,Ula,Vaillant,Vincent,Washu,Yogui,node_0,node_1,node_2,node_3,n_clusters,cluster_composition,single_clusters
1918,chr21_2621440,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C1,C2,C2,C1,C1,C2,C2,C2,C2,C2,C2,C1,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.001896,0.012374,0.012374,0.008402,2,Bono Desmond Dzeeta Hermien Hortense Kombote K...,0
1935,chr21_11534336,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C1,C2,C2,C1,C1,C2,C2,C2,C2,C2,C2,C1,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.000866,0.006946,0.006946,0.004649,2,Bono Desmond Dzeeta Hermien Hortense Kombote K...,0
1948,chr21_19398656,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C1,C2,C2,C1,C1,C2,C2,C2,C2,C2,C2,C1,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.00329,0.006391,0.006391,0.004359,2,Bono Desmond Dzeeta Hermien Hortense Kombote K...,0
1949,chr21_19922944,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C1,C2,C2,C1,C1,C2,C2,C2,C2,C2,C2,C1,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.002029,0.024514,0.024514,0.00501,2,Bono Desmond Dzeeta Hermien Hortense Kombote K...,0
1950,chr21_20447232,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C1,C2,C2,C1,C1,C2,C2,C2,C2,C2,C2,C1,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.002951,0.028111,0.028111,0.016831,2,Bono Desmond Dzeeta Hermien Hortense Kombote K...,0
1957,chr21_24117248,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C1,C2,C2,C1,C1,C2,C2,C2,C2,C2,C2,C1,C1,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,0.005625,0.034644,0.034644,0.011145,2,Bono Desmond Dzeeta Hermien Hortense Kombote K...,0
1964,chr21_27787264,C1,C1,C1,C1,C1,C1,C1,C1,C2,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C2,C1,C2,C1,C1,C2,C2,C1,C1,C1,C1,C1,C1,C2,C2,C1,C2,C1,C1,C1,C1,C1,C1,C1,C1,C2,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,0.0211,0.048625,0.048625,0.03175,2,Bono Desmond Dzeeta Hermien Hortense Kombote K...,0


Does the distribution of window maxima differ among the topologies?

In [121]:
IDW_window_maxes = maxes[maxes['window'].isin(IDW_windows['window'])]['max']
MDW_window_maxes = maxes[maxes['window'].isin(MDW_windows['window'])]['max']
ppn_pt_window_maxes = maxes[maxes['window'].isin(ppn_pt_windows['window'])]['max']

In [122]:
len(IDW_window_maxes)

800

In [123]:
len(MDW_window_maxes)

2475

In [124]:
len(ppn_pt_window_maxes)

339

In [125]:
IDW_window_maxes.mean()

0.06748740115086496

In [126]:
MDW_window_maxes.mean()

0.053014378694518735

In [127]:
ppn_pt_window_maxes.mean()

0.049107185857274284

In [128]:
kruskal(IDW_window_maxes, MDW_window_maxes, ppn_pt_window_maxes)

KruskalResult(statistic=31.09985751100612, pvalue=1.7650286021802972e-07)
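
The Kruskal-Wallis test only indicates that at least one group differs. One possible follow-up, which is my assumption and not part of this analysis, is a set of pairwise Mann-Whitney U tests with a Bonferroni-adjusted threshold. The sketch below uses synthetic data with arbitrary parameters, so its p-values say nothing about the real windows.

```python
from itertools import combinations

import numpy as np
from scipy.stats import kruskal, mannwhitneyu

# Synthetic stand-ins for the three sets of window maxima (arbitrary parameters).
rng = np.random.default_rng(0)
groups = {
    'IDW': rng.lognormal(mean=-3.0, sigma=0.8, size=200),
    'MDW': rng.lognormal(mean=-3.3, sigma=0.8, size=200),
    'ppn_pt': rng.lognormal(mean=-3.3, sigma=0.8, size=200),
}

# Omnibus test across all three groups.
omnibus = kruskal(*groups.values())

# Pairwise two-sided tests; Bonferroni divides alpha by the number of pairs.
pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)
pairwise = {}
for a, b in pairs:
    result = mannwhitneyu(groups[a], groups[b], alternative='two-sided')
    pairwise[(a, b)] = result.pvalue

for pair, p in pairwise.items():
    print(pair, p, p < alpha)
```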

Save these results for visualization.

In [129]:
IDW_window_maxes.to_csv('window_topologies/IDW_window_maxes.txt', sep = '\t', header = False, index = False)
MDW_window_maxes.to_csv('window_topologies/MDW_window_maxes.txt', sep = '\t', header = False, index = False)
ppn_pt_window_maxes.to_csv('window_topologies/ppn_pt_window_maxes.txt', sep = '\t', header = False, index = False)

Let's take a peek at the three-, four-, and five-cluster windows. Change the maximum column width display setting to get a better look.

In [130]:
pd.options.display.max_colwidth = 500
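
Setting `pd.options.display.max_colwidth` changes the display for the rest of the session. If only a few cells need the wider columns, a scoped alternative is `pd.option_context`, sketched here:

```python
import pandas as pd

before = pd.get_option('display.max_colwidth')

# The wider setting applies only inside the with-block.
with pd.option_context('display.max_colwidth', 500):
    inside = pd.get_option('display.max_colwidth')

# The previous setting is restored on exit.
after = pd.get_option('display.max_colwidth')
print(before, inside, after)
```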

In [131]:
complete_linkage_trees[complete_linkage_trees['n_clusters'] == 3][['window','cluster_composition']].head(10)

Unnamed: 0,window,cluster_composition
0,chr10_1572864,Bono Desmond Dzeeta Hermien Hortense Kombote Kosana Kumbuka Natalie/Andromeda Athanga Blanquita Bwamble Cindy-schwein Cleo Coco-chimp Doris Kidongo Lara Maya Mgbadolite Mirinda Nakuu Negrita Tongo Trixie Vincent Washu Yogui/Akwaya-Jean Alfred Alice Berta Bihati Bosco Brigitta Cindy-troglodytes Cindy-verus Clint Damian Frederike Gamin Ikuru Jimmie Julie-A959 Julie-LWC21 Koby Koto Linda Luky Marlin SeppToni Taweh Tibe Ula Vaillant
7,chr10_5242880,Bono Desmond Dzeeta Hermien Hortense Kombote Kosana Kumbuka Natalie/Andromeda Brigitta Coco-chimp Frederike Julie-LWC21 Lara Maya Mirinda Nakuu Tongo Vincent/Akwaya-Jean Alfred Alice Athanga Berta Bihati Blanquita Bosco Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Clint Damian Doris Gamin Ikuru Jimmie Julie-A959 Kidongo Koby Koto Linda Luky Marlin Mgbadolite Negrita SeppToni Taweh Tibe Trixie Ula Vaillant Washu Yogui
13,chr10_9437184,Ikuru/Cleo Kidongo Maya Mgbadolite Washu/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bono Bosco Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Clint Coco-chimp Damian Desmond Doris Dzeeta Frederike Gamin Hermien Hortense Jimmie Julie-A959 Julie-LWC21 Koby Kombote Kosana Koto Kumbuka Lara Linda Luky Marlin Mirinda Nakuu Natalie Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vaillant Vincent Yogui
20,chr10_13107200,Frederike Julie-A959 Tibe Ula/Bono Desmond Dzeeta Hermien Hortense Kombote Kosana Kumbuka Natalie Negrita/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bosco Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Clint Coco-chimp Damian Doris Gamin Ikuru Jimmie Julie-LWC21 Kidongo Koby Koto Lara Linda Luky Marlin Maya Mgbadolite Mirinda Nakuu SeppToni Taweh Tongo Trixie Vaillant Vincent Washu Yogui
21,chr10_15728640,Kosana Kumbuka/Akwaya-Jean Andromeda Brigitta Cindy-troglodytes Damian Gamin Ikuru Julie-A959 Julie-LWC21 Koto Lara Nakuu Ula Yogui/Alfred Alice Athanga Berta Bihati Blanquita Bono Bosco Bwamble Cindy-schwein Cindy-verus Cleo Clint Coco-chimp Desmond Doris Dzeeta Frederike Hermien Hortense Jimmie Kidongo Koby Kombote Linda Luky Marlin Maya Mgbadolite Mirinda Natalie Negrita SeppToni Taweh Tibe Tongo Trixie Vaillant Vincent Washu
22,chr10_16252928,Julie-A959/Bono Brigitta Desmond Dzeeta Gamin Hermien Hortense Kombote Kosana Kumbuka Linda Mirinda Natalie/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bosco Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Clint Coco-chimp Damian Doris Frederike Ikuru Jimmie Julie-LWC21 Kidongo Koby Koto Lara Luky Marlin Maya Mgbadolite Nakuu Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui
31,chr10_22020096,Bono Dzeeta Hermien Hortense Kombote Kosana Kumbuka Natalie/Alfred Andromeda Athanga Blanquita Brigitta Coco-chimp Damian Ikuru Julie-LWC21 Kidongo Koto Lara Luky Mgbadolite Taweh Tongo Trixie Ula/Akwaya-Jean Alice Berta Bihati Bosco Bwamble Cindy-schwein Cindy-troglodytes Cindy-verus Cleo Clint Desmond Doris Frederike Gamin Jimmie Julie-A959 Koby Linda Marlin Maya Mirinda Nakuu Negrita SeppToni Tibe Vaillant Vincent Washu Yogui
32,chr10_22544384,Lara/Cindy-schwein Gamin Hortense Julie-A959 Kosana Kumbuka Vaillant/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bono Bosco Brigitta Bwamble Cindy-troglodytes Cindy-verus Cleo Clint Coco-chimp Damian Desmond Doris Dzeeta Frederike Hermien Ikuru Jimmie Julie-LWC21 Kidongo Koby Kombote Koto Linda Luky Marlin Maya Mgbadolite Mirinda Nakuu Natalie Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vincent Washu Yogui
33,chr10_23068672,Lara/Alfred Athanga Brigitta Cindy-schwein Cleo Coco-chimp Ikuru Marlin Mgbadolite Mirinda Negrita Trixie Vaillant Vincent/Akwaya-Jean Alice Andromeda Berta Bihati Blanquita Bono Bosco Bwamble Cindy-troglodytes Cindy-verus Clint Damian Desmond Doris Dzeeta Frederike Gamin Hermien Hortense Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Kombote Kosana Koto Kumbuka Linda Luky Maya Nakuu Natalie SeppToni Taweh Tibe Tongo Ula Washu Yogui
39,chr10_26214400,Cindy-troglodytes Lara/Bono Desmond Dzeeta Hermien Hortense Kombote Kosana Kumbuka Natalie/Akwaya-Jean Alfred Alice Andromeda Athanga Berta Bihati Blanquita Bosco Brigitta Bwamble Cindy-schwein Cindy-verus Cleo Clint Coco-chimp Damian Doris Frederike Gamin Ikuru Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Koto Linda Luky Marlin Maya Mgbadolite Mirinda Nakuu Negrita SeppToni Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui


In [132]:
complete_linkage_trees[complete_linkage_trees['n_clusters'] == 4][['window','cluster_composition']].head(10)

Unnamed: 0,window,cluster_composition
49,chr10_31457280,Blanquita Bosco Coco-chimp Lara Trixie/Bono Desmond Dzeeta Hermien Hortense Kombote Kosana Kumbuka Natalie/Akwaya-Jean Alice Berta Cindy-verus Clint Damian Jimmie Julie-LWC21 Koby Koto Linda SeppToni Taweh Tibe Tongo/Alfred Andromeda Athanga Bihati Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cleo Doris Frederike Gamin Ikuru Julie-A959 Kidongo Luky Marlin Maya Mgbadolite Mirinda Nakuu Negrita Ula Vaillant Vincent Washu Yogui
101,chr10_65536000,Doris Ula Yogui/Akwaya-Jean Alice Bosco Gamin Ikuru Jimmie Julie-A959 Julie-LWC21 Koby Koto Lara Linda Marlin SeppToni Tibe Vaillant Vincent/Athanga Bono Cindy-schwein Cleo Coco-chimp Desmond Dzeeta Hermien Hortense Kidongo Kombote Kosana Kumbuka Nakuu Natalie Washu/Alfred Andromeda Berta Bihati Blanquita Brigitta Bwamble Cindy-troglodytes Cindy-verus Clint Damian Frederike Luky Maya Mgbadolite Mirinda Negrita Taweh Tongo Trixie
199,chr10_121110528,Brigitta Hermien/Doris Gamin SeppToni/Athanga Bihati Cindy-schwein Coco-chimp Damian Ikuru Julie-A959 Nakuu Tibe Vaillant/Akwaya-Jean Alfred Alice Andromeda Berta Blanquita Bono Bosco Bwamble Cindy-troglodytes Cindy-verus Cleo Clint Desmond Dzeeta Frederike Hortense Jimmie Julie-LWC21 Kidongo Koby Kombote Kosana Koto Kumbuka Lara Linda Luky Marlin Maya Mgbadolite Mirinda Natalie Negrita Taweh Tongo Trixie Ula Vincent Washu Yogui
212,chr11_2097152,Cleo Ikuru/Coco-chimp Kidongo Nakuu/Alfred Blanquita Cindy-troglodytes Gamin Julie-A959 Lara Marlin Negrita Ula Yogui/Akwaya-Jean Alice Andromeda Athanga Berta Bihati Bono Bosco Brigitta Bwamble Cindy-schwein Cindy-verus Clint Damian Desmond Doris Dzeeta Frederike Hermien Hortense Jimmie Julie-LWC21 Koby Kombote Kosana Koto Kumbuka Linda Luky Maya Mgbadolite Mirinda Natalie SeppToni Taweh Tibe Tongo Trixie Vaillant Vincent Washu
282,chr11_41418752,Mirinda Negrita/Bono Dzeeta Hortense Kombote Kosana/Alice Berta Bosco Cindy-verus Koby Linda SeppToni/Akwaya-Jean Alfred Andromeda Athanga Bihati Blanquita Brigitta Bwamble Cindy-schwein Cindy-troglodytes Cleo Clint Coco-chimp Damian Desmond Doris Frederike Gamin Hermien Ikuru Jimmie Julie-A959 Julie-LWC21 Kidongo Koto Kumbuka Lara Luky Marlin Maya Mgbadolite Nakuu Natalie Taweh Tibe Tongo Trixie Ula Vaillant Vincent Washu Yogui
358,chr11_87031808,Bono Desmond Dzeeta Hermien Hortense Kombote Kosana Kumbuka Natalie/Brigitta Doris Julie-LWC21 Lara Luky Marlin Negrita Tibe Ula Vaillant/Andromeda Athanga Bihati Bwamble Cindy-schwein Cleo Coco-chimp Ikuru Julie-A959 Kidongo Mgbadolite Nakuu Tongo Trixie Vincent Yogui/Akwaya-Jean Alfred Alice Berta Blanquita Bosco Cindy-troglodytes Cindy-verus Clint Damian Frederike Gamin Jimmie Koby Koto Linda Maya Mirinda SeppToni Taweh Washu
584,chr12_95420416,Akwaya-Jean Alfred Koto Nakuu Taweh/Bono Desmond Dzeeta Hermien Hortense Kombote Kosana Kumbuka Natalie/Andromeda Athanga Cindy-schwein Cleo Coco-chimp Ikuru Lara Marlin Negrita Tibe Tongo Trixie/Alice Berta Bihati Blanquita Bosco Brigitta Bwamble Cindy-troglodytes Cindy-verus Clint Damian Doris Frederike Gamin Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Linda Luky Maya Mgbadolite Mirinda SeppToni Ula Vaillant Vincent Washu Yogui
690,chr13_26214400,Bono Desmond Dzeeta Hermien Hortense Kombote Kosana Kumbuka Natalie/Alice Berta Bosco Brigitta Cindy-verus Cleo Clint Jimmie Julie-A959 Koby Linda Nakuu SeppToni/Akwaya-Jean Andromeda Cindy-troglodytes Damian Doris Frederike Gamin Ikuru Julie-LWC21 Koto Lara Luky Marlin Taweh Tibe Vaillant Washu/Alfred Athanga Bihati Blanquita Bwamble Cindy-schwein Coco-chimp Kidongo Maya Mgbadolite Mirinda Negrita Tongo Trixie Ula Vincent Yogui
757,chr13_65011712,Cindy-troglodytes/Akwaya-Jean Alfred Brigitta Damian Kidongo Koto Taweh Vaillant/Berta Bono Desmond Dzeeta Gamin Hermien Hortense Jimmie Julie-LWC21 Koby Kombote Kosana Kumbuka Luky Marlin Natalie Negrita Ula Yogui/Alice Andromeda Athanga Bihati Blanquita Bosco Bwamble Cindy-schwein Cindy-verus Cleo Clint Coco-chimp Doris Frederike Ikuru Julie-A959 Lara Linda Maya Mgbadolite Mirinda Nakuu SeppToni Tibe Tongo Trixie Vincent Washu
958,chr14_83361792,Blanquita Negrita/Bono Cindy-troglodytes Cleo Desmond Doris Dzeeta Hermien Hortense Kombote Kumbuka Lara Mirinda Natalie Tibe Trixie Yogui/Alfred Andromeda Athanga Brigitta Bwamble Cindy-schwein Coco-chimp Gamin Ikuru Kosana Marlin Maya Mgbadolite Tongo Ula Vaillant Washu/Akwaya-Jean Alice Berta Bihati Bosco Cindy-verus Clint Damian Frederike Jimmie Julie-A959 Julie-LWC21 Kidongo Koby Koto Linda Luky Nakuu SeppToni Taweh Vincent


In [133]:
complete_linkage_trees[complete_linkage_trees['n_clusters'] == 5][['window','cluster_composition']].head(4)

Unnamed: 0,window,cluster_composition
497,chr12_42467328,Hermien Kombote Vaillant/Bono Brigitta Julie-LWC21 Koby Kosana Koto Linda Yogui/Alice Berta Bosco Cindy-verus Clint Desmond Jimmie Natalie SeppToni/Alfred Bihati Bwamble Cindy-schwein Doris Frederike Gamin Ikuru Kidongo Luky Mgbadolite Nakuu Tongo Trixie Vincent Washu/Akwaya-Jean Andromeda Athanga Blanquita Cindy-troglodytes Cleo Coco-chimp Damian Dzeeta Hortense Julie-A959 Kumbuka Lara Marlin Maya Mirinda Negrita Taweh Tibe Ula
1494,chr1_12058624,Athanga/Cindy-troglodytes Frederike Maya/Bono Desmond Dzeeta Hermien Hortense Kombote Kosana Kumbuka Natalie/Alice Bosco Cindy-schwein Cindy-verus Cleo Clint Coco-chimp Doris Ikuru Jimmie Lara Linda Marlin Mgbadolite Nakuu SeppToni Taweh Ula Vincent/Akwaya-Jean Alfred Andromeda Berta Bihati Blanquita Brigitta Bwamble Damian Gamin Julie-A959 Julie-LWC21 Kidongo Koby Koto Luky Mirinda Negrita Tibe Tongo Trixie Vaillant Washu Yogui
2817,chr4_39321600,Brigitta Cleo Negrita Vincent/Bono Desmond Dzeeta Hermien Hortense Kombote Kosana Kumbuka/Bihati Blanquita Kidongo Marlin Mirinda Trixie Ula Vaillant Yogui/Alfred Alice Berta Bosco Cindy-troglodytes Cindy-verus Clint Jimmie Koby Linda SeppToni Taweh/Akwaya-Jean Andromeda Athanga Bwamble Cindy-schwein Coco-chimp Damian Doris Frederike Gamin Ikuru Julie-A959 Julie-LWC21 Koto Lara Luky Maya Mgbadolite Nakuu Natalie Tibe Tongo Washu
2889,chr4_78118912,Brigitta/Bihati Cindy-schwein Cleo Coco-chimp Frederike Ikuru Tibe Tongo/Bono Desmond Dzeeta Hermien Hortense Kombote Kosana Kumbuka Natalie/Akwaya-Jean Alice Cindy-troglodytes Damian Julie-A959 Koto Luky Marlin Mirinda SeppToni Taweh/Alfred Andromeda Athanga Berta Blanquita Bosco Bwamble Cindy-verus Clint Doris Gamin Jimmie Julie-LWC21 Kidongo Koby Lara Linda Maya Mgbadolite Nakuu Negrita Trixie Ula Vaillant Vincent Washu Yogui


Reset the display setting.

In [134]:
pd.options.display.max_colwidth = 50

## Bonobo-Chimpanzee Windows <a class = 'anchor' id = 'bonobochimpanzeewindows'></a>

Let's take a closer look at the 339 windows with bonobo-chimpanzee clustering. Start by creating and saving a BED file.

In [135]:
ppn_pt_clustering_windows_BED = pd.DataFrame()
ppn_pt_clustering_windows_BED['window_split'] = ppn_pt_windows['window']
ppn_pt_clustering_windows_BED = ppn_pt_clustering_windows_BED['window_split'].str.split('_', expand=True)
ppn_pt_clustering_windows_BED.rename(columns = {0:'chr', 1:'window_start'}, inplace = True)
ppn_pt_clustering_windows_BED['window_start'] = ppn_pt_clustering_windows_BED['window_start'].astype(int)
ppn_pt_clustering_windows_BED['window_end'] = ppn_pt_clustering_windows_BED['window_start'] + 1048576
ppn_pt_clustering_windows_BED = ppn_pt_clustering_windows_BED.sort_values(by = 'chr')
ppn_pt_clustering_windows_BED.head(5)

Unnamed: 0,chr,window_start,window_end
1484,chr1,5242880,6291456
1791,chr1,206569472,207618048
1781,chr1,198705152,199753728
1775,chr1,194510848,195559424
1744,chr1,178257920,179306496


In [136]:
ppn_pt_clustering_windows_BED.to_csv('ppn_ptr_windows/ppn_ptr_clustering_windows.bed', sep = '\t', header = False, index = False)

How are these distributed among the chromosomes?

In [137]:
ppn_pt_clustering_windows_BED.groupby(['chr']).size().to_frame('N')

Unnamed: 0_level_0,N
chr,Unnamed: 1_level_1
chr1,23
chr10,20
chr11,17
chr12,7
chr13,17
chr14,14
chr15,10
chr16,2
chr17,2
chr18,8


How many of these loci are unique (i.e., non-overlapping)? First, create a pybedtools object (we'll use this later) and then perform a merge.

In [138]:
ppn_pt_clustering_windows_pbtBED = pybedtools.BedTool().from_dataframe(ppn_pt_clustering_windows_BED).sort()
ppn_pt_clustering_windows_pbtBED.head()

chr1	5242880	6291456
 chr1	27262976	28311552
 chr1	35651584	36700160
 chr1	40370176	41418752
 chr1	48758784	49807360
 chr1	49283072	50331648
 chr1	71827456	72876032
 chr1	75497472	76546048
 chr1	77594624	78643200
 chr1	78118912	79167488
 

In [139]:
len(ppn_pt_clustering_windows_pbtBED.merge())

252
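
The drop from 339 windows to 252 merged loci reflects that neighboring windows overlap (the window starts listed above are 524,288 bp apart while each window spans 1,048,576 bp). As a minimal sketch of what a merge does, assuming half-open BED intervals and the default behavior of joining overlapping or book-ended features, the hypothetical helper `merge_intervals` below collapses three windows taken from the sorted BED output above. It is an illustration only, not the pybedtools implementation.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended half-open (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            # Starts a new locus.
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]

# Three windows from the sorted BED above; the last two overlap by half a window.
windows = [
    ('chr1', 40370176, 41418752),
    ('chr1', 48758784, 49807360),
    ('chr1', 49283072, 50331648),
]

merged_windows = merge_intervals(windows)
print(merged_windows)
```

The two overlapping windows collapse into a single locus spanning 48758784 to 50331648, so three windows become two loci.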

Let's intersect this object with genes in the panTro6 genome.

In [140]:
genes_pbtBED = pybedtools.BedTool('annotations/panTro6_genes.bed')
genes_pbtBED.head(5)

chr1	344	12299	NM_001280424.1	INTS11
 chr1	12400	16513	XM_016952033.2	CPTP
 chr1	17887	23426	XM_003307748.4	TAS1R3
 chr1	22943	36942	XM_016958290.2	DVL1
 chr1	39619	45507	NM_001280245.1	MXRA8
 

In [141]:
ppn_pt_clustering_windows_genes_intersect = ppn_pt_clustering_windows_pbtBED.intersect(genes_pbtBED, wa = True, wb = True).to_dataframe(names=['window_chr','window_start','window_end','gene_chr','gene_start','gene_end','gene_transcript','gene'])
ppn_pt_clustering_windows_genes_intersect.head(5)

Unnamed: 0,window_chr,window_start,window_end,gene_chr,gene_start,gene_end,gene_transcript,gene
0,chr1,5242880,6291456,chr1,5295690,5307847,XM_016953263.2,KLHL21
1,chr1,5242880,6291456,chr1,5314703,5321347,XM_024353172.1,LOC112206873
2,chr1,5242880,6291456,chr1,5332024,5341008,XM_514343.6,THAP3
3,chr1,5242880,6291456,chr1,5261155,5284720,XM_525169.5,TAS1R1
4,chr1,5242880,6291456,chr1,5284948,5294269,XM_514341.6,ZBTB48


Let's get a list of genes in these windows.

In [142]:
ppn_pt_clustering_windows_genes = ppn_pt_clustering_windows_genes_intersect['gene'].drop_duplicates().sort_values()
ppn_pt_clustering_windows_genes.head(5)

1173    A4GALT
759      AAGAB
825       AATF
69       ABCA4
1927     ABCB5
Name: gene, dtype: object

In [143]:
len(ppn_pt_clustering_windows_genes)

2035

We will save this file to look for gene enrichment of particular phenotypes.

In [144]:
ppn_pt_clustering_windows_genes.to_csv('phenotype_enrichment_2/data/ppn_pt_clustering_windows_genes.txt', sep = '\t', header = False, index = False)

Let's run another intersection and count the number of genes in each clustering window.

In [145]:
ppn_pt_clustering_windows_genes_count_intersect = ppn_pt_clustering_windows_pbtBED.intersect(genes_pbtBED, c = True).to_dataframe(names=['window_chr','window_start','window_end','gene_count'])
ppn_pt_clustering_windows_genes_count_intersect.head(5)

Unnamed: 0,window_chr,window_start,window_end,gene_count
0,chr1,5242880,6291456,9
1,chr1,27262976,28311552,12
2,chr1,35651584,36700160,17
3,chr1,40370176,41418752,9
4,chr1,48758784,49807360,5


In [146]:
ppn_pt_clustering_windows_genes_count_intersect['gene_count'].min()

0

In [147]:
ppn_pt_clustering_windows_genes_count_intersect['gene_count'].max()

37

In [148]:
ppn_pt_clustering_windows_genes_count_intersect['gene_count'].mean()

7.115044247787611

We should compare these to the genome-wide background.

In [149]:
all_windows_genes_count_intersect = windows_pbtBED.intersect(genes_pbtBED, c = True).to_dataframe(names=['window_chr','window_start','window_end','gene_count'])
all_windows_genes_count_intersect.head(5)

Unnamed: 0,window_chr,window_start,window_end,gene_count
0,chr1,1048576,2097152,14
1,chr1,1572864,2621440,13
2,chr1,2097152,3145728,10
3,chr1,2621440,3670016,1
4,chr1,3145728,4194304,2


In [150]:
all_windows_genes_count_intersect['gene_count'].min()

0

In [151]:
all_windows_genes_count_intersect['gene_count'].max()

67

In [152]:
all_windows_genes_count_intersect['gene_count'].mean()

7.511764705882353

In [153]:
mannwhitneyu(all_windows_genes_count_intersect['gene_count'], ppn_pt_clustering_windows_genes_count_intersect['gene_count'])

MannwhitneyuResult(statistic=774859.0, pvalue=0.2910966842151933)

### In Silico Mutagenesis <a class = 'anchor' id = 'bonobochimpanzeeinsilicomutagenesis'></a>

What variants are driving these patterns? Let's examine the results of in silico mutagenesis in clustering windows using bonobo-specific variants. 

In [154]:
ppn_pt_window_variants = pd.read_csv('in_silico_mutagenesis/ppn_pt_clustering_window_variants.txt', sep = '\t', header = 0)
ppn_pt_window_variants.head(5)

Unnamed: 0,chr,pos,window,ref,alt,mse,1-pearson,1-spearman
0,chr1,5259439,5242880,C,T,5.200589e-10,1.349846e-09,1.82861e-09
1,chr1,5263963,5242880,G,T,2.797777e-08,8.451103e-08,6.781088e-08
2,chr1,5264491,5242880,A,G,8.753124e-09,2.084939e-08,1.491604e-08
3,chr1,5264816,5242880,C,A,1.798138e-09,5.556265e-09,3.153963e-09
4,chr1,5266605,5242880,G,A,6.075133e-09,2.205699e-08,2.940749e-08


Let's edit the window column to also include the chromosome. 

In [155]:
ppn_pt_window_variants['window'] = ppn_pt_window_variants['chr'] + '_' + ppn_pt_window_variants['window'].astype(str)
ppn_pt_window_variants.head(5)

Unnamed: 0,chr,pos,window,ref,alt,mse,1-pearson,1-spearman
0,chr1,5259439,chr1_5242880,C,T,5.200589e-10,1.349846e-09,1.82861e-09
1,chr1,5263963,chr1_5242880,G,T,2.797777e-08,8.451103e-08,6.781088e-08
2,chr1,5264491,chr1_5242880,A,G,8.753124e-09,2.084939e-08,1.491604e-08
3,chr1,5264816,chr1_5242880,C,A,1.798138e-09,5.556265e-09,3.153963e-09
4,chr1,5266605,chr1_5242880,G,A,6.075133e-09,2.205699e-08,2.940749e-08


In [156]:
len(ppn_pt_window_variants)

464675

How many windows are represented by these variants?

In [157]:
len(ppn_pt_window_variants['window'].unique())

339

To identify 3D-modifying variants, let's find the value of the lowest bonobo-chimpanzee divergence score in clustering windows and then map that value to the above dataframe.

In [158]:
def get_ppn_pt_window_minima(window):
    subset = comparisons[comparisons['window'] == window]
    subset = subset[subset['dyad_type'] == 'ppn-pt']
    minimum = subset['divergence'].min()
    
    return minimum

In [159]:
ppn_pt_window_mins = [get_ppn_pt_window_minima(window) for window in ppn_pt_windows['window']]
ppn_pt_window_mins_dict = dict(zip(ppn_pt_windows['window'], ppn_pt_window_mins))
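
The list comprehension above filters the full comparisons dataframe once per window. A minimal standard-library sketch of the same idea in a single pass is shown below. The rows, the divergence values, and the `'other'` dyad label are made up for illustration; only the `'ppn-pt'` label comes from this notebook. In pandas, the analogous approach would be to subset to `'ppn-pt'` rows and call `groupby('window')['divergence'].min()`.

```python
# Toy rows shaped like (window, dyad_type, divergence); the values are invented.
rows = [
    ('chr1_5242880', 'ppn-pt', 0.0040),
    ('chr1_5242880', 'ppn-pt', 0.0032),
    ('chr1_5242880', 'other', 0.0001),   # 'other' is a placeholder dyad label
    ('chr1_27262976', 'ppn-pt', 0.0028),
    ('chr1_27262976', 'other', 0.0002),
]

def window_minima(rows, dyad_type='ppn-pt'):
    """Return {window: lowest divergence} using only rows of the given dyad type."""
    minima = {}
    for window, dyad, divergence in rows:
        if dyad != dyad_type:
            continue
        if window not in minima or divergence < minima[window]:
            minima[window] = divergence
    return minima

toy_minima = window_minima(rows)
print(toy_minima)
```

The resulting dictionary has the same shape as `ppn_pt_window_mins_dict` and could be passed to `Series.map` in the same way.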

In [160]:
ppn_pt_window_variants['min'] = ppn_pt_window_variants['window'].map(ppn_pt_window_mins_dict)

In [161]:
ppn_pt_window_variants.head(5)

Unnamed: 0,chr,pos,window,ref,alt,mse,1-pearson,1-spearman,min
0,chr1,5259439,chr1_5242880,C,T,5.200589e-10,1.349846e-09,1.82861e-09,0.003167
1,chr1,5263963,chr1_5242880,G,T,2.797777e-08,8.451103e-08,6.781088e-08,0.003167
2,chr1,5264491,chr1_5242880,A,G,8.753124e-09,2.084939e-08,1.491604e-08,0.003167
3,chr1,5264816,chr1_5242880,C,A,1.798138e-09,5.556265e-09,3.153963e-09,0.003167
4,chr1,5266605,chr1_5242880,G,A,6.075133e-09,2.205699e-08,2.940749e-08,0.003167


Now let's identify variants whose measured effect (`1-spearman`) is greater than or equal to the lowest divergence score among bonobo-chimpanzee pairwise comparisons for that window.

In [162]:
ppn_pt_3d_modifying_variants = ppn_pt_window_variants[ppn_pt_window_variants['1-spearman'] >= ppn_pt_window_variants['min']]
ppn_pt_3d_modifying_variants.head(5)

Unnamed: 0,chr,pos,window,ref,alt,mse,1-pearson,1-spearman,min
1369,chr1,27588032,chr1_27262976,C,G,0.001536,0.002594,0.003479,0.002844
2481,chr1,36106813,chr1_35651584,G,A,0.001337,0.001991,0.002186,0.001166
5550,chr1,49491849,chr1_48758784,T,C,0.000322,0.000725,0.001038,0.000903
7526,chr1,72255507,chr1_71827456,G,C,0.000248,0.004421,0.008249,0.008168
11067,chr1,78357115,chr1_77594624,G,T,0.000449,0.002686,0.003226,0.003082


Let's save this dataframe.

In [241]:
ppn_pt_3d_modifying_variants.to_csv('ppn_ptr_windows/ppn_pt_3d_modifying_variants.txt', sep = '\t', header = False, index = False)

In [163]:
len(ppn_pt_3d_modifying_variants)

163

In [164]:
len(ppn_pt_3d_modifying_variants[['chr','pos','ref','alt']].drop_duplicates())

136

The dataframe has 163 rows but only 136 unique variants, so some 3D-modifying variants must appear in more than one (overlapping) window. Separately, a single window may contain multiple 3D-modifying variants. Let's assess both scenarios.

In [165]:
multi_window_ppn_pt_3d_modifying_variants = ppn_pt_3d_modifying_variants[ppn_pt_3d_modifying_variants.duplicated(['chr','pos'], keep = False)]
multi_window_ppn_pt_3d_modifying_variants.head(5)

Unnamed: 0,chr,pos,window,ref,alt,mse,1-pearson,1-spearman,min
11067,chr1,78357115,chr1_77594624,G,T,0.000449,0.002686,0.003226,0.003082
11068,chr1,78357115,chr1_78118912,G,T,0.000383,0.005023,0.005868,0.005643
21911,chr1,178484674,chr1_177733632,G,A,0.002663,0.005999,0.005829,0.003804
21912,chr1,178484674,chr1_178257920,G,A,0.002513,0.004524,0.004877,0.003305
72207,chr11,77997697,chr11_77070336,G,A,0.002093,0.013661,0.025199,0.023994


In [166]:
len(multi_window_ppn_pt_3d_modifying_variants)/2

27.0
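
Dividing by two assumes that every duplicated variant appears in exactly two windows. A minimal sketch of a check that does not rely on that assumption is to count windows per variant directly. The rows below are copied from outputs shown earlier in this section; the variable names are hypothetical.

```python
from collections import Counter

# (chr, pos, window) rows copied from the outputs above.
variant_rows = [
    ('chr1', 27588032, 'chr1_27262976'),
    ('chr1', 78357115, 'chr1_77594624'),
    ('chr1', 78357115, 'chr1_78118912'),
    ('chr1', 178484674, 'chr1_177733632'),
    ('chr1', 178484674, 'chr1_178257920'),
]

# Number of windows in which each variant is 3D-modifying.
windows_per_variant = Counter((chrom, pos) for chrom, pos, _ in variant_rows)

# Keep only variants seen in more than one window.
multi_window_variants = {v: n for v, n in windows_per_variant.items() if n > 1}

print(len(multi_window_variants))            # number of multi-window variants
print(set(multi_window_variants.values()))   # distinct window counts among them
```

If the set of window counts contains only 2, halving the number of duplicated rows gives the number of multi-window variants.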

27 variants are 3D-modifying in two overlapping windows each. What about windows with multiple variants?

In [167]:
ppn_pt_3d_modifying_variants_n_variants_per_window = ppn_pt_3d_modifying_variants.groupby(['window']).size().to_frame('N')
ppn_pt_3d_modifying_variants_n_variants_per_window[ppn_pt_3d_modifying_variants_n_variants_per_window['N'] > 1]

Unnamed: 0_level_0,N
window,Unnamed: 1_level_1
chr10_87556096,2
chr11_49807360,3
chr4_113770496,2


In [168]:
ppn_pt_3d_modifying_variants[ppn_pt_3d_modifying_variants['window'] == 'chr10_87556096']

Unnamed: 0,chr,pos,window,ref,alt,mse,1-pearson,1-spearman,min
49247,chr10,88057593,chr10_87556096,G,C,0.018322,0.034889,0.031364,0.026767
49249,chr10,88057595,chr10_87556096,G,T,0.019119,0.03646,0.032893,0.026767


In [169]:
ppn_pt_3d_modifying_variants[ppn_pt_3d_modifying_variants['window'] == 'chr11_49807360']

Unnamed: 0,chr,pos,window,ref,alt,mse,1-pearson,1-spearman,min
62241,chr11,50007725,chr11_49807360,G,A,0.001016,0.010416,0.012807,0.005267
64136,chr11,50720000,chr11_49807360,C,T,0.00072,0.007704,0.009659,0.005267
64306,chr11,50767074,chr11_49807360,T,A,0.000908,0.009553,0.012341,0.005267


In [170]:
ppn_pt_3d_modifying_variants[ppn_pt_3d_modifying_variants['window'] == 'chr4_113770496']

Unnamed: 0,chr,pos,window,ref,alt,mse,1-pearson,1-spearman,min
309222,chr4,114115871,chr4_113770496,G,A,0.000826,0.026142,0.038261,0.029619
309223,chr4,114115875,chr4_113770496,C,T,0.000969,0.030865,0.045659,0.029619


How many windows are covered by 3D-modifying variants?

In [171]:
len(ppn_pt_3d_modifying_variants['window'].unique())

159

In [172]:
159/339

0.4690265486725664

Are there differences in the mutation frequency of these variants?

In [173]:
ppn_pt_3d_modifying_unique_variants = ppn_pt_3d_modifying_variants[['chr','pos','ref','alt']].drop_duplicates()

In [174]:
ppn_pt_3d_modifying_unique_variant_counts = ppn_pt_3d_modifying_unique_variants.groupby(['ref','alt']).size().to_frame('N').reset_index()
ppn_pt_3d_modifying_unique_variant_counts['prop'] = ppn_pt_3d_modifying_unique_variant_counts['N']/ppn_pt_3d_modifying_unique_variant_counts['N'].sum()
ppn_pt_3d_modifying_unique_variant_counts

Unnamed: 0,ref,alt,N,prop
0,A,C,8,0.058824
1,A,G,15,0.110294
2,A,T,3,0.022059
3,C,A,5,0.036765
4,C,G,7,0.051471
5,C,T,28,0.205882
6,G,A,28,0.205882
7,G,C,11,0.080882
8,G,T,9,0.066176
9,T,A,3,0.022059


In [175]:
ppn_pt_3d_modifying_unique_variant_counts.pivot(index = 'ref', columns = 'alt', values = 'prop')


alt,A,C,G,T
ref,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,,0.058824,0.110294,0.022059
C,0.036765,,0.051471,0.205882
G,0.205882,0.080882,,0.066176
T,0.022059,0.110294,0.029412,


How frequent are transitions?

In [176]:
(15+28+28+15)/136

0.6323529411764706
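
The counts in the expression above are typed in by hand. A minimal sketch of deriving the same fraction from per-substitution counts follows. Ten of the counts are copied from the count table above; the T>C and T>G counts are not visible in that table and are back-calculated here from the pivoted proportions (0.110294 × 136 ≈ 15 and 0.029412 × 136 ≈ 4), which is an assumption.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {('A', 'G'), ('G', 'A'), ('C', 'T'), ('T', 'C')}

# Counts per (ref, alt) substitution.
substitution_counts = {
    ('A', 'C'): 8, ('A', 'G'): 15, ('A', 'T'): 3,
    ('C', 'A'): 5, ('C', 'G'): 7, ('C', 'T'): 28,
    ('G', 'A'): 28, ('G', 'C'): 11, ('G', 'T'): 9,
    ('T', 'A'): 3,
    ('T', 'C'): 15, ('T', 'G'): 4,  # assumed: back-calculated from the pivot table
}

total = sum(substitution_counts.values())
n_transitions = sum(n for pair, n in substitution_counts.items() if pair in TRANSITIONS)
transition_fraction = n_transitions / total

print(n_transitions, total, transition_fraction)
```

Deriving the fraction from the counts keeps it in sync if the set of 3D-modifying variants changes.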

How often do these variants fall within a CTCF binding site? Let's use data from Schwalie et al. 2013.

In [177]:
ppn_pt_3d_modifying_variants_BED = ppn_pt_3d_modifying_variants[['chr','pos']].copy().drop_duplicates()
ppn_pt_3d_modifying_variants_BED = ppn_pt_3d_modifying_variants_BED.rename(columns = {'pos':'end'})
ppn_pt_3d_modifying_variants_BED['start'] = ppn_pt_3d_modifying_variants_BED['end']-1
ppn_pt_3d_modifying_variants_BED = ppn_pt_3d_modifying_variants_BED[['chr','start','end']]
ppn_pt_3d_modifying_variants_BED.head(5)                                                                 

Unnamed: 0,chr,start,end
1369,chr1,27588031,27588032
2481,chr1,36106812,36106813
5550,chr1,49491848,49491849
7526,chr1,72255506,72255507
11067,chr1,78357114,78357115


In [178]:
ppn_pt_3d_modifying_variants_pbtBED = pybedtools.BedTool().from_dataframe(ppn_pt_3d_modifying_variants_BED).sort()
ppn_pt_3d_modifying_variants_pbtBED.head(5)

chr1	27588031	27588032
 chr1	36106812	36106813
 chr1	49491848	49491849
 chr1	72255506	72255507
 chr1	78357114	78357115
 

In [179]:
len(ppn_pt_3d_modifying_variants_pbtBED)

136

In [180]:
CTCF_pbtBED = pybedtools.BedTool('annotations/panTro6_CTCF.bed')
CTCF_pbtBED.head(5)

chr1	5909	6180	6.166802	270	0	0
 chr1	32407	32758	12.266173	0	0	350
 chr1	35033	35334	7.860073	0	0	300
 chr1	58418	59129	24.39928	710	381	421
 chr1	61438	61919	28.7659185	460	411	451
 

In [181]:
ppn_pt_3d_modifying_unique_variants_CTCF_intersect = ppn_pt_3d_modifying_variants_pbtBED.intersect(CTCF_pbtBED, c = True).to_dataframe(names=['chr','start','end','count'])
ppn_pt_3d_modifying_unique_variants_CTCF_intersect.head(5)

Unnamed: 0,chr,start,end,count
0,chr1,27588031,27588032,1
1,chr1,36106812,36106813,1
2,chr1,49491848,49491849,0
3,chr1,72255506,72255507,0
4,chr1,78357114,78357115,0


In [182]:
len(ppn_pt_3d_modifying_unique_variants_CTCF_intersect[ppn_pt_3d_modifying_unique_variants_CTCF_intersect['count'] > 0])

74

In [183]:
74/136

0.5441176470588235
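
The intersection above depends on two conventions: each variant position is converted to a one-base BED interval (`start = pos - 1`, `end = pos`), and BED intervals are half-open, so a feature's `end` coordinate is not itself covered. A minimal sketch of both, assuming `pos` is 1-based; the helper names and the CTCF peak coordinates are hypothetical and chosen only for illustration.

```python
def position_to_bed(chrom, pos):
    """Convert a 1-based position to a half-open, 0-based BED interval."""
    return chrom, pos - 1, pos

def overlaps(feature_a, feature_b):
    """Return True if two half-open (chrom, start, end) intervals share a base."""
    chrom_a, start_a, end_a = feature_a
    chrom_b, start_b, end_b = feature_b
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a

# First 3D-modifying variant from the table above.
variant = position_to_bed('chr1', 27588032)

# Hypothetical peaks: one covering the variant, one ending just before it.
covering_peak = ('chr1', 27587900, 27588200)
adjacent_peak = ('chr1', 27587900, 27588031)

print(variant)
print(overlaps(variant, covering_peak), overlaps(variant, adjacent_peak))
```

The peak that ends at the variant's start coordinate does not count as an overlap, which is the behavior expected from half-open intervals.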

## Genes in Window Phenotype Enrichment

In [184]:
fdr_table = []

In [185]:
def reportFDRcorrectedPthreshold(set_name, ontology, q_value_threshold, resolution=0.0001, minStart=0):
    fdr_empiric = pd.read_csv(f'phenotype_enrichment_2/empiric_FDR/{set_name}_{ontology}_empiric_FDR.txt', sep = '\t', header = None, index_col = 0)
    obs = pd.read_csv(f'phenotype_enrichment_2/enrichment/{set_name}_{ontology}_enrichment.txt', sep = '\t')

    fdr_threshold = []
    for i in np.arange(minStart,0.05,resolution):
        
        observed_positive = sum(obs['p_value'] <= i)
        average_false_positive = (fdr_empiric <= i).sum().mean()
        q = average_false_positive/observed_positive
        fdr_threshold.append([set_name, ontology, q_value_threshold, i, observed_positive, average_false_positive, q])
        
        if (q != np.inf) & (q > q_value_threshold):
            break
    
    threshold = fdr_threshold[-2]
    fdr_table.append(threshold)
    #fdr_threshold = pd.DataFrame(fdr_threshold, columns = ['pval_threshold','obsPos','avgFalsePos','q'])
    #return fdr_threshold.tail(2).head(1)
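
The core of the function above is the empirical q-value at a given p-value threshold: the average number of null (permuted) tests passing the threshold divided by the number of observed tests passing it. A minimal standard-library sketch of that one step follows, assuming each null set holds the p-values from one permutation. The function name, the toy p-values, and the choice to return infinity whenever no observed test passes are illustrative assumptions.

```python
def empirical_q(observed_p, null_p_sets, threshold):
    """Return (observed_positive, average_false_positive, q) at one threshold."""
    observed_positive = sum(p <= threshold for p in observed_p)
    # Count passing tests within each permutation, then average across permutations.
    false_positive_counts = [sum(p <= threshold for p in null_set) for null_set in null_p_sets]
    average_false_positive = sum(false_positive_counts) / len(false_positive_counts)
    if observed_positive == 0:
        # Explicit guard instead of dividing by zero.
        q = float('inf')
    else:
        q = average_false_positive / observed_positive
    return observed_positive, average_false_positive, q

# Toy p-values (invented): four observed tests and two permutations.
observed = [0.001, 0.004, 0.03, 0.2]
null_sets = [
    [0.002, 0.5, 0.9, 0.7],
    [0.3, 0.6, 0.04, 0.8],
]

result_loose = empirical_q(observed, null_sets, 0.005)
result_strict = empirical_q(observed, null_sets, 0.0005)
print(result_loose)
print(result_strict)
```

Handling the zero-denominator case explicitly avoids division-by-zero warnings when no observed test passes a threshold.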

In [186]:
#combinations = [(set_name,ontology,q_value_threshold) for set_name in ['ppn_pt_clustering'] for ontology in ['BP','GWAS','HPO','MP'] for q_value_threshold in [0.05,0.1]]

In [187]:
#[reportFDRcorrectedPthreshold(set_name, ontology, q_value_threshold) for set_name, ontology, q_value_threshold in combinations]


[None, None, None, None, None, None, None, None]

In [188]:
#fdr_table = pd.DataFrame(fdr_table, columns = ['set', 'ontology', 'q_value_threshold', 'p_value_threshold','observed_positive','average_false_positive','q'])
#fdr_table

Unnamed: 0,set,ontology,q_value_threshold,p_value_threshold,observed_positive,average_false_positive,q
0,ppn_pt_clustering,BP,0.05,0.0041,0,5.9302,inf
1,ppn_pt_clustering,BP,0.1,0.0041,0,5.9302,inf
2,ppn_pt_clustering,GWAS,0.05,0.0006,0,0.1404,inf
3,ppn_pt_clustering,GWAS,0.1,0.0006,0,0.1404,inf
4,ppn_pt_clustering,HPO,0.05,0.0097,0,5.3889,inf
5,ppn_pt_clustering,HPO,0.1,0.0097,0,5.3889,inf
6,ppn_pt_clustering,MP,0.05,0.0035,0,4.0352,inf
7,ppn_pt_clustering,MP,0.1,0.0035,0,4.0352,inf


## Differential Expression

In [215]:
genes = pd.read_csv('annotations/panTro6_genes.bed', sep = '\t', names = ['chr','start','end','transcript','gene'])
genes.head(5)

Unnamed: 0,chr,start,end,transcript,gene
0,chr1,344,12299,NM_001280424.1,INTS11
1,chr1,12400,16513,XM_016952033.2,CPTP
2,chr1,17887,23426,XM_003307748.4,TAS1R3
3,chr1,22943,36942,XM_016958290.2,DVL1
4,chr1,39619,45507,NM_001280245.1,MXRA8


In [216]:
len(genes)

20908

In [217]:
genes_header = ['gene']
missing_expression_genes = pd.read_csv('RNAseq/ppn_pt_missing_expression_genes.txt', sep = '\t', names = genes_header)
missing_expression_genes.head(5)

Unnamed: 0,gene
0,ALG11
1,AMELY
2,ARL4A
3,C12H12orf77
4,C19H19orf33


In [218]:
len(missing_expression_genes)

2517

In [219]:
genes_with_expression_data = genes[~genes['gene'].isin(missing_expression_genes['gene'])]
len(genes_with_expression_data)

20511

In [220]:
DE_genes = pd.read_csv('RNAseq/ppn_pt_DE_genes.txt', sep = '\t', names = genes_header)
non_DE_genes = pd.read_csv('RNAseq/ppn_pt_non_DE_genes.txt', sep = '\t', names = genes_header)

In [221]:
len(DE_genes)

3332

In [222]:
len(non_DE_genes)

28869

In [223]:
DE_genes = DE_genes[DE_genes['gene'].isin(genes_with_expression_data['gene'])]
len(DE_genes)

2957

In [224]:
non_DE_genes = non_DE_genes[non_DE_genes['gene'].isin(genes_with_expression_data['gene'])]
len(non_DE_genes)

17543

In [225]:
non_clustering = genes[~genes['gene'].isin(ppn_pt_clustering_windows_genes)]

In [226]:
len(list(set(ppn_pt_clustering_windows_genes).intersection(DE_genes['gene'])))

263

In [227]:
len(list(set(ppn_pt_clustering_windows_genes).intersection(non_DE_genes['gene'])))

1727

In [228]:
len(list(set(non_clustering['gene']).intersection(DE_genes['gene'])))

2694

In [229]:
len(list(set(non_clustering['gene']).intersection(non_DE_genes['gene'])))

15816

In [235]:
fisher_exact([[263,1727],[2694,15816]])

(0.8940513758297084, 0.10716918527545538)
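
The first value returned above is the sample odds ratio of the 2×2 table. A short sketch of recomputing it by hand from the four counts, which also makes the table layout explicit (rows: genes in clustering windows versus all other genes; columns: DE versus non-DE); the variable names are illustrative.

```python
# Rows: genes in clustering windows, genes elsewhere. Columns: DE, non-DE.
table = [[263, 1727], [2694, 15816]]
(clustering_de, clustering_non_de), (other_de, other_non_de) = table

# Sample odds ratio: odds of DE inside clustering windows over odds of DE elsewhere.
odds_ratio = (clustering_de * other_non_de) / (clustering_non_de * other_de)

# Proportion of DE genes in each row, for a more intuitive comparison.
prop_de_clustering = clustering_de / (clustering_de + clustering_non_de)
prop_de_other = other_de / (other_de + other_non_de)

print(odds_ratio)
print(prop_de_clustering, prop_de_other)
```

An odds ratio below 1 together with p ≈ 0.107 indicates a slightly lower share of DE genes in clustering windows that is not statistically significant at the 0.05 level.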

In [240]:
sorted(list(set(ppn_pt_clustering_windows_genes).intersection(DE_genes['gene'])))

['AATF',
 'ADAM22',
 'ADAMTSL2',
 'ADORA1',
 'AK3',
 'AK9',
 'AMZ1',
 'ANGPTL1',
 'ANGPTL8',
 'ANTXR2',
 'AP1S2',
 'APOL6',
 'APP',
 'AQP4',
 'ARL3',
 'ARRDC4',
 'ARSB',
 'ATG13',
 'ATP5PO',
 'ATP8B4',
 'BCL2L11',
 'BEND6',
 'BHMT2',
 'BTF3',
 'BTG2',
 'C10H10orf88',
 'C10H10orf95',
 'C11H11orf49',
 'C7H7orf25',
 'CAB39L',
 'CALB1',
 'CALCRL',
 'CARM1',
 'CCDC134',
 'CCDC177',
 'CCL4',
 'CCL5',
 'CDH11',
 'CDH22',
 'CDKN2B',
 'CELF3',
 'CEP295',
 'CEP85L',
 'CERCAM',
 'CHGB',
 'CHI3L1',
 'CHIT1',
 'CHMP4B',
 'CHRDL1',
 'CHST12',
 'CHST8',
 'CHST9',
 'CIZ1',
 'CLTRN',
 'CNIH3',
 'CNOT3',
 'CNST',
 'COL4A3',
 'COPG1',
 'COX7B',
 'CPEB4',
 'CPLANE1',
 'CRMP1',
 'CYP2C8',
 'CYTIP',
 'DDIT4L',
 'DGKB',
 'DGKD',
 'DGKZ',
 'DHRS11',
 'DHX30',
 'DLG5',
 'DLGAP1',
 'DNAJC15',
 'DOCK10',
 'EFR3B',
 'EIF3B',
 'EIF3G',
 'ELAVL3',
 'EMCN',
 'EPCAM',
 'EPOR',
 'EVC2',
 'F2',
 'FAM102A',
 'FAM151B',
 'FAM172A',
 'FAM198B',
 'FAT2',
 'FBXW4',
 'FGFR2',
 'FMOD',
 'FRMPD1',
 'FZD8',
 'GATB',
 'GBF1',
 '

In [None]:
windows_BED_df = pd.read_csv('phenotype_enrichment_2/data/panTro6_windows_with_full_coverage.bed', sep = '\t', names = ['chr','start','end','n_missing'])
windows_BED_df = windows_BED_df.drop(columns=['n_missing'])
windows_BED_df.head(5)

In [None]:
len(windows_BED_df)

In [None]:
sampled_df = windows_BED_df.sample(n = 339, random_state = 621)

In [None]:
sampled_df.head(5)

In [None]:
windows_BED_sample_pbtBED = pybedtools.BedTool().from_dataframe(sampled_df)

In [None]:
intersect = windows_BED_sample_pbtBED.intersect(genes_pbtBED, wa = True, wb = True).to_dataframe(names = ['window_chr','window_start','window_end','gene_chr','gene_start','gene_end','gene_transcript','gene'])

In [None]:
intersect.head()

In [None]:
shuffled_genes = list([x for x in intersect['gene'] if str(x) != '.'])

In [None]:
len(shuffled_genes)

In [None]:
genes_pbtBED.head()

In [None]:
ppn_pt_clustering_windows_pbtBED.head()

In [None]:
shuffled_windows = ppn_pt_clustering_windows_pbtBED.shuffle(g = 'phenotype_enrichment_2/data/panTro6_chr_lengths.txt', incl = 'phenotype_enrichment_2/data/panTro6_windows_with_full_coverage.bed')

In [None]:
shuffled_windows.head()

In [None]:
intersect = shuffled_windows.intersect(genes_pbtBED, wa = True, wb = True).to_dataframe(names = ['window_chr','window_start','window_end','gene_chr','gene_start','gene_end','gene_transcript','gene'])

In [None]:
intersect.head()

In [None]:
import random

In [None]:
random_indices = random.sample(range(1,4420), 339)

In [None]:
windows_pbtBED.head()

In [None]:
windows_pbtBED

In [None]:
selected_lines = [windows_pbtBED[index] for index in random_indices]

In [None]:
selected_lines

In [None]:
with open('phenotype_enrichment_2/data/panTro6_windows_with_full_coverage.bed', 'r') as file:
    for line in file:
        line = [x.strip() for x in line.split('\t')]

In [None]:
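# Collect every BED line as a list of stripped fields so that it can be sampled below
bed_lines = []
with open('phenotype_enrichment_2/data/panTro6_windows_with_full_coverage.bed', 'r') as file:
    for line in file:
        bed_lines.append([x.strip() for x in line.split('\t')])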

total_lines = len(bed_lines)
lines_to_select = 339

# Generate unique random numbers
random_indices = random.sample(range(total_lines), lines_to_select)

# Retrieve selected lines
selected_lines = [bed_lines[index] for index in random_indices]

In [None]:
pd.Series(selected_lines)

In [None]:
dir()

Let's create a BED file of the variants to complete intersections in the following steps. We will end up creating a BED file from a dataframe more than once so let's write a function.

In [None]:
def dataframe_to_BED(input_df):
    input_df_BED = input_df[['chr','pos']].copy()
    input_df_BED.rename(columns={ input_df_BED.columns[1]: 'end' }, inplace = True)
    input_df_BED['start'] = input_df_BED['end'].astype(int)-1
    input_df_BED = input_df_BED[['chr','start','end']]
    input_df_BED = input_df_BED.drop_duplicates()
    return input_df_BED
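
As a quick check of the coordinate conversion, the sketch below applies the same logic to a toy dataframe. The function name `dataframe_to_BED_sketch` and the toy values are illustrative assumptions, not project data. The point is that a 1-based `pos` becomes the 0-based, half-open interval `[pos - 1, pos)` and that duplicate positions collapse to a single row.

```python
import pandas as pd

def dataframe_to_BED_sketch(input_df):
    # Same logic as dataframe_to_BED above, written with explicit column names.
    bed = input_df[['chr', 'pos']].rename(columns={'pos': 'end'})
    bed['start'] = bed['end'].astype(int) - 1  # 1-based position -> 0-based start
    return bed[['chr', 'start', 'end']].drop_duplicates()

# Toy variants (hypothetical values); the first two rows share a position.
toy_variants = pd.DataFrame({'chr': ['chr1', 'chr1', 'chr2'],
                             'pos': [100, 100, 5000],
                             'alt': ['A', 'T', 'G']})
toy_BED = dataframe_to_BED_sketch(toy_variants)
print(toy_BED)
```

Because only `chr` and `pos` are kept, two different alternate alleles at the same position yield one BED row.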

In [None]:
ppn_pt_clustering_3d_modifying_variants_BED = dataframe_to_BED(ppn_pt_clustering_3d_modifying_variants)
ppn_pt_clustering_3d_modifying_variants_BED.head(5)

In [None]:
len(ppn_pt_clustering_3d_modifying_variants_BED)

Save this dataframe.

In [None]:
ppn_pt_clustering_3d_modifying_variants_BED.to_csv('clustering_windows/ppn_pt_clustering_3d_modifying_variants.bed', sep = '\t', header = False, index = False)

Convert to pybedtools.

In [None]:
ppn_pt_clustering_3d_modifying_variants_pbtBED = pybedtools.BedTool().from_dataframe(ppn_pt_clustering_3d_modifying_variants_BED).sort()
ppn_pt_clustering_3d_modifying_variants_pbtBED.head()

In [None]:
len(ppn_pt_clustering_3d_modifying_variants_pbtBED)

Now let's link these variants to TADs and then to genes. Use TADs for features in "A" and variants for features in "B" to make the downstream gene intersection easier. 

In [None]:
ppn_pt_clustering_3d_modifying_variant_TAD_intersection_pbtBED = TADs_pbtBED.intersect(ppn_pt_clustering_3d_modifying_variants_pbtBED, wa = True, wb = True)
ppn_pt_clustering_3d_modifying_variant_TAD_intersection_pbtBED.head(5)

In [None]:
len(ppn_pt_clustering_3d_modifying_variant_TAD_intersection_pbtBED)

Now load the gene annotations, perform the intersection, and convert to a dataframe.

In [None]:
genes_pbtBED = pybedtools.BedTool('annotations/panTro6_genes.bed')
genes_pbtBED.head(5)

In [None]:
ppn_pt_clustering_3d_modifying_variant_TAD_gene_intersection = genes_pbtBED.intersect(ppn_pt_clustering_3d_modifying_variant_TAD_intersection_pbtBED, wa = True, wb = True).to_dataframe(names=['gene_chr','gene_start','gene_end','transcript','gene','TAD_chr','TAD_start','TAD_end','variant_chr','variant_start','variant_end'])
ppn_pt_clustering_3d_modifying_variant_TAD_gene_intersection.head(5)

Now subset this dataframe to variants and genes. We will remove any duplicates because they represent nested TADs.

In [None]:
ppn_pt_clustering_3d_modified_genes = ppn_pt_clustering_3d_modifying_variant_TAD_gene_intersection[['variant_end','gene']]
ppn_pt_clustering_3d_modified_genes = ppn_pt_clustering_3d_modified_genes.drop_duplicates()

The Eres et al. 2019 TADs are missing for chromosome 7, so let's grab a list of nearby genes for any 3D-modifying variants found there.

In [None]:
ppn_pt_clustering_3d_modifying_chr7_variants = ppn_pt_clustering_3d_modifying_variants[ppn_pt_clustering_3d_modifying_variants['chr'] == 'chr7']
ppn_pt_clustering_3d_modifying_chr7_variants.head(9)

In [None]:
ppn_pt_clustering_3d_modifying_chr7_variants_BED = dataframe_to_BED(ppn_pt_clustering_3d_modifying_chr7_variants)
ppn_pt_clustering_3d_modifying_chr7_variants_BED.head(5)

Subtract 150,000 bp from the start coordinate and add 150,000 bp to the end coordinate to generate an average-sized TAD around the variant.

In [None]:
ppn_pt_clustering_3d_modifying_chr7_variants_BED['start'] = ppn_pt_clustering_3d_modifying_chr7_variants_BED['start'] - 150000
ppn_pt_clustering_3d_modifying_chr7_variants_BED['end'] = ppn_pt_clustering_3d_modifying_chr7_variants_BED['end'] + 150000
ppn_pt_clustering_3d_modifying_chr7_variants_BED.head(5)
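
One caveat with fixed padding: a variant within 150 kb of the chromosome start would get a negative `start`, which is not a valid BED coordinate. A minimal sketch of one way to guard against this, assuming toy coordinates and a hypothetical helper name `pad_variants`, clips the start at zero. Clipping the end at the chromosome length would need a table of chromosome sizes and is not shown.

```python
import pandas as pd

def pad_variants(bed_df, flank=150000):
    # Hypothetical helper: extend each interval by `flank` bp on both sides.
    padded = bed_df.copy()
    padded['start'] = (padded['start'] - flank).clip(lower=0)  # never below 0
    padded['end'] = padded['end'] + flank
    return padded

# Toy chr7 variants (made-up positions); the first sits near the chromosome start.
toy_BED = pd.DataFrame({'chr': ['chr7', 'chr7'],
                        'start': [49999, 999999],
                        'end': [50000, 1000000]})
padded_BED = pad_variants(toy_BED)
print(padded_BED)
```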

Convert to pybedtools format.

In [None]:
ppn_pt_clustering_3d_modifying_chr7_variants_pbtBED = pybedtools.BedTool().from_dataframe(ppn_pt_clustering_3d_modifying_chr7_variants_BED).sort()
ppn_pt_clustering_3d_modifying_chr7_variants_pbtBED.head()

And intersect with the gene annotations.

In [None]:
ppn_pt_clustering_3d_modifying_chr7_variants_TAD_gene_intersection = ppn_pt_clustering_3d_modifying_chr7_variants_pbtBED.intersect(genes_pbtBED, wa = True, wb = True).to_dataframe(names = ['variant_chr','variant_start','variant_end','gene_chr','gene_start','gene_end','gene_transcript','gene'])
ppn_pt_clustering_3d_modifying_chr7_variants_TAD_gene_intersection.head(5)

Now combine chr7 with the rest of the autosomes and save the gene list for phenotype enrichment analyses.

In [None]:
ppn_pt_clustering_3d_modified_genes = list([x for x in ppn_pt_clustering_3d_modified_genes['gene'] if str(x) != '.']) + list([x for x in ppn_pt_clustering_3d_modifying_chr7_variants_TAD_gene_intersection['gene'] if str(x) != '.'])
ppn_pt_clustering_3d_modified_genes = pd.DataFrame(ppn_pt_clustering_3d_modified_genes, columns = ['gene'])
ppn_pt_clustering_3d_modified_genes.head(5)

In [None]:
len(ppn_pt_clustering_3d_modified_genes)

In [None]:
ppn_pt_clustering_3d_modified_genes.to_csv('clustering_windows/ppn_pt_clustering_3d_modified_genes.txt', sep = '\t', header = False, index = False)

### Divergent Windows: In Silico Mutagenesis <a class = 'anchor' id = 'divergentwindowsinsilicomutagenesis'></a>

Now for variants in divergent windows. 

In [None]:
ppn_pt_divergent_window_variants = pd.read_csv('in_silico_mutagenesis/ppn_pt_divergent_window_variants.txt', sep = '\t', header = 0)
ppn_pt_divergent_window_variants.head(5)

In [None]:
len(ppn_pt_divergent_window_variants)

In [None]:
ppn_pt_divergent_window_variants['window'] = ppn_pt_divergent_window_variants['chr'] + '_' + ppn_pt_divergent_window_variants['window'].astype(str)
ppn_pt_divergent_window_variants.head(5)

Now map the per-window maximum divergence observed among chimpanzees (`ppn_pt_divergent_windows_chimp_maxes`) to the windows in the in silico mutagenesis dataframe.

In [None]:
ppn_pt_divergent_window_variants['empirical_max'] = ppn_pt_divergent_window_variants['window'].map(ppn_pt_divergent_windows_chimp_maxes)
ppn_pt_divergent_window_variants.head(5)

Which variants cause a 3D change that is greater than or equal to the observed maximum difference in chimpanzees?

In [None]:
ppn_pt_divergent_3d_modifying_variants = ppn_pt_divergent_window_variants[ppn_pt_divergent_window_variants['1-spearman'] >= ppn_pt_divergent_window_variants['empirical_max']]
ppn_pt_divergent_3d_modifying_variants.head(5)
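
The mapping and filtering steps above can be sketched on toy data as follows. The window names, scores, and the dictionary `toy_chimp_maxes` are made up for illustration. The sketch assumes, as the variable name suggests, that `ppn_pt_divergent_windows_chimp_maxes` maps each window to the largest divergence observed among chimpanzees.

```python
import pandas as pd

# Toy in silico mutagenesis results (hypothetical values).
toy_variants = pd.DataFrame({'window': ['chr1_0', 'chr1_0', 'chr2_1048576'],
                             '1-spearman': [0.002, 0.050, 0.010]})

# Toy per-window empirical maximum among chimpanzees (hypothetical values).
toy_chimp_maxes = {'chr1_0': 0.01, 'chr2_1048576': 0.01}

# Attach the window-level threshold to each variant, then keep variants that reach it.
toy_variants['empirical_max'] = toy_variants['window'].map(toy_chimp_maxes)
toy_3d_modifying = toy_variants[toy_variants['1-spearman'] >= toy_variants['empirical_max']]
print(toy_3d_modifying)
```

A window missing from the dictionary maps to `NaN`, and any comparison with `NaN` is `False`, so variants in such windows drop out silently. Checking for missing thresholds before filtering may be worthwhile.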

In [None]:
len(ppn_pt_divergent_3d_modifying_variants)

How many windows are represented by these variants?

In [None]:
len(ppn_pt_divergent_3d_modifying_variants['window'].unique())

Output these variants in the original in silico mutagenesis input format so that we can generate some maps.

In [None]:
ppn_pt_divergent_3d_modifying_variants_with_window = ppn_pt_divergent_3d_modifying_variants[['chr','pos','ref','alt','window']].copy()
ppn_pt_divergent_3d_modifying_variants_with_window['window'] = ppn_pt_divergent_3d_modifying_variants_with_window['window'].str.split('_').str[1]
ppn_pt_divergent_3d_modifying_variants_with_window.head(5)

In [None]:
ppn_pt_divergent_3d_modifying_variants_with_window.to_csv('divergent_windows/ppn_pt_divergent_3d_modifying_variants_with_window.txt', sep = '\t', header = False, index = False)

Let's create a BED file of the variants for the intersections in the following steps.

In [None]:
ppn_pt_divergent_3d_modifying_variants_BED = dataframe_to_BED(ppn_pt_divergent_3d_modifying_variants)
ppn_pt_divergent_3d_modifying_variants_BED.head(5)

In [None]:
len(ppn_pt_divergent_3d_modifying_variants_BED)

We will be shuffling these variants in our phenotype enrichment momentarily so go ahead and save the frame.

In [None]:
ppn_pt_divergent_3d_modifying_variants_BED.to_csv('divergent_windows/ppn_pt_divergent_3d_modifying_variants.bed', sep = '\t', header = False, index = False)

Now let's get these variants in pybedtools BED format.

In [None]:
ppn_pt_divergent_3d_modifying_variants_pbtBED = pybedtools.BedTool().from_dataframe(ppn_pt_divergent_3d_modifying_variants_BED).sort()
ppn_pt_divergent_3d_modifying_variants_pbtBED.head()

In [None]:
len(ppn_pt_divergent_3d_modifying_variants_pbtBED)

In [None]:
ppn_pt_divergent_3d_modifying_variant_TAD_intersection_pbtBED = TADs_pbtBED.intersect(ppn_pt_divergent_3d_modifying_variants_pbtBED, wa = True, wb = True)
ppn_pt_divergent_3d_modifying_variant_TAD_intersection_pbtBED.head(5)

In [None]:
len(ppn_pt_divergent_3d_modifying_variant_TAD_intersection_pbtBED)

Intersect with genes and convert to a dataframe.

In [None]:
ppn_pt_divergent_3d_modifying_variant_TAD_gene_intersection = genes_pbtBED.intersect(ppn_pt_divergent_3d_modifying_variant_TAD_intersection_pbtBED, wa = True, wb = True).to_dataframe(names=['gene_chr','gene_start','gene_end','transcript','gene','TAD_chr','TAD_start','TAD_end','variant_chr','variant_start','variant_end'])
ppn_pt_divergent_3d_modifying_variant_TAD_gene_intersection.head(5)

Now subset this dataframe to variants and genes. We will remove any duplicates because they represent nested TADs.

In [None]:
ppn_pt_divergent_3d_modified_genes = ppn_pt_divergent_3d_modifying_variant_TAD_gene_intersection[['variant_end','gene']]
ppn_pt_divergent_3d_modified_genes = ppn_pt_divergent_3d_modified_genes.drop_duplicates()

Tackle the chromosome 7 situation again.

In [None]:
ppn_pt_divergent_3d_modifying_chr7_variants = ppn_pt_divergent_3d_modifying_variants[ppn_pt_divergent_3d_modifying_variants['chr'] == 'chr7']
ppn_pt_divergent_3d_modifying_chr7_variants.head(9)

In [None]:
ppn_pt_divergent_3d_modifying_chr7_variants_BED = dataframe_to_BED(ppn_pt_divergent_3d_modifying_chr7_variants)
ppn_pt_divergent_3d_modifying_chr7_variants_BED.head(5)

Subtract 150,000 bp from the start coordinate and add 150,000 bp to the end coordinate to generate an average-sized TAD around the variant.

In [None]:
ppn_pt_divergent_3d_modifying_chr7_variants_BED['start'] = ppn_pt_divergent_3d_modifying_chr7_variants_BED['start'] - 150000
ppn_pt_divergent_3d_modifying_chr7_variants_BED['end'] = ppn_pt_divergent_3d_modifying_chr7_variants_BED['end'] + 150000
ppn_pt_divergent_3d_modifying_chr7_variants_BED.head(5)

Convert to pybedtools format.

In [None]:
ppn_pt_divergent_3d_modifying_chr7_variants_pbtBED = pybedtools.BedTool().from_dataframe(ppn_pt_divergent_3d_modifying_chr7_variants_BED).sort()
ppn_pt_divergent_3d_modifying_chr7_variants_pbtBED.head()

And intersect with the gene annotations.

In [None]:
ppn_pt_divergent_3d_modifying_chr7_variants_TAD_gene_intersection = ppn_pt_divergent_3d_modifying_chr7_variants_pbtBED.intersect(genes_pbtBED, wa = True, wb = True).to_dataframe(names = ['variant_chr','variant_start','variant_end','gene_chr','gene_start','gene_end','gene_transcript','gene'])
ppn_pt_divergent_3d_modifying_chr7_variants_TAD_gene_intersection.head(5)

Now combine chr7 with the rest of the autosomes and save the gene list for phenotype enrichment analyses.

In [None]:
ppn_pt_divergent_3d_modified_genes = list([x for x in ppn_pt_divergent_3d_modified_genes['gene'] if str(x) != '.']) + list([x for x in ppn_pt_divergent_3d_modifying_chr7_variants_TAD_gene_intersection['gene'] if str(x) != '.'])
ppn_pt_divergent_3d_modified_genes = pd.DataFrame(ppn_pt_divergent_3d_modified_genes, columns = ['gene'])
ppn_pt_divergent_3d_modified_genes.head(5)

In [None]:
len(ppn_pt_divergent_3d_modified_genes)

In [None]:
ppn_pt_divergent_3d_modified_genes.to_csv('divergent_windows/ppn_pt_divergent_3d_modified_genes.txt', sep = '\t', header = False, index = False)

### Divergent Windows: Differential Gene Expression <a class = 'anchor' id = 'divergentwindowsexpression'></a>

Let's consider whether genes topologically associated with 3d modifying variants between bonobos and chimpanzees exhibit differential gene expression. Download the SRA data from Brawand et al. 2011 and run the RNAseq pipeline to generate read counts per gene per sample. Below we will gather and analyze these data. 

Write a tissue dictionary.

In [None]:
tissue_dict = dict({'SRR306811': 'prefrontal_cortex',
                 'SRR306817': 'cerebellum',
                 'SRR306818': 'cerebellum', 
                 'SRR306819': 'heart', 
                 'SRR306820': 'heart', 
                 'SRR306821': 'kidney', 
                 'SRR306822': 'kidney', 
                 'SRR306823': 'liver', 
                 'SRR306824': 'liver', 
                 'SRR306825': 'testis',
                 'SRR306827': 'prefrontal_cortex',
                 'SRR306828': 'prefrontal_cortex',
                 'SRR306829': 'cerebellum',
                 'SRR306830': 'cerebellum', 
                 'SRR306831': 'heart', 
                 'SRR306832': 'heart', 
                 'SRR306833': 'kidney', 
                 'SRR306834': 'kidney', 
                 'SRR306835': 'liver', 
                 'SRR306836': 'liver', 
                 'SRR306837': 'testis'})

Define which samples belong to which individual.

In [None]:
ppn_female = ['SRR306827','SRR306829','SRR306831','SRR306833','SRR306835']
ppn_male = ['SRR306828','SRR306830','SRR306832','SRR306834','SRR306836','SRR306837']
ptr_female = ['SRR306811','SRR306817','SRR306819','SRR306821','SRR306823']
ptr_male = ['SRR306818','SRR306820','SRR306822','SRR306824','SRR306825']

Now let's write a function to gather the read count data per individual.

In [None]:
def reads_per_individual(individual, SRR_ids_list):
    individual_read_counts_dfs_list = []
    
    for i in range(len(SRR_ids_list)):
        individual_temp_df = pd.read_csv('RNAseq/read_counts/'+SRR_ids_list[i]+'_read_counts.txt', sep = '\t', names = ['gene',SRR_ids_list[i]])
        individual_read_counts_dfs_list.append(individual_temp_df)
        
    for df in individual_read_counts_dfs_list:
        df.set_index('gene', inplace = True)
        
    individual_read_counts_df = pd.concat(individual_read_counts_dfs_list, axis = 1, sort = False).reset_index()
    
    individual_read_counts_df = individual_read_counts_df.set_index(['gene']).stack().to_frame().reset_index()
    individual_read_counts_df.rename(columns={ 'level_1': 'sample', 0: individual}, inplace = True)
    individual_read_counts_df['tissue'] = individual_read_counts_df['sample'].map(tissue_dict)
    individual_read_counts_df = individual_read_counts_df.drop('sample', axis = 1)
    individual_read_counts_df = individual_read_counts_df[~individual_read_counts_df['gene'].str.startswith('_')]
    individual_read_counts_df = individual_read_counts_df[~individual_read_counts_df['gene'].str.startswith('unassigned')]
    individual_read_counts_df = individual_read_counts_df[['gene','tissue',individual]]
    
    return individual_read_counts_df

Apply the function to all four individuals.

In [None]:
ppn_female_reads = reads_per_individual('ppn_female', ppn_female).set_index(['gene','tissue'])
ppn_male_reads = reads_per_individual('ppn_male', ppn_male).set_index(['gene','tissue'])
ptr_female_reads = reads_per_individual('ptr_female', ptr_female).set_index(['gene','tissue'])
ptr_male_reads = reads_per_individual('ptr_male', ptr_male).set_index(['gene','tissue'])

Gather those data and check out the dataframe.

In [None]:
gene_expression = pd.concat([ppn_female_reads, ppn_male_reads, ptr_female_reads, ptr_male_reads], axis = 1, sort = False).reset_index()
gene_expression.head(12)

Now let's calculate the mean per species and the log2 fold change between species (chimpanzee over bonobo).

In [None]:
gene_expression['ppn_mean'] = gene_expression[['ppn_female', 'ppn_male']].mean(axis=1, skipna=True)
gene_expression['ptr_mean'] = gene_expression[['ptr_female', 'ptr_male']].mean(axis=1, skipna=True)
gene_expression['fold_change'] = np.log2(gene_expression['ptr_mean']/gene_expression['ppn_mean'])
gene_expression.head(12)

In [None]:
gene_expression.replace([np.inf, -np.inf], np.nan, inplace=True)
gene_expression.dropna(inplace=True)
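
To see why infinite values arise, the sketch below reproduces the fold-change calculation on toy read counts (hypothetical values). A zero mean in one species gives a log2 ratio of `inf` or `-inf`, and zero in both gives `NaN`. All such rows are dropped by the cell above.

```python
import numpy as np
import pandas as pd

# Toy species means (made-up values) covering each edge case.
toy = pd.DataFrame({'gene': ['g1', 'g2', 'g3', 'g4'],
                    'ppn_mean': [10.0, 0.0, 8.0, 0.0],
                    'ptr_mean': [40.0, 5.0, 0.0, 0.0]})

# Suppress the expected divide-by-zero warnings for this illustration.
with np.errstate(divide='ignore', invalid='ignore'):
    toy['fold_change'] = np.log2(toy['ptr_mean'] / toy['ppn_mean'])

# Same cleanup as above: infinite values become NaN, then those rows are dropped.
toy_clean = toy.replace([np.inf, -np.inf], np.nan).dropna()
print(toy_clean)
```

Only `g1` survives, with a log2 fold change of 2, which corresponds to a four-fold difference in mean read count. Genes expressed in only one species are therefore excluded from the downstream comparison.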

In [None]:
gene_expression.head(12)

In [None]:
gene_expression[gene_expression['gene'] == 'ZNF804B']

In [None]:
len(gene_expression)

Let's test for enrichment of differentially expressed genes in 3D clustering and divergent windows. First, we need to intersect the windows with the genes.

In [None]:
ppn_pt_clustering_windows_pbtBED = pybedtools.BedTool().from_dataframe(ppn_pt_clustering_windows_BED)

In [None]:
ppn_pt_clustering_windows_genes_intersect = ppn_pt_clustering_windows_pbtBED.intersect(genes_pbtBED, wa = True, wb = True).to_dataframe(names = ['window_chr','window_start','window_end','gene_chr','gene_start','gene_end','gene_transcript','gene'])

In [None]:
len(ppn_pt_clustering_windows_genes_intersect)

In [None]:
339/4420

In [None]:
2412/20920

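Let's test for enrichment of differentially expressed genes among the genes linked to 3D-modifying variants in divergent windows.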

In [None]:
genes = pd.read_csv('annotations/panTro6_genes.bed', sep = '\t', names = ['chr','start','end','transcript','gene'])
genes.head(5)

In [None]:
ppn_pt_divergent_3d_modified_genes.head(5)

In [None]:
ppn_pt_divergent_3d_modified_genes = ppn_pt_divergent_3d_modified_genes.drop_duplicates()
ppn_pt_non_divergent_3d_modified_genes = genes[~genes['gene'].isin(ppn_pt_divergent_3d_modified_genes['gene'])]

In [None]:
len(ppn_pt_divergent_3d_modified_genes)

In [None]:
len(ppn_pt_non_divergent_3d_modified_genes)

In [None]:
differently_expressed_genes = gene_expression[(gene_expression['fold_change'] >= 2) | (gene_expression['fold_change'] <= -2)]['gene'].to_frame('gene')
differently_expressed_genes = differently_expressed_genes.drop_duplicates()
differently_expressed_genes.head(5)

In [None]:
len(differently_expressed_genes)

In [None]:
non_differently_expressed_genes = genes[~genes['gene'].isin(differently_expressed_genes['gene'])]

In [None]:
len(non_differently_expressed_genes)

In [None]:
len(list(set(ppn_pt_divergent_3d_modified_genes['gene']).intersection(differently_expressed_genes['gene'])))

In [None]:
len(list(set(ppn_pt_divergent_3d_modified_genes['gene']).intersection(non_differently_expressed_genes['gene'])))

In [None]:
len(list(set(ppn_pt_non_divergent_3d_modified_genes['gene']).intersection(differently_expressed_genes['gene'])))

In [None]:
len(list(set(ppn_pt_non_divergent_3d_modified_genes['gene']).intersection(non_differently_expressed_genes['gene'])))
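
The four counts above form the 2 x 2 contingency table passed to `fisher_exact`. A minimal sketch of assembling that table directly from gene sets, using made-up gene names and a hypothetical helper `contingency_from_sets`, might look like this. It avoids copying counts by hand.

```python
from scipy.stats import fisher_exact

def contingency_from_sets(target, background, positive):
    # Hypothetical helper. Rows: target genes vs. all other background genes.
    # Columns: genes in `positive` (e.g., differentially expressed) vs. not.
    target, background, positive = set(target), set(background), set(positive)
    target = target & background          # ignore target genes outside the background
    others = background - target
    return [[len(target & positive), len(target - positive)],
            [len(others & positive), len(others - positive)]]

# Toy gene sets (made-up names).
background = ['g' + str(i) for i in range(1, 11)]
target = ['g1', 'g2', 'g3']
positive = ['g1', 'g2', 'g4']

table = contingency_from_sets(target, background, positive)
odds_ratio, p_value = fisher_exact(table)
print(table, odds_ratio, p_value)
```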

In [None]:
list(set(ppn_pt_divergent_3d_modified_genes['gene']).intersection(differently_expressed_genes['gene']))

In [None]:
fisher_exact([[35,118],[3655,17100]])

In [None]:
fisher_exact([[55,98],[6066,14689]])

In [None]:
fisher_exact([[96,57],[10948,9807]])

In [None]:
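def fisher_exact_with_CI(A, B, C, D):
    # Odds ratio and p-value from Fisher's exact test, plus a 95% confidence
    # interval from the normal approximation on the log odds ratio.
    # All four counts must be non-zero for the interval to be defined.
    OR, p = fisher_exact([[A, B], [C, D]])
    SE = np.sqrt((1/A) + (1/B) + (1/C) + (1/D))
    lCI = np.exp(np.log(OR) - (1.96 * SE))
    uCI = np.exp(np.log(OR) + (1.96 * SE))
    return OR, p, lCI, uCI

In [None]:
fisher_exact_with_CI(35,118,3655,17100)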

In [None]:
len(ppn_pt_divergent_3d_modified_genes)

In [None]:
len(ppn_pt_divergent_3d_modified_genes.drop_duplicates())

In [None]:
len(ppn_pt_non_divergent_3d_modified_genes)

In [None]:
ppn_pt_non_divergent_3d_modified_genes = genes[~genes['gene'].isin(ppn_pt_divergent_3d_modified_genes['gene'])]

In [None]:
len(genes)

In [None]:
gene_expression_subset = gene_expression[gene_expression['gene'].isin(ppn_pt_divergent_3d_modified_genes['gene'])]
gene_expression_subset.head(12)

In [None]:
len(gene_expression_subset)

In [None]:
len(gene_expression_subset['gene'].unique())

In [None]:
gene_expression_others = gene_expression[~gene_expression['gene'].isin(ppn_pt_divergent_3d_modified_genes['gene'])]
gene_expression_others.head(12)

In [None]:
len(gene_expression_others)

In [None]:
gene_expression_subset.to_csv('divergent_windows/gene_expression_subset.txt', sep = '\t', header = True, index = False)

In [None]:
gene_expression_others.to_csv('divergent_windows/gene_expression_others.txt', sep = '\t', header = True, index = False)

In [None]:
len(gene_expression_subset[gene_expression_subset['fold_change'] > 2])

In [None]:
gene_expression_subset[gene_expression_subset['fold_change'] > 2].head(30)

In [None]:
len(gene_expression_subset[gene_expression_subset['fold_change'] < -2])

In [None]:
gene_expression_subset[gene_expression_subset['fold_change'] < -2]

## Lineage-Specific Substitutions

In [None]:
pte_specific_variants = pd.read_csv('in_silico_mutagenesis/pte_specific_variants.txt', sep = '\t', header = 0)
pte_specific_variants.head(5)

In [None]:
pte_specific_variants[pte_specific_variants['1-spearman'] > 0.001]

In [None]:
pts_specific_variants = pd.read_csv('in_silico_mutagenesis/pts_variants.txt', sep = '\t', header = 0)
pts_specific_variants.head(5)

In [None]:
pts_specific_variants[pts_specific_variants['1-spearman'] > 0.001]

In [None]:
ptt_specific_variants = pd.read_csv('in_silico_mutagenesis/ptt_variants.txt', sep = '\t', header = 0)
ptt_specific_variants.head(5)

In [None]:
ptt_specific_variants[ptt_specific_variants['1-spearman'] > 0.001]

In [None]:
ptv_specific_variants = pd.read_csv('in_silico_mutagenesis/ptv_specific_variants.txt', sep = '\t', header = 0)
ptv_specific_variants.head(5)

In [None]:
ptv_specific_variants[ptv_specific_variants['1-spearman'] > 0.001]

## Phenotype Enrichment

In [None]:
fdr_table = []

In [None]:
def reportFDRcorrectedPthreshold(set_name, ontology, q_value_threshold, resolution=0.0001, minStart=0):
    fdr_empiric = pd.read_csv(f'phenotype_enrichment/empiric_FDR/{set_name}_{ontology}_empiric_FDR.txt', sep = '\t', header = None, index_col = 0)
    obs = pd.read_csv(f'phenotype_enrichment/enrichment/{set_name}_{ontology}_enrichment.txt', sep = '\t')

    fdr_threshold = []
    for i in np.arange(minStart,0.05,resolution):
        
        observed_positive = sum(obs['p_value'] <= i)
        average_false_positive = (fdr_empiric <= i).sum().mean()
        q = average_false_positive/observed_positive
        fdr_threshold.append([set_name, ontology, q_value_threshold, i, observed_positive, average_false_positive, q])
        
        if (q != np.inf) & (q > q_value_threshold):
            break
    
    threshold = fdr_threshold[-2]
    fdr_table.append(threshold)
    #fdr_threshold = pd.DataFrame(fdr_threshold, columns = ['pval_threshold','obsPos','avgFalsePos','q'])
    #return fdr_threshold.tail(2).head(1)
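
The function above scans p-value thresholds and, at each one, estimates the false discovery rate as the mean number of null tests passing the threshold divided by the number of observed tests passing it. A minimal sketch of that calculation at a single threshold follows. It assumes synthetic p-values, a hypothetical helper name `empirical_q`, and null p-values laid out with one column per shuffled replicate, as the column-wise sum in the function implies.

```python
import numpy as np
import pandas as pd

def empirical_q(observed_p, null_p, threshold):
    # Hypothetical helper: empirical FDR estimate at one p-value threshold.
    observed_positive = int((observed_p <= threshold).sum())
    # Count passing tests per replicate (column), then average across replicates.
    average_false_positive = float((null_p <= threshold).sum().mean())
    if observed_positive == 0:
        return np.nan  # the ratio is undefined without any observed positives
    return average_false_positive / observed_positive

# Synthetic p-values: five observed tests and two null replicates.
observed_p = pd.Series([0.001, 0.004, 0.020, 0.300, 0.700])
null_p = pd.DataFrame({'rep1': [0.003, 0.40, 0.60, 0.80, 0.90],
                       'rep2': [0.20, 0.50, 0.55, 0.70, 0.95]})

q = empirical_q(observed_p, null_p, threshold=0.005)
print(q)
```

Here two observed tests pass the threshold and the replicates contribute one and zero passing tests, so the estimate is 0.5 / 2. Returning `NaN` for zero observed positives is a choice made for this sketch. The function above instead lets the division produce `inf` or `NaN` and skips the break condition when the result is `inf`.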

In [None]:
combinations = [(set_name,ontology,q_value_threshold) for set_name in ['ppn_pt_clustering','ppn_pt_divergent'] for ontology in ['BP','GWAS','HPO','MP'] for q_value_threshold in [0.05,0.1]]

In [None]:
[reportFDRcorrectedPthreshold(set_name, ontology, q_value_threshold) for set_name, ontology, q_value_threshold in combinations]

In [None]:
fdr_table = pd.DataFrame(fdr_table, columns = ['set', 'ontology', 'q_value_threshold', 'p_value_threshold','observed_positive','average_false_positive','q'])
fdr_table

Let's go ahead and split the window column into a chromosome and a window start coordinate.

In [None]:
divergent_windows['window_split'] = divergent_windows['window']
divergent_windows = divergent_windows['window_split'].str.split('_', expand=True)
divergent_windows.rename(columns = {0:'chr', 1:'window_start'}, inplace = True)
divergent_windows.head(5)

Let's output these divergent windows. 

In [None]:
divergent_windows['window_start'] = divergent_windows['window_start'].astype(int)

In [None]:
divergent_windows['window_end'] = divergent_windows['window_start'] + 1048576
divergent_windows.head(5)

Note that we do not need to convert the start coordinate here because the windows are already in 0-based coordinates.

In [None]:
divergent_windows.to_csv('divergent_windows/ppn_pt_divergent_windows.bed', sep = '\t', header = False, index = False)

Now let's intersect our divergent windows with the genes in each window.

In [None]:
genes_header = ['chr','window_start','genes']
genes = pd.read_csv('windows/pantro6_windows_with_genes.txt', sep = '\t', header = None, names = genes_header)
genes.head(5)

In [None]:
len(genes)

In [None]:
divergent_windows_genes = pd.merge(divergent_windows, genes, on = ['chr','window_start'])
divergent_windows_genes.head(5)

In [None]:
divergent_windows_genes.to_csv('divergent_windows/ppn_pt_divergent_windows_with_genes.bed', sep = '\t', header = False, index = False)

In [None]:
divergent_genes = divergent_windows_genes['genes']
divergent_genes.head(5)

In [None]:
divergent_genes = divergent_genes.str.split(',').explode().reset_index(drop = True).to_frame('gene')
divergent_genes = divergent_genes.drop_duplicates()
divergent_genes = divergent_genes['gene'].str.strip().dropna() # exploding created white space and at least one NA
divergent_genes = divergent_genes.sort_values(ascending = True)
divergent_genes.head(20)
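
The splitting step can be illustrated on a toy column of comma-separated gene names (hypothetical values). This sketch strips whitespace before dropping duplicates, so names that differ only by surrounding spaces collapse to one entry. The cell above drops duplicates first, which may leave such names duplicated.

```python
import pandas as pd

# Toy per-window gene strings (made-up); one window has no genes.
toy_genes = pd.Series(['AATF, ADAM22', 'ADAM22,AK3', None])

toy_gene_list = (toy_genes.str.split(',')   # one list of names per window
                 .explode()                 # one name per row
                 .str.strip()               # remove whitespace left by the split
                 .dropna()                  # drop windows without genes
                 .drop_duplicates()
                 .sort_values()
                 .reset_index(drop=True))
print(toy_gene_list.tolist())
```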

In [None]:
len(divergent_genes)

In [None]:
divergent_genes.to_csv('divergent_windows/ppn_pt_genes.txt', sep = '\t', header = False, index = False)

## Non-Bonobo Divergent Windows

In [None]:
#get_divergent_windows(['pte-pts','pte-ptt','pts-ptv','ptt-ptv'], ['ppn-pt','ppn'])

In [None]:
#get_divergent_windows(['pte-pts','pts-ptv'], ['ppn-pt','ppn','pte-ptt','pts-ptt','ptt-ptv','ptt'])

In [None]:
#get_divergent_windows(['pts-ptt'], ['ppn-pt','ppn','ptt','pts',])

In [None]:
#get_divergent_windows(['pte'], ['ppn-pt','ppn','ptt','pts'])

In [None]:
#get_divergent_windows(['ptv'], ['ppn-pt','ppn'])

## Sequence Divergence <a class = 'anchor' id = 'sequencedivergence'></a>

In [None]:
rho, p = spearmanr(comparisons['divergence'], comparisons['seq_diff'])
print(rho,p)

## TADs

In [None]:
TADs_header = ['chr','start','end','score','C3624_overlap','C3649_overlap','C3651_overlap','C40300_overlap']
TADs = pd.read_csv('annotations/panTro6_TADs.bed', sep = '\t', header = None, names = TADs_header)
TADs.head(5)

In [None]:
len(TADs)

In [None]:
#TADs_BED = pybedtools.BedTool().from_dataframe(TADs)
#TADs_BED.head(5)

In [None]:
#ppn_pt_divergent_loci_header = ['chr','end','ref_allele','alt_allele']
#ppn_pt_divergent_loci = pd.read_csv('variant_loci/subset_ppn_specific_loci.txt', sep = '\t', header = None, names = ppn_pt_divergent_loci_header)
#ppn_pt_divergent_loci.head(5)

In [None]:
#ppn_pt_divergent_loci['start'] = ppn_pt_divergent_loci['end'] - 1
#ppn_pt_divergent_loci = ppn_pt_divergent_loci[['chr','start','end','ref_allele','alt_allele']]
#ppn_pt_divergent_loci.head(5)

In [None]:
#ppn_pt_divergent_loci_BED = pybedtools.BedTool().from_dataframe(ppn_pt_divergent_loci)
#ppn_pt_divergent_loci_BED.head(5)

In [None]:
#TADs_ppn_pt_variant_intersection = TADs_BED.intersect(ppn_pt_divergent_loci_BED, c = True).to_dataframe(names=['chr','start','end','score','C3624_overlap','C3649_overlap','C3651_overlap','C40300_overlap','N_variants'])
#TADs_ppn_pt_variant_intersection['length'] = (TADs_ppn_pt_variant_intersection['end'] - TADs_ppn_pt_variant_intersection['start'])
#TADs_ppn_pt_variant_intersection['variants/bp'] = TADs_ppn_pt_variant_intersection['N_variants'] / TADs_ppn_pt_variant_intersection['length']
#TADs_ppn_pt_variant_intersection.head(5)

In [None]:
#TADs_ppn_pt_variant_intersection['length'].min()

In [None]:
#TADs_ppn_pt_variant_intersection['length'].mean()

In [None]:
#TADs_ppn_pt_variant_intersection['length'].max()

In [None]:
#TADs_ppn_pt_variant_intersection['variants/bp'].min()

In [None]:
#TADs_ppn_pt_variant_intersection['variants/bp'].mean()

In [None]:
#TADs_ppn_pt_variant_intersection['variants/bp'].max()

## Individually Driven Windows <a class = 'anchor' id = 'individuallydrivenwindows'></a>

Some of these highly divergent windows appear to be driven by individuals that differ from all others, regardless of population. Let's try to write a function to identify windows where a single individual is very divergent. Start with a list of unique windows.

In [None]:
windows_list = comparisons['window'].unique().tolist()

Windows where a single individual is unique should be characterized by 55 comparisons that include that individual as ind1 or ind2 with a divergence score higher than all other comparisons. Therefore, we can sort each window by decreasing divergence and count the number of times each individual appears in the top 55 comparisons. If this count = 55, we have an individually driven window ('max'). We should also consider the mean value for those 55 comparisons because IDWs should have a relatively large value ('55_mean'). Further, there should be a considerable difference in the divergence score between the 55th and 56th comparisons ('55_56_diff'). Finally, the variance of the divergence score for the entire window should be large for IDWs ('window_variance').

In [None]:
def individual_driven_windows():
    
    counts_list = []
    means_list = []
    diffs_list = []
    variance_list = []
    
    for window in windows_list:
        subset = comparisons[comparisons['window'] == window].sort_values(by = 'divergence', ascending = False).reset_index()
        top_subset = subset[0:55]
        id1 = top_subset.groupby(['ind1']).size().reset_index(name='N').rename(columns={'ind1': 'ind'})
        id2 = top_subset.groupby(['ind2']).size().reset_index(name='N').rename(columns={'ind2': 'ind'})
        id_total = pd.concat([id1, id2]).groupby(['ind']).sum().reset_index()
        window_dict = dict(zip(id_total.ind, id_total.N))
        counts_list.append(window_dict)
        
        subset_mean = subset['divergence'].iloc[0:55].mean()
        means_list.append(subset_mean)
        
        subset_diff = subset['divergence'].iloc[54] - subset['divergence'].iloc[55]
        diffs_list.append(subset_diff)
        
        variance = subset['divergence'].var()
        variance_list.append(variance)
        
    df = pd.DataFrame.from_dict(counts_list)
    df['window'] = windows_list
    df['max'] = df[['Akwaya-Jean','Alfred','Alice','Andromeda','Athanga','Berta','Bihati','Blanquita','Bono','Bosco','Brigitta','Bwamble','Cindy-schwein','Cindy-troglodytes','Cindy-verus','Cleo','Clint','Coco-chimp','Damian','Desmond','Doris','Dzeeta','Frederike','Gamin','Hermien','Hortense','Ikuru','Jimmie','Julie-A959','Julie-LWC21','Kidongo','Koby','Kombote','Kosana','Koto','Kumbuka','Lara','Linda','Luky','Marlin','Maya','Mgbadolite','Mirinda','Nakuu','Natalie','Negrita','SeppToni','Taweh','Tibe','Tongo','Trixie','Ula','Vaillant','Vincent','Washu','Yogui']].max(axis=1)
    df['55_mean'] = means_list
    df['55_56_diff'] = diffs_list
    df['window_variance'] = variance_list
    df = df[['window','Akwaya-Jean','Alfred','Alice','Andromeda','Athanga','Berta','Bihati','Blanquita','Bono','Bosco','Brigitta','Bwamble','Cindy-schwein','Cindy-troglodytes','Cindy-verus','Cleo','Clint','Coco-chimp','Damian','Desmond','Doris','Dzeeta','Frederike','Gamin','Hermien','Hortense','Ikuru','Jimmie','Julie-A959','Julie-LWC21','Kidongo','Koby','Kombote','Kosana','Koto','Kumbuka','Lara','Linda','Luky','Marlin','Maya','Mgbadolite','Mirinda','Nakuu','Natalie','Negrita','SeppToni','Taweh','Tibe','Tongo','Trixie','Ula','Vaillant','Vincent','Washu','Yogui','max','55_mean','55_56_diff','window_variance']]
    return df

#individual_driven_windows_df = individual_driven_windows()
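
To make the counting step concrete, here is a minimal sketch on toy data. It assumes an individual-driven window is one where a single individual appears in all of the top N - 1 comparisons (55 of 55 for the 56 individuals here). The individuals `A` through `D` and their divergence scores are made up for illustration.

```python
import pandas as pd

# Toy data: 4 hypothetical individuals, so each appears in 3 comparisons per window
toy = pd.DataFrame({
    'ind1': ['A', 'A', 'A', 'B', 'B', 'C'],
    'ind2': ['B', 'C', 'D', 'C', 'D', 'D'],
    'divergence': [0.50, 0.45, 0.40, 0.05, 0.04, 0.03],
})

n_top = 3  # analogous to 55 for the 56 individuals in this dataset
top = toy.sort_values(by = 'divergence', ascending = False).head(n_top)

# Count appearances of each individual in either position of the top comparisons
top_counts = pd.concat([top['ind1'], top['ind2']]).value_counts()
print(top_counts)

# True when one individual is part of every top comparison
print(top_counts.max() == n_top)
```

Here `A` is in all three of the most divergent comparisons, which is the pattern the `max` column is meant to flag.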

In [None]:
#individual_driven_windows_df

Save dataframe.

In [None]:
#individual_driven_windows_df.to_csv('dataframes/individual_driven_windows_dataframe.txt', sep = '\t', header = True, index = False)

Load the dataframe.

In [None]:
individual_driven_windows_df = pd.read_csv('dataframes/individual_driven_windows_dataframe.txt', sep = '\t', header = 0)
individual_driven_windows_df.head(10)

In [None]:
individual_driven_windows_df[(individual_driven_windows_df['max'] >= 55) & (individual_driven_windows_df['55_mean'] >= 0.3)]

In [None]:
len(individual_driven_windows_df[(individual_driven_windows_df['max'] >= 55) & (individual_driven_windows_df['55_mean'] >= 0.3)])

In [None]:
len(individual_driven_windows_df[(individual_driven_windows_df['max'] >= 55)])

Let's assess how many of these comparisons are 'highly divergent' (divergence >= 0.3).

In [None]:
IDWs_list = [['chr11_20971520','Jimmie'],
        ['chr14_26738688','Luky'],
        ['chr1_72351744','Lara'],
        ['chr1_168820736','Berta'],
        ['chr1_169345024','Berta'],
        ['chr2A_76021760','Coco-chimp'],
        ['chr4_82837504','Frederike'],
        ['chr5_95420416','Desmond'],
        ['chr6_142606336','Bono'],
        ['chr7_105906176','Alice'],
        ['chr8_112197632','Athanga'],
        ['chr8_128974848','Damian']]

In [None]:
def retrieve_IDWs():
    df_rows = []
    for window, ind in IDWs_list:
        # Keep highly divergent (>= 0.3) comparisons in this window that involve the individual
        involves_ind = (comparisons['ind1'] == ind) | (comparisons['ind2'] == ind)
        match = comparisons[(comparisons['divergence'] >= 0.3) & (comparisons['window'] == window) & involves_ind]
        df_rows.append(match)
    IDWs = pd.concat(df_rows)
    return IDWs

In [None]:
IDWs_df = retrieve_IDWs()

In [None]:
IDWs_df.groupby(['window'])['divergence'].count().to_frame('N')

In [None]:
41+44+55+55+52+55+55+49+46+55+55+55 

In [None]:
617/5251
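
The same arithmetic can be written so that the counts and the proportion stay tied together. This is only a sketch: the per-window counts are copied from the sum above, and it assumes that 5251 is the total number of comparisons with divergence >= 0.3.

```python
# Number of comparisons with divergence >= 0.3 per IDW, copied from the sum above
idw_counts = [41, 44, 55, 55, 52, 55, 55, 49, 46, 55, 55, 55]
n_idw_comparisons = sum(idw_counts)

# Denominator used in the cell above (assumed to be all highly divergent comparisons)
n_high_divergence = 5251

idw_fraction = n_idw_comparisons / n_high_divergence
print(n_idw_comparisons)
print(round(idw_fraction, 4))
```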

Now let's estimate what share of the highly divergent comparisons comes from IDWs in each 0.02 divergence score bin. We only need the raw data to plot using ggplot, so let's start with that. We need a 'type' column with two levels (IDW and all) to distinguish the two.

In [None]:
high_divergence_dist = comparisons[comparisons['divergence'] >= 0.3]['divergence'].to_frame('divergence')
high_divergence_dist['type'] = 'all'
high_divergence_dist = high_divergence_dist[['type','divergence']]
high_divergence_dist.head()

Now the IDWs. We can use the function from before after updating our IDWs_list.

In [None]:
IDWs_list = [['chr11_20971520','Jimmie'],
        ['chr14_26738688','Luky'],
        ['chr1_72351744','Lara'],
        ['chr1_168820736','Berta'],
        ['chr1_169345024','Berta'],
        ['chr2A_76021760','Coco-chimp'],
        ['chr4_82837504','Frederike'],
        ['chr5_95420416','Desmond'],
        ['chr6_142606336','Bono'],
        ['chr7_105906176','Alice'],
        ['chr8_112197632','Athanga'],
        ['chr8_128974848','Damian']]

In [None]:
IDWs_df = retrieve_IDWs()

Check that everything made it through filtering.

In [None]:
len(IDWs_df)

In [None]:
IDWs_dist = IDWs_df[IDWs_df['divergence'] >= 0.3]['divergence'].to_frame('divergence')
IDWs_dist['type'] = 'IDW'
IDWs_dist = IDWs_dist[['type','divergence']]
IDWs_dist.head()

In [None]:
len(IDWs_dist)

Concatenate the two dataframes.

In [None]:
high_divergence_IDWs_dist = pd.concat([high_divergence_dist,IDWs_dist], axis = 0)
high_divergence_IDWs_dist

Save dataframe for plotting.

In [None]:
high_divergence_IDWs_dist.to_csv('dataframes/high_divergence_IDWs_dist.txt', sep = '\t', header = True, index = False)
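
The binning itself happens at plotting time, but a rough pandas equivalent may help when checking the numbers. The sketch below uses toy values rather than the real data, and the bin range is an assumption chosen to fit those values.

```python
import numpy as np
import pandas as pd

# Toy long-format data mirroring the 'type' and 'divergence' columns saved above;
# as above, the 'all' rows include the IDW comparisons
toy = pd.DataFrame({
    'type': ['all'] * 6 + ['IDW'] * 2,
    'divergence': [0.31, 0.31, 0.33, 0.35, 0.35, 0.39, 0.31, 0.35],
})

# Five 0.02-wide bins covering [0.30, 0.40); real data would need a wider range
edges = np.linspace(0.30, 0.40, 6)
toy['bin'] = pd.cut(toy['divergence'], bins = edges, right = False, labels = False)

# Count comparisons per bin and type, then take the IDW share of each bin
counts = pd.crosstab(toy['bin'], toy['type'])
counts['IDW_fraction'] = counts['IDW'] / counts['all']
print(counts)
```

Bins with no comparisons at all simply do not appear in `counts`.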

## IDW In Silico Mutagenesis <a class = 'anchor' id = 'idwindividuallydrivenwindows'></a>

Let's read in the ISM IDW data. Note that 'pos' in any in silico mutagenesis analysis is noted in 1-based coordinates.  

In [None]:
in_silico_IDW = pd.read_csv('in_silico_mutagenesis/IDW_variants.txt', sep = '\t', header = 0)
in_silico_IDW.head(5)

In [None]:
len(in_silico_IDW)

Filter for variants with divergence score >= 0.01.

In [None]:
in_silico_IDW[in_silico_IDW['1-spearman'] >= 0.01]

Each of the IDWs has a variant private to that individual that induces a major change in chromatin contact compared to the chimpanzee reference. Do these fall within CREs or CTCF binding sites?

In [None]:
IDW_3d_modifying_variants = in_silico_IDW[in_silico_IDW['1-spearman'] >= 0.01]
IDW_3d_modifying_variants_BED = IDW_3d_modifying_variants[['chr','pos']]
IDW_3d_modifying_variants_BED = IDW_3d_modifying_variants_BED.rename(columns={'pos': 'end'})
IDW_3d_modifying_variants_BED['start'] = IDW_3d_modifying_variants_BED['end']-1
IDW_3d_modifying_variants_BED = IDW_3d_modifying_variants_BED[['chr','start','end']]
IDW_3d_modifying_variants_BED = pybedtools.BedTool().from_dataframe(IDW_3d_modifying_variants_BED).sort()
IDW_3d_modifying_variants_BED.head()
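
As a quick check of the coordinate conversion: a single base at 1-based position p corresponds to the 0-based, half-open BED interval [p - 1, p). A minimal sketch with made-up positions:

```python
import pandas as pd

# Hypothetical variants with 1-based positions
variants = pd.DataFrame({'chr': ['chr1', 'chr8'], 'pos': [100, 2500]})

# A single base at 1-based position p is the BED interval [p - 1, p)
bed = pd.DataFrame({
    'chr': variants['chr'],
    'start': variants['pos'] - 1,
    'end': variants['pos'],
})
print(bed)
```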

In [None]:
CTCF = pybedtools.BedTool('annotations/panTro6_CTCF.bed')
enhancers = pybedtools.BedTool('annotations/panTro6_enhancers.bed')
promoters = pybedtools.BedTool('annotations/panTro6_promoters.bed')

In [None]:
IDW_variants_CTCF_intersect = IDW_3d_modifying_variants_BED.intersect(CTCF, u = True)
IDW_variants_enhancers_intersect = IDW_3d_modifying_variants_BED.intersect(enhancers, u = True)
IDW_variants_promoters_intersect = IDW_3d_modifying_variants_BED.intersect(promoters, u = True)
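
With `u = True`, each variant is reported at most once if it overlaps any annotation interval. For intuition only, here is a pandas analogue on toy intervals. It is not how bedtools works internally, and all coordinates are invented.

```python
import pandas as pd

# Hypothetical single-base variants and annotation intervals (0-based, half-open)
variants = pd.DataFrame({'chr': ['chr1', 'chr1', 'chr2'],
                         'start': [99, 499, 99],
                         'end': [100, 500, 100]})
annotation = pd.DataFrame({'chr': ['chr1', 'chr1'],
                           'start': [50, 90],
                           'end': [150, 120]})

def overlaps_any(variant, intervals):
    # Two half-open intervals overlap when each starts before the other ends
    same_chr = intervals[intervals['chr'] == variant['chr']]
    hits = (same_chr['start'] < variant['end']) & (same_chr['end'] > variant['start'])
    return bool(hits.any())

# Keep each variant at most once, no matter how many intervals it overlaps
unique_hits = variants[variants.apply(overlaps_any, axis = 1, intervals = annotation)]
print(unique_hits)
```

The first variant overlaps both toy intervals but still appears only once in `unique_hits`.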

In [None]:
len(IDW_variants_CTCF_intersect)

In [None]:
IDW_variants_CTCF_intersect.head()

In [None]:
len(IDW_variants_enhancers_intersect)

In [None]:
len(IDW_variants_promoters_intersect)

Now let's check these variants. How divergent are the predictions with the 3d modifying variant compared to the individual with the IDW? Let's create and export a dataframe.

In [None]:
# Copy the subset so that adding a column below does not trigger a SettingWithCopyWarning
IDW_3d_modifying_variants = IDW_3d_modifying_variants[['chr','pos','ref','alt','window']].copy()
IDW_3d_modifying_variants['individual'] = ['Lara','Berta','Berta','Jimmie','Luky','Coco-chimp','Frederike','Desmond','Bono','Alice','Athanga','Damian']
IDW_3d_modifying_variants.head(12)

In [None]:
IDW_3d_modifying_variants.to_csv('IDWs/IDW_3d_modifying_variants.txt', sep = '\t', header = False, index = False)

Run the IDW_3d_modifying_variant_prediction_and_comparison script and read in the results.

## Compare Cell Types <a class = 'anchor' id = 'comparecelltypes'></a>

Are the cell type specific predictions variable across cell types for the reference sequence?

In [None]:
#reference_comparisons_header = ['cell_type_1','cell_type_2','chr','window_start','mse','spearman']
#reference_comparisons = pd.read_csv('comparisons/reference/all_reference_comparisons.txt', sep = '\t', names = reference_comparisons_header)
#reference_comparisons['window'] = reference_comparisons['chr'] + '_' + reference_comparisons['window_start'].astype(str)
#reference_comparisons = reference_comparisons[['cell_type_1','cell_type_2','chr','window_start','window','mse','spearman']]
#reference_comparisons.head()

In [None]:
#excluded_header = ['chr','start','end','N_missing']
#excluded = pd.read_csv('metadata/panTro6_excluded_windows.bed', sep = '\t', header = None, names = excluded_header)
#excluded['window'] = excluded['chr'] + '_' + excluded['start'].astype(str)
#excluded_windows = excluded['window'].tolist()

In [None]:
#reference_comparisons = reference_comparisons[~(reference_comparisons['window'].isin(excluded_windows))]
#reference_comparisons.head()

Let's save this dataframe for plotting.

In [None]:
#reference_comparisons.to_csv('dataframes/reference_cell_type_comparisons.txt', sep = '\t', header = True, index = False)

In [None]:
#len(reference_comparisons)

In [None]:
#reference_comparisons.groupby(['cell_type_1','cell_type_2'])['spearman'].mean().to_frame('Mean Rho')

Most of the maps for the different cell types are nearly identical when predicting the reference sequence.

Let's compare sample predictions from HFF to GM12878. Load the data and run a correlation.

In [None]:
#GM12878_comparisons = pd.read_csv('dataframes/GM12878_comparisons.txt', sep = '\t', header = 0)
#GM12878_comparisons.head(5)

In [None]:
#rho, p = spearmanr(comparisons['divergence'], GM12878_comparisons['divergence'])
#print(rho, p)

In [None]:
#HFF_divergence = comparisons[['ind1','ind2','window','divergence']].copy()
#HFF_divergence['cell_type'] = 'HFF'
#GM12878_divergence = GM12878_comparisons[['ind1','ind2','window','divergence']].copy()
#GM12878_divergence['cell_type'] = 'GM12878'
#cell_type_correlation = HFF_divergence.merge(GM12878_divergence, how = 'outer', on = ['ind1','ind2','window'])
#cell_type_correlation = cell_type_correlation[['ind1','ind2','window','cell_type_x','divergence_x','cell_type_y','divergence_y']]
#cell_type_correlation.head()
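
One caveat: correlating columns taken from two separate dataframes assumes both are in the same row order. A minimal sketch of aligning on the dyad and window keys first, using toy values and a hypothetical window name:

```python
import pandas as pd
from scipy.stats import spearmanr

# Toy divergence scores for the same three dyads, stored in different row orders
hff = pd.DataFrame({'ind1': ['A', 'A', 'B'], 'ind2': ['B', 'C', 'C'],
                    'window': ['chr1_0'] * 3, 'divergence': [0.10, 0.20, 0.30]})
gm12878 = pd.DataFrame({'ind1': ['B', 'A', 'A'], 'ind2': ['C', 'B', 'C'],
                        'window': ['chr1_0'] * 3, 'divergence': [0.35, 0.15, 0.25]})

# Align rows on the shared keys before correlating
merged = hff.merge(gm12878, how = 'inner', on = ['ind1', 'ind2', 'window'],
                   suffixes = ('_HFF', '_GM12878'))
rho, p = spearmanr(merged['divergence_HFF'], merged['divergence_GM12878'])
print(rho, p)
```

An inner merge drops dyads missing from either cell type, which is usually what a correlation needs; the outer merge above keeps them with missing values instead.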

In [None]:
#len(cell_type_correlation)

Save this dataframe for plotting.

In [None]:
#cell_type_correlation.to_csv('dataframes/cell_type_correlation.txt', sep = '\t', header = True, index = False)

After plotting, there is a noticeable cluster of windows with low HFF divergence but high GM12878 divergence. Let's take a look.

In [None]:
#high_GM12878_windows = cell_type_correlation[(cell_type_correlation['divergence_x'] < 0.075) & (cell_type_correlation['divergence_y'] > 0.1)]
#high_GM12878_windows.head(200)

In [None]:
#len(high_GM12878_windows)

In [None]:
#high_GM12878_windows.groupby('window')['divergence_y'].count().to_frame()

Two windows stick out in particular. Are these maps very different in the reference sequence?

In [None]:
#reference_comparisons[(reference_comparisons['cell_type_1'] == 'GM12878') & (reference_comparisons['cell_type_2'] == 'HFF') & (reference_comparisons['window'] == 'chr4_94371840')]

In [None]:
#reference_comparisons[(reference_comparisons['cell_type_1'] == 'GM12878') & (reference_comparisons['cell_type_2'] == 'HFF') & (reference_comparisons['window'] == 'chr11_23068672')]

Double and quadruple windows:

- chr2A_21495808: Bono, Dzeeta, Kombote, Kumbuka
- chr2A_22020096: Bono, Dzeeta, Kombote, Kumbuka
- chr10_52428800: Julie-A959, Vaillant
- chr13_49807360: Lara, Ikuru


## TADs

In [None]:
TADs_header = ['chr','start','end','score','C3624_overlap','C3649_overlap','C3651_overlap','C40300_overlap']
TADs = pd.read_csv('annotations/panTro6_TADs.bed', sep = '\t', header = None, names = TADs_header)
TADs.head(5)

In [None]:
len(TADs)

In [None]:
TADs.groupby(['chr'])['score'].count().to_frame('N')

In [None]:
TAD_lengths = TADs['end'] - TADs['start']
TAD_lengths

In [None]:
TAD_lengths.mean()

In [None]:
TAD_lengths.median()

In [None]:
TAD_lengths.mode()

In [None]:
import seaborn as sns

In [None]:
# 'bw' was deprecated in seaborn 0.11; 'bw_method' is its replacement
sns.kdeplot(np.array(TAD_lengths), bw_method=0.5)

Let's make a pybedtools BED file of our TADs.

In [None]:
TADs_BED = TADs[['chr','start','end']]
TADs_pbtBED = pybedtools.BedTool().from_dataframe(TADs_BED).sort()
TADs_pbtBED.head()

In [None]:
len(TADs_pbtBED)

Save this dataframe.

In [None]:
TADs_BED.to_csv('annotations/panTro6_TAD_coordinates.bed', sep = '\t', header = False, index = False)

In [None]:
comparisons[(comparisons['window'] == 'chr2A_74973184') & (comparisons['ind1'] == 'Cleo') & (comparisons['ind2'] == 'Natalie')]

Do any of these windows occur on chromosome 7? The Eres et al. 2019 TAD dataset did not include any TADs from chromosome 7.

In [None]:
ppn_pt_chr7_clustering_windows = ppn_pt_clustering_windows_BED[ppn_pt_clustering_windows_BED['chr'] == 'chr7']
ppn_pt_chr7_clustering_windows

Let's take a look at these windows and assess how many non-ppn-pt dyads have a higher divergence than the ppn-pt dyad with the lowest divergence per window. In other words, how well do the ppn-pt and non-ppn-pt distributions separate? We will not consider within-bonobo variation, so ppn dyads are dropped first.

In [None]:
def ppn_pt_dyad_separation_autosomes():
    
    counts_list = []
    
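    for window in ppn_pt_clustering_autosomal_windows['window']:
        # Rank this window's dyads from most to least divergent, ignoring within-bonobo (ppn) dyads
        subset = comparisons[comparisons['window'] == window].sort_values(by = 'divergence', ascending = False).reset_index(drop = True)
        subset = subset[subset['dyad_type'] != 'ppn'].reset_index(drop = True)
        # Position of the 423rd ppn-pt dyad, the least divergent one (assumes exactly 423 ppn-pt dyads per autosomal window)
        index = subset.index[subset['dyad_type'] == 'ppn-pt'][422]
        # Dyads ranked at or above that position, minus the 423 ppn-pt dyads themselves
        counts_list.append((index + 1) - 423)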

    df = pd.DataFrame()
    df['window'] = ppn_pt_clustering_autosomal_windows['window']
    df['n_non_ppn_pt_dyads'] = counts_list
    return df

ppn_pt_dyad_separation_autosomes_df = ppn_pt_dyad_separation_autosomes()

In [None]:
def ppn_pt_dyad_separation_chrX():
    
    counts_list = []
    
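    for window in ppn_pt_clustering_chrX_windows['window']:
        # Rank this window's dyads from most to least divergent, ignoring within-bonobo (ppn) dyads
        subset = comparisons[comparisons['window'] == window].sort_values(by = 'divergence', ascending = False).reset_index(drop = True)
        subset = subset[subset['dyad_type'] != 'ppn'].reset_index(drop = True)
        # Position of the 203rd ppn-pt dyad, the least divergent one (assumes exactly 203 ppn-pt dyads per chrX window)
        index = subset.index[subset['dyad_type'] == 'ppn-pt'][202]
        # Dyads ranked at or above that position, minus the 203 ppn-pt dyads themselves
        counts_list.append((index + 1) - 203)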

    df = pd.DataFrame()
    df['window'] = ppn_pt_clustering_chrX_windows['window']
    df['n_non_ppn_pt_dyads'] = counts_list
    return df

ppn_pt_dyad_separation_chrX_df = ppn_pt_dyad_separation_chrX()
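
The two functions above differ only in the hard-coded number of ppn-pt dyads. As a sketch of an alternative (not the approach used above), the same count can be computed without that number by comparing against the minimum ppn-pt divergence directly. The helper name and the `'pt'` dyad label are hypothetical.

```python
import pandas as pd

def count_overlapping_dyads(window_df, target = 'ppn-pt', ignored = ('ppn',)):
    # Drop ignored dyad types, then find the lowest-divergence target dyad
    kept = window_df[~window_df['dyad_type'].isin(ignored)]
    is_target = kept['dyad_type'] == target
    target_min = kept.loc[is_target, 'divergence'].min()
    # Count non-target dyads that are more divergent than that minimum
    return int((kept.loc[~is_target, 'divergence'] > target_min).sum())

# Toy window; 'pt' is a hypothetical label for the non-ppn-pt dyads
toy = pd.DataFrame({
    'dyad_type': ['ppn-pt', 'ppn-pt', 'pt', 'ppn-pt', 'pt', 'ppn'],
    'divergence': [0.5, 0.4, 0.3, 0.2, 0.1, 0.9],
})
n_overlapping = count_overlapping_dyads(toy)
print(n_overlapping)
```

When divergence scores are tied, this strict comparison and the rank-based version above can disagree, because the rank depends on how the sort orders the tied rows.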

Merge these counts so that we can plot them in another notebook.

In [None]:
ppn_pt_dyad_separation_counts = pd.concat([ppn_pt_dyad_separation_autosomes_df, ppn_pt_dyad_separation_chrX_df])
ppn_pt_dyad_separation_counts.head(5)

In [None]:
len(ppn_pt_dyad_separation_counts)

In [None]:
ppn_pt_dyad_separation_counts.to_csv('dataframes/ppn_pt_dyad_separation_counts.txt', sep = '\t', header = False, index = False)

Now for the fully divergent subset. Let's write and apply a function.

In [None]:
def get_divergent_windows(target, ignored):
    subset = comparisons[~comparisons['dyad_type'].isin(ignored)]
    others_maxes = subset[~subset['dyad_type'].isin(target)].groupby(['window'])['divergence'].max().to_frame('max').reset_index()
    target_mins = subset[subset['dyad_type'].isin(target)].groupby(['window'])['divergence'].min().to_frame('min').reset_index()
    all_windows = pd.merge(others_maxes, target_mins, on = 'window')
    divergent_windows = all_windows[all_windows['min'] > all_windows['max']]
    
    return divergent_windows
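
To see what counts as fully divergent, here is the same logic applied to a toy dataframe. The only change is that the dataframe is passed in as an argument; the function name, window names, and the `'pt'` dyad label are hypothetical.

```python
import pandas as pd

def get_divergent_windows_from(df, target, ignored):
    # Same logic as above, but the dataframe is passed in explicitly
    subset = df[~df['dyad_type'].isin(ignored)]
    others_maxes = subset[~subset['dyad_type'].isin(target)].groupby('window')['divergence'].max().to_frame('max').reset_index()
    target_mins = subset[subset['dyad_type'].isin(target)].groupby('window')['divergence'].min().to_frame('min').reset_index()
    all_windows = pd.merge(others_maxes, target_mins, on = 'window')
    # A window qualifies when every target dyad exceeds every other dyad
    return all_windows[all_windows['min'] > all_windows['max']]

toy = pd.DataFrame({
    'window': ['w1', 'w1', 'w1', 'w1', 'w2', 'w2', 'w2'],
    'dyad_type': ['ppn-pt', 'ppn-pt', 'pt', 'ppn', 'ppn-pt', 'ppn-pt', 'pt'],
    'divergence': [0.5, 0.4, 0.3, 0.9, 0.2, 0.6, 0.3],
})
divergent = get_divergent_windows_from(toy, ['ppn-pt'], ['ppn'])
print(divergent)
```

Only `w1` is returned: its lowest ppn-pt score (0.4) exceeds its highest remaining score (0.3), while in `w2` one ppn-pt dyad (0.2) falls below the other dyad type.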

In [None]:
ppn_pt_divergent_windows = get_divergent_windows(['ppn-pt'], ['ppn'])
ppn_pt_divergent_windows.head(65)

In [None]:
len(ppn_pt_divergent_windows)

Make a quick list of these windows and a dictionary that maps each window to its maximum non-ppn-pt divergence (the 'max' column). We'll need them later for identifying 3d-modifying variants.

In [None]:
ppn_pt_divergent_windows_list = ppn_pt_divergent_windows['window'].tolist()
ppn_pt_divergent_windows_chimp_maxes = pd.Series(ppn_pt_divergent_windows['max'].values, index = ppn_pt_divergent_windows['window']).to_dict()

Save the divergent windows as a BED file.

In [None]:
ppn_pt_divergent_windows_BED = pd.DataFrame()
ppn_pt_divergent_windows_BED['window_split'] = ppn_pt_divergent_windows['window']
ppn_pt_divergent_windows_BED = ppn_pt_divergent_windows_BED['window_split'].str.split('_', expand=True)
ppn_pt_divergent_windows_BED.rename(columns = {0:'chr', 1:'window_start'}, inplace = True)
ppn_pt_divergent_windows_BED['window_start'] = ppn_pt_divergent_windows_BED['window_start'].astype(int)
ppn_pt_divergent_windows_BED['window_end'] = ppn_pt_divergent_windows_BED['window_start'] + 1048576
ppn_pt_divergent_windows_BED = ppn_pt_divergent_windows_BED.sort_values(by = 'chr')
ppn_pt_divergent_windows_BED.head(5)
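
Since the same window-to-BED conversion recurs in this notebook, it could be wrapped in a small helper. This is a sketch only: the function name is hypothetical, and it additionally sorts by start coordinate within each chromosome, which the cell above does not do.

```python
import pandas as pd

WINDOW_SIZE = 1048576  # window length used throughout this notebook

def windows_to_bed(windows):
    # Split on the last underscore so only the trailing start coordinate is separated
    parts = pd.Series(list(windows), dtype = 'object').str.rsplit('_', n = 1, expand = True)
    bed = pd.DataFrame({'chr': parts[0], 'window_start': parts[1].astype(int)})
    bed['window_end'] = bed['window_start'] + WINDOW_SIZE
    # Sort by chromosome name, then by start coordinate within each chromosome
    return bed.sort_values(by = ['chr', 'window_start']).reset_index(drop = True)

example_bed = windows_to_bed(['chr8_128974848', 'chr2A_76021760', 'chr2A_21495808'])
print(example_bed)
```

Chromosome names sort as strings here, so `chr10` comes before `chr2A`; that matches a plain `sort_values(by = 'chr')` but not a natural chromosome order.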

In [None]:
ppn_pt_divergent_windows_BED.to_csv('divergent_windows/ppn_pt_divergent_windows.bed', sep = '\t', header = False, index = False)

Check for chromosome 7 windows again.

In [None]:
ppn_pt_chr7_divergent_windows_BED = ppn_pt_divergent_windows_BED[ppn_pt_divergent_windows_BED['chr'] == 'chr7']
ppn_pt_chr7_divergent_windows_BED

In [None]:
ppn_pt_divergent_windows['window'].isin(ppn_pt_clustering_windows['window'])

In [None]:
ppn_pt_divergent_windows.loc[3518]

In [None]:
autosomal_complete_linkage_trees[autosomal_complete_linkage_trees['window'] == 'chr6_46137344']

In [None]:
windows = comparisons[['chr','window_start']]
windows = windows.drop_duplicates()
windows['window_end'] = windows['window_start'] + 1048576
windows.head(5)

In [None]:
windows_pbtBED = pybedtools.BedTool().from_dataframe(windows).sort()
windows_pbtBED.head()

In [None]:
genes_pbtBED = pybedtools.BedTool('annotations/panTro6_genes.bed')
genes_pbtBED.head(5)

In [None]:
windows_genes_intersection = windows_pbtBED.intersect(genes_pbtBED, c = True).to_dataframe(names=['window_chr','window_start','window_end','count'])
windows_genes_intersection.head(5)

In [None]:
windows_genes_intersection.groupby(['count']).size().to_frame('N')

In [None]:
ppn_pt_clustering_windows_pbtBED = pybedtools.BedTool().from_dataframe(ppn_pt_clustering_windows_BED).sort()
ppn_pt_clustering_windows_pbtBED.head()

In [None]:
ppn_pt_clustering_windows_genes_intersection.groupby('count').count()

In [None]:
#ppn_pt_clustering_windows_genes_intersection = ppn_pt_clustering_windows_pbtBED.intersect(genes_pbtBED, wa = True, wb = True).to_dataframe(names=['window_chr','window_start','window_end','gene_chr','gene_start','gene_end','transcript','gene'])
#ppn_pt_clustering_windows_genes_intersection.head(5)

In [None]:
ppn_pt_clustering_windows_with_genes = ppn_pt_clustering_windows_genes_intersection[['window_chr','window_start']]
ppn_pt_clustering_windows_with_genes = ppn_pt_clustering_windows_with_genes.drop_duplicates()

In [None]:
len(ppn_pt_clustering_windows_with_genes)