# Analysis

This notebook contains code to analyze putative splice-altering variants (SAVs) in four archaic hominin genomes. The initial code picks up where the previous notebook left off. To run the analyses below using the full dataframe, load the libraries, navigate to your desired directory, and load the full dataframe starting [here](#startwithdataframe).

I have organized the analyses that follow to largely reflect the order in which they appear in the manuscript. Feel free to navigate to a specific section using the Table of Contents below.

# Table of Contents
- [Load Dataframe](#loaddataframe)
- [Load sQTLs](#loadsqtls)
- [Spliceosome Variants](#spliceosomevariants)
- [Data Description](#datadescription)
    - [Multiple Annotations](#multipleannotations)
    - [Multiple Alleles](#multiplealleles)
    - [Genotypes](#genotypes)
    - [Annotations](#annotations)
    - [Delta Thresholds](#deltathresholds)
    - [Variant Distribution](#variantdistribution)
    - [Delta Correlations](#deltacorrelations)
    - [Multiple Deltas](#multipledeltas)
    - [SAV Genotypes](#SAVgenotypes)
    - [SAV Allele Origin](#SAValleleorigin)
    - [Introgression Set Overlap](#introgressionsetoverlap)
    - [SAV Distribution](#SAVdistribution)
- [SAV Genes](#SAVEgenes)
    - [Top 20 AG, AL, DG, and DL](#top20)
    - [N Genes](#ngenes)
    - [N Genes Distribution](#ngenesdistribution)
    - [N Genes by Origin](#ngenesbyorigin)
    - [N Genes by Distribution](#ngenesbydistribution)
    - [Archaic-Specific Genes](#archaicspecificgenes)
    - [Gene Overlap](#geneoverlap)
- [Comparisons](#comparions)
    - [SAV Genes and DR genes](#drgenes)
    - [SAV Genes and circadian genes](#circadiangenes)
- [Gene Enrichment](#geneenrichment)
- [Gene Characteristics](#genecharacteristics)
    - [N Exons, CDS Length, and Gene Length](#physical)
    - [N Isoforms](#isoforms)
    - [Constraint and Conservation](#constraintconservation)
- [Gene Expression](#geneexpression)
    - [Gene-Level](#genelevel)
    - [sQTLs](#sqtls)
- [Purifying Selection](#purifyingselection)
    - [Varied Deltas](#varieddeltas)
    - [Lineage-Specific](#lineagespecific)
- [SAVs in Moderns](#SAVsinmoderns)
    - [Introgressed SAV Distribution](#introgressedSAVdistribution)
    - [Allele Frequency and Max Delta](#allelefrequencymaxdelta)
    - [Introgressed Genes](#introgressedgenes)
    - [ASE](#ASE)
- [Genes of Evolutionary Significance](#evolutionarysignificantgenes)
- [Average SAVs in 1KG](#avg1KGSAVs)
- [Introgressed Reference Alleles](#introgressedrefs)

Load libraries and change directories.

In [1]:
from pingouin import partial_corr
from math import sqrt
import numpy as np
import pandas as pd
import re
from scipy.stats import chi2
from scipy.stats import fisher_exact
from scipy.stats import kruskal
from scipy.stats import mannwhitneyu
from scipy.stats import spearmanr
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 110)

In [2]:
cd ../../data/dataframes

/wynton/group/capra/projects/archaic_splicing/data/dataframes


# Load Dataframe <a class = 'anchor' id = 'loaddataframe'></a>

In [3]:
data = pd.read_csv("archaic_data_with_constraint_moderns_introgression.txt", sep='\t', header=0)
data.head(10)



Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_allele_origin,Vernot_introgressed_AF,Browning_introgressed,Browning_allele_origin,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos
0,chr1,861808,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3479.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.31384,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-0.683,0.0,0.0,0.01,0.0,0.01,-29,-27,24,-20
1,chr1,861808,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3479.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.31384,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-0.683,0.0,0.0,0.0,0.0,0.0,48,-14,-46,-14
2,chr1,862072,C,T,C,derived,snv,1/1,1/1,0/0,0/1,True,True,False,True,Neanderthal,yes,9.0,5096.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,low-confidence ancient,,no,low-confidence ancient,,0.0,SAMD11,1.5082,-3.4361,0.89656,0.47484,0.197,0.0,0.0,0.0,0.0,0.0,3,28,-38,41
3,chr1,862072,C,T,C,derived,snv,1/1,1/1,0/0,0/1,True,True,False,True,Neanderthal,yes,9.0,5096.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,low-confidence ancient,,no,low-confidence ancient,,0.0,AL645608.1,1.2217,-0.64548,0.73515,0.49579,0.197,0.0,0.0,0.0,0.0,0.0,-6,-50,24,-48
4,chr1,862093,T,C,C,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3484.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.315789,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-1.042,0.0,0.0,0.0,0.0,0.0,7,14,-15,34
5,chr1,862093,T,C,C,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3484.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.315789,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-1.042,0.0,0.0,0.0,0.0,0.0,-34,48,49,3
6,chr1,862124,A,G,G,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3485.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.316764,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-3.75,0.0,0.0,0.0,0.0,0.0,-24,26,-11,3
7,chr1,862124,A,G,G,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3485.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.316764,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-3.75,0.0,0.0,0.0,0.0,0.0,17,-14,-28,35
8,chr1,862383,C,T,C,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3476.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.310916,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-1.932,0.0,0.0,0.0,0.0,0.0,22,-26,-21,29
9,chr1,862383,C,T,C,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3476.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.310916,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-1.932,0.0,0.0,0.0,0.0,0.0,-23,7,28,46


In [4]:
len(data)

1607350

# Load sQTLs <a class = 'anchor' id = 'loadsqtls'></a>

Next, add the GTEx sQTLs, keeping only those whose ref and alt alleles match the alleles in our data. First, load the sQTL dataframe.

In [5]:
sQTLs_header = ['chrom','start','pos','sQTL_ref_allele','sQTL_alt_allele','Adipose_Subcutaneous','Adipose_Visceral_Omentum','Adrenal_Gland','Artery_Aorta','Artery_Coronary','Artery_Tibial','Brain_Amygdala','Brain_Anterior_cingulate_cortex_BA24','Brain_Caudate_basal_ganglia','Brain_Cerebellar_Hemisphere','Brain_Cerebellum','Brain_Cortex','Brain_Frontal_Cortex_BA9','Brain_Hippocampus','Brain_Hypothalamus','Brain_Nucleus_accumbens_basal_ganglia','Brain_Putamen_basal_ganglia','Brain_Spinal_cord_cervical_c-1','Brain_Substantia_nigra','Breast_Mammary_Tissue','Cells_Cultured_fibroblasts','Cells_EBV-transformed_lymphocytes','Colon_Sigmoid','Colon_Transverse','Esophagus_Gastroesophageal_Junction','Esophagus_Mucosa','Esophagus_Muscularis','Heart_Atrial_Appendage','Heart_Left_Ventricle','Kidney_Cortex','Liver','Lung','Minor_Salivary_Gland','Muscle_Skeletal','Nerve_Tibial','Ovary','Pancreas','Pituitary','Prostate','Skin_Not_Sun_Exposed_Suprapubic','Skin_Sun_Exposed_Lower_leg','Small_Intestine_Terminal_Ileum','Spleen','Stomach','Testis','Thyroid','Uterus','Vagina','Whole_Blood']
sQTLs = pd.read_csv('../GTEx_sQTLs/sQTLs_hg19.txt', sep = '\t', names = sQTLs_header)
sQTLs.head(10)

Unnamed: 0,chrom,start,pos,sQTL_ref_allele,sQTL_alt_allele,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood
0,chr1,861807,861808,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,chr1,861807,861808,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,chr1,862092,862093,T,C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,chr1,862092,862093,T,C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,chr1,862123,862124,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
5,chr1,862123,862124,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6,chr1,862382,862383,C,T,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
7,chr1,862382,862383,C,T,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
8,chr1,862388,862389,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,chr1,862388,862389,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Add a column with the number of GTEx tissues in which each variant is an sQTL.

In [6]:
# The tissue indicator columns are everything after the five coordinate/allele columns in sQTLs_header.
tissue_columns = sQTLs_header[5:]
sQTLs['N_GTEx_tissues'] = sQTLs[tissue_columns].sum(axis=1)
sQTLs.head(10)

Unnamed: 0,chrom,start,pos,sQTL_ref_allele,sQTL_alt_allele,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues
0,chr1,861807,861808,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
1,chr1,861807,861808,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
2,chr1,862092,862093,T,C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
3,chr1,862092,862093,T,C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
4,chr1,862123,862124,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
5,chr1,862123,862124,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
6,chr1,862382,862383,C,T,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
7,chr1,862382,862383,C,T,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
8,chr1,862388,862389,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
9,chr1,862388,862389,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
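The per-variant tissue count above is a row-wise sum over 0/1 indicator columns. Here is a minimal sketch of the same idea on a toy table. The frame `toy_sqtls` and the columns `Tissue_A`, `Tissue_B`, and `N_tissues` are hypothetical stand-ins, not part of the GTEx data.

```python
import pandas as pd

# Toy sQTL table: two hypothetical tissue indicator columns (1.0 = sQTL in that tissue).
toy_sqtls = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr1'],
    'pos': [100, 200, 300],
    'Tissue_A': [0.0, 1.0, 1.0],
    'Tissue_B': [1.0, 1.0, 0.0],
})

# Treat every non-identifier column as a tissue indicator, then sum across each row.
id_columns = ['chrom', 'pos']
tissue_columns = [c for c in toy_sqtls.columns if c not in id_columns]
toy_sqtls['N_tissues'] = toy_sqtls[tissue_columns].sum(axis=1)

print(toy_sqtls)
```

Deriving the tissue column list from the table itself avoids retyping the 49 GTEx tissue names, at the cost of assuming every non-identifier column is an indicator.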


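Left-merge the sQTLs onto the data by chromosome and position, using an indicator column (`_merge`) to record whether each row found a match. Then drop the `start` column, which is no longer needed.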

In [7]:
data = pd.merge(data, sQTLs, on = ['chrom','pos'], how = 'left', indicator = True)
data = data.drop(columns = ['start'])
data.head(10)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_allele_origin,Vernot_introgressed_AF,Browning_introgressed,Browning_allele_origin,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,sQTL_ref_allele,sQTL_alt_allele,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,_merge
0,chr1,861808,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3479.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.31384,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-0.683,0.0,0.0,0.01,0.0,0.01,-29,-27,24,-20,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,both
1,chr1,861808,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3479.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.31384,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-0.683,0.0,0.0,0.01,0.0,0.01,-29,-27,24,-20,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,both
2,chr1,861808,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3479.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.31384,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-0.683,0.0,0.0,0.0,0.0,0.0,48,-14,-46,-14,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,both
3,chr1,861808,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3479.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.31384,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-0.683,0.0,0.0,0.0,0.0,0.0,48,-14,-46,-14,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,both
4,chr1,862072,C,T,C,derived,snv,1/1,1/1,0/0,0/1,True,True,False,True,Neanderthal,yes,9.0,5096.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,low-confidence ancient,,no,low-confidence ancient,,0.0,SAMD11,1.5082,-3.4361,0.89656,0.47484,0.197,0.0,0.0,0.0,0.0,0.0,3,28,-38,41,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,left_only
5,chr1,862072,C,T,C,derived,snv,1/1,1/1,0/0,0/1,True,True,False,True,Neanderthal,yes,9.0,5096.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,low-confidence ancient,,no,low-confidence ancient,,0.0,AL645608.1,1.2217,-0.64548,0.73515,0.49579,0.197,0.0,0.0,0.0,0.0,0.0,-6,-50,24,-48,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,left_only
6,chr1,862093,T,C,C,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3484.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.315789,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-1.042,0.0,0.0,0.0,0.0,0.0,7,14,-15,34,T,C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,both
7,chr1,862093,T,C,C,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3484.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.315789,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-1.042,0.0,0.0,0.0,0.0,0.0,7,14,-15,34,T,C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,both
8,chr1,862093,T,C,C,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3484.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.315789,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-1.042,0.0,0.0,0.0,0.0,0.0,-34,48,49,3,T,C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,both
9,chr1,862093,T,C,C,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3484.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.315789,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-1.042,0.0,0.0,0.0,0.0,0.0,-34,48,49,3,T,C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,both
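In the output above, each matched variant-annotation row appears twice, mirroring the repeated rows in the sQTL table preview. The sketch below shows, on toy data, how a left merge multiplies rows when the right-hand table repeats a key, and how that can be inspected. The frames `left` and `right` and the gene names are hypothetical, and whether to deduplicate depends on what the repeated sQTL rows represent.

```python
import pandas as pd

# Toy variant table: one variant annotated to two genes, plus a variant with no sQTL.
left = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr1'],
    'pos': [100, 100, 200],
    'annotation': ['GENE_A', 'GENE_B', 'GENE_A'],
})

# Toy sQTL table in which the same record appears twice.
right = pd.DataFrame({
    'chrom': ['chr1', 'chr1'],
    'pos': [100, 100],
    'sQTL_ref_allele': ['A', 'A'],
    'sQTL_alt_allele': ['G', 'G'],
})

# Count repeated merge keys on the right-hand side before merging.
n_duplicate_keys = right.duplicated(subset=['chrom', 'pos']).sum()

# Every left row at chr1:100 pairs with both right rows, so 3 rows become 5.
merged_raw = pd.merge(left, right, on=['chrom', 'pos'], how='left', indicator=True)

# After dropping exact duplicate rows, validate= raises an error if a key still repeats.
right_unique = right.drop_duplicates()
merged_unique = pd.merge(left, right_unique, on=['chrom', 'pos'], how='left',
                         indicator=True, validate='many_to_one')

print(n_duplicate_keys, len(merged_raw), len(merged_unique))
```

The `_merge` column takes the value `both` for rows that found an sQTL at the same position and `left_only` for rows that did not.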


Classify each variant as an sQTL only if its position and both its ref and alt alleles match an sQTL record.

In [8]:
def sQTL(row):
    # A variant counts as an sQTL only if it merged with an sQTL record
    # and both its ref and alt alleles match that record.
    if (row['_merge'] == 'both'
            and row['ref_allele'] == row['sQTL_ref_allele']
            and row['alt_allele'] == row['sQTL_alt_allele']):
        return 'yes'
    return 'no'

data['sQTL'] = data.apply(sQTL, axis = 1)

In [9]:
data.head(10)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_allele_origin,Vernot_introgressed_AF,Browning_introgressed,Browning_allele_origin,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,sQTL_ref_allele,sQTL_alt_allele,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,_merge,sQTL
0,chr1,861808,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3479.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.31384,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-0.683,0.0,0.0,0.01,0.0,0.01,-29,-27,24,-20,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,both,yes
1,chr1,861808,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3479.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.31384,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-0.683,0.0,0.0,0.01,0.0,0.01,-29,-27,24,-20,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,both,yes
2,chr1,861808,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3479.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.31384,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-0.683,0.0,0.0,0.0,0.0,0.0,48,-14,-46,-14,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,both,yes
3,chr1,861808,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3479.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.31384,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-0.683,0.0,0.0,0.0,0.0,0.0,48,-14,-46,-14,A,G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,both,yes
4,chr1,862072,C,T,C,derived,snv,1/1,1/1,0/0,0/1,True,True,False,True,Neanderthal,yes,9.0,5096.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,low-confidence ancient,,no,low-confidence ancient,,0.0,SAMD11,1.5082,-3.4361,0.89656,0.47484,0.197,0.0,0.0,0.0,0.0,0.0,3,28,-38,41,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,left_only,no
5,chr1,862072,C,T,C,derived,snv,1/1,1/1,0/0,0/1,True,True,False,True,Neanderthal,yes,9.0,5096.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,low-confidence ancient,,no,low-confidence ancient,,0.0,AL645608.1,1.2217,-0.64548,0.73515,0.49579,0.197,0.0,0.0,0.0,0.0,0.0,-6,-50,24,-48,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,left_only,no
6,chr1,862093,T,C,C,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3484.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.315789,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-1.042,0.0,0.0,0.0,0.0,0.0,7,14,-15,34,T,C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,both,yes
7,chr1,862093,T,C,C,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3484.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.315789,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-1.042,0.0,0.0,0.0,0.0,0.0,7,14,-15,34,T,C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,both,yes
8,chr1,862093,T,C,C,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3484.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.315789,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-1.042,0.0,0.0,0.0,0.0,0.0,-34,48,49,3,T,C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,both,yes
9,chr1,862093,T,C,C,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3484.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.315789,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-1.042,0.0,0.0,0.0,0.0,0.0,-34,48,49,3,T,C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,both,yes
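Applying a function row by row can be slow on a dataframe of this size. The sketch below shows one possible vectorized equivalent of the classification rule, using boolean masks and `np.where` on a toy frame. The frame `toy` and its values are hypothetical; the column names follow those used above.

```python
import numpy as np
import pandas as pd

# Toy merged table covering the cases: full match, no sQTL at the position,
# alt allele mismatch, and ref allele mismatch.
toy = pd.DataFrame({
    'ref_allele': ['A', 'C', 'T', 'G'],
    'alt_allele': ['G', 'T', 'C', 'A'],
    'sQTL_ref_allele': ['A', None, 'T', 'C'],
    'sQTL_alt_allele': ['G', None, 'G', 'A'],
    '_merge': ['both', 'left_only', 'both', 'both'],
})

# A row is an sQTL only when it merged and both alleles agree.
# Comparisons against missing values evaluate to False, so unmatched rows fall out.
is_match = (
    (toy['_merge'] == 'both')
    & (toy['ref_allele'] == toy['sQTL_ref_allele'])
    & (toy['alt_allele'] == toy['sQTL_alt_allele'])
)
toy['sQTL'] = np.where(is_match, 'yes', 'no')

print(toy[['ref_allele', 'alt_allele', '_merge', 'sQTL']])
```

Because the mask is computed on whole columns at once, this form avoids a Python-level call per row.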


Rows that matched an sQTL position but not its alleles still carry that sQTL's tissue values, so let's also set the GTEx columns to NaN for variants that are not sQTLs.

In [10]:
# Blank every tissue indicator column (all entries after the first five in sQTLs_header) plus the tissue count.
data.loc[data['sQTL'] == 'no', sQTLs_header[5:] + ['N_GTEx_tissues']] = np.nan

Check that this worked by counting variants per sQTL status and tissue count.

In [11]:
data.groupby(['sQTL','N_GTEx_tissues']).size().to_frame('count')

Unnamed: 0_level_0,Unnamed: 1_level_0,count
sQTL,N_GTEx_tissues,Unnamed: 2_level_1
yes,1.0,81517
yes,2.0,32436
yes,3.0,19095
yes,4.0,14973
yes,5.0,11297
yes,6.0,9878
yes,7.0,8839
yes,8.0,5992
yes,9.0,5718
yes,10.0,5254
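Only `yes` rows appear in the check above because `groupby` drops groups whose key is NaN by default, and every non-sQTL row now has a NaN tissue count. Here is a small sketch of the masking step and the check on toy data. The frame `toy` and the columns `Tissue_A`, `Tissue_B`, and `N_tissues` are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'sQTL': ['yes', 'no', 'yes'],
    'Tissue_A': [1.0, 0.0, 0.0],
    'Tissue_B': [0.0, 1.0, 1.0],
    'N_tissues': [1.0, 1.0, 1.0],
})

# Blank the tissue columns wherever the variant is not an sQTL.
columns_to_blank = ['Tissue_A', 'Tissue_B', 'N_tissues']
toy.loc[toy['sQTL'] == 'no', columns_to_blank] = np.nan

# Default groupby drops NaN keys, so 'no' rows vanish from the counts.
counts = toy.groupby(['sQTL', 'N_tissues']).size()

# dropna=False keeps the NaN group, making the 'no' rows visible.
counts_with_nan = toy.groupby(['sQTL', 'N_tissues'], dropna=False).size()

# Direct check: every blanked cell in the 'no' rows is NaN.
all_blank = bool(toy.loc[toy['sQTL'] == 'no', columns_to_blank].isna().all().all())

print(counts)
print(counts_with_nan)
print(all_blank)
```

Passing `dropna=False` is a convenient way to confirm how many rows were blanked without a separate query.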


Drop columns we no longer need.

In [12]:
data = data.drop(columns = ['sQTL_ref_allele','sQTL_alt_allele','_merge'])
data.head(10)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_allele_origin,Vernot_introgressed_AF,Browning_introgressed,Browning_allele_origin,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL
0,chr1,861808,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3479.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.31384,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-0.683,0.0,0.0,0.01,0.0,0.01,-29,-27,24,-20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,yes
1,chr1,861808,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3479.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.31384,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-0.683,0.0,0.0,0.01,0.0,0.01,-29,-27,24,-20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,yes
2,chr1,861808,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3479.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.31384,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-0.683,0.0,0.0,0.0,0.0,0.0,48,-14,-46,-14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,yes
3,chr1,861808,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3479.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.31384,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-0.683,0.0,0.0,0.0,0.0,0.0,48,-14,-46,-14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,yes
4,chr1,862072,C,T,C,derived,snv,1/1,1/1,0/0,0/1,True,True,False,True,Neanderthal,yes,9.0,5096.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,low-confidence ancient,,no,low-confidence ancient,,0.0,SAMD11,1.5082,-3.4361,0.89656,0.47484,0.197,0.0,0.0,0.0,0.0,0.0,3,28,-38,41,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no
5,chr1,862072,C,T,C,derived,snv,1/1,1/1,0/0,0/1,True,True,False,True,Neanderthal,yes,9.0,5096.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,low-confidence ancient,,no,low-confidence ancient,,0.0,AL645608.1,1.2217,-0.64548,0.73515,0.49579,0.197,0.0,0.0,0.0,0.0,0.0,-6,-50,24,-48,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no
6,chr1,862093,T,C,C,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3484.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.315789,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-1.042,0.0,0.0,0.0,0.0,0.0,7,14,-15,34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,yes
7,chr1,862093,T,C,C,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3484.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.315789,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-1.042,0.0,0.0,0.0,0.0,0.0,7,14,-15,34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,yes
8,chr1,862093,T,C,C,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3484.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.315789,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-1.042,0.0,0.0,0.0,0.0,0.0,-34,48,49,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,yes
9,chr1,862093,T,C,C,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3484.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.315789,no,,,,,,,,,,,,,,,ancient,,no,ancient,,0.68,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-1.042,0.0,0.0,0.0,0.0,0.0,-34,48,49,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,yes


The merge introduced many duplicate rows. Let's drop them to get our N back to where it should be: 1,607,350.

In [13]:
data = data.drop_duplicates(['chrom','pos','ref_allele','alt_allele','annotation'])
len(data)

1607350
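One common way a merge inflates the row count is repeated keys in the right-hand table. Whether that is what happened here is an assumption. The sketch below uses made-up toy tables (only the column names come from this notebook) to show the effect, the `drop_duplicates` fix, and the `validate` argument of `pd.merge`, which raises an error instead of silently duplicating rows.

```python
import pandas as pd

# Toy data; column names mirror the notebook, values are for illustration only.
variants = pd.DataFrame({
    'chrom': ['chr1', 'chr1'],
    'pos': [861808, 862072],
    'annotation': ['SAMD11', 'SAMD11'],
})
# Hypothetical sQTL table in which one position is listed twice.
sqtls = pd.DataFrame({
    'chrom': ['chr1', 'chr1'],
    'pos': [861808, 861808],
    'sQTL': ['yes', 'yes'],
})

# The left table has 2 rows, but the repeated key yields 3 rows after merging.
merged = pd.merge(variants, sqtls, on=['chrom', 'pos'], how='left')

# Keeping one row per key restores the original row count.
deduplicated = merged.drop_duplicates(['chrom', 'pos', 'annotation'])

# validate='many_to_one' requires unique keys in the right table.
try:
    pd.merge(variants, sqtls, on=['chrom', 'pos'], how='left', validate='many_to_one')
    merge_is_safe = True
except pd.errors.MergeError:
    merge_is_safe = False
```

Deduplicating the right-hand table on its merge keys before merging would avoid the inflation in the first place.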

Finally, although many sQTLs are also variable in 1KG, some are absent from 1KG and were therefore labeled 'archaic-specific'. Since these variants were identified as sQTLs, let's reclassify them from 'archaic-specific' to 'low-confidence ancient' for both Vernot and Browning.

In [14]:
def new_Vernot_allele_origin(row):
    # Labels other than 'archaic-specific' are returned unchanged.
    if row['Vernot_allele_origin'] == 'ancient':
        return 'ancient'
    elif row['Vernot_allele_origin'] == 'introgressed':
        return 'introgressed'
    elif row['Vernot_allele_origin'] == 'low-confidence ancient':
        return 'low-confidence ancient'
    # 'archaic-specific' variants that are sQTLs are relabeled.
    elif row['Vernot_allele_origin'] == 'archaic-specific' and row['sQTL'] == 'yes':
        return 'low-confidence ancient'
    elif row['Vernot_allele_origin'] == 'archaic-specific' and row['sQTL'] == 'no':
        return 'archaic-specific'
    # Any other combination (e.g., a missing value) falls through and returns None.

data['new_Vernot_allele_origin'] = data.apply(new_Vernot_allele_origin, axis = 1)

In [15]:
def new_Browning_allele_origin(row):
    # Labels other than 'archaic-specific' are returned unchanged.
    if row['Browning_allele_origin'] == 'ancient':
        return 'ancient'
    elif row['Browning_allele_origin'] == 'introgressed':
        return 'introgressed'
    elif row['Browning_allele_origin'] == 'low-confidence ancient':
        return 'low-confidence ancient'
    # 'archaic-specific' variants that are sQTLs are relabeled.
    elif row['Browning_allele_origin'] == 'archaic-specific' and row['sQTL'] == 'yes':
        return 'low-confidence ancient'
    elif row['Browning_allele_origin'] == 'archaic-specific' and row['sQTL'] == 'no':
        return 'archaic-specific'
    # Any other combination (e.g., a missing value) falls through and returns None.

data['new_Browning_allele_origin'] = data.apply(new_Browning_allele_origin, axis = 1)
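Row-wise `apply` can be slow on a table of this size. As a possible alternative, the sketch below does the same relabeling with a boolean mask and `Series.mask`. The helper name `reclassify_allele_origin` and the toy values are hypothetical. Note one behavioral difference: this version leaves every other value untouched, whereas the row-wise functions return `None` for any combination they do not list.

```python
import pandas as pd

def reclassify_allele_origin(df, origin_column):
    """Return origin_column with 'archaic-specific' sQTLs relabeled."""
    is_archaic_sqtl = (df[origin_column] == 'archaic-specific') & (df['sQTL'] == 'yes')
    # Series.mask replaces values where the condition is True and keeps the rest.
    return df[origin_column].mask(is_archaic_sqtl, 'low-confidence ancient')

# Toy data for illustration only.
toy = pd.DataFrame({
    'Vernot_allele_origin': ['ancient', 'archaic-specific', 'archaic-specific', 'introgressed'],
    'sQTL': ['no', 'yes', 'no', 'no'],
})
toy['new_Vernot_allele_origin'] = reclassify_allele_origin(toy, 'Vernot_allele_origin')
```

The same helper could be called with `'Browning_allele_origin'`, which would avoid maintaining two near-identical functions.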

Drop the old columns.

In [16]:
data = data.drop(columns = ['Vernot_allele_origin'])
data.rename(columns={'new_Vernot_allele_origin': 'Vernot_allele_origin'}, inplace=True)

In [17]:
data = data.drop(columns = ['Browning_allele_origin'])
data.rename(columns={'new_Browning_allele_origin': 'Browning_allele_origin'}, inplace=True)

Save this dataframe.

In [18]:
data.to_csv('archaic_data_with_constraint_moderns_introgression_sQTLs.txt', sep = '\t', header = True, index = False)

Start here with the full dataframe. Be sure to load it first! <a class = 'anchor' id = 'startwithdataframe'></a>
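A minimal sketch of reloading a table written the way the save step above writes it (tab-separated, with a header row and no index column). The toy dataframe and the temporary path are placeholders so the example runs on its own. With the real file, the path would be wherever `archaic_data_with_constraint_moderns_introgression_sQTLs.txt` was saved.

```python
import os
import tempfile

import pandas as pd

# Placeholder data standing in for the full dataframe.
toy = pd.DataFrame({'chrom': ['chr1'], 'pos': [861808], 'sQTL': ['yes']})

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'toy_dataframe.txt')
    # Same arguments as the save step in this notebook.
    toy.to_csv(path, sep='\t', header=True, index=False)
    # Read it back with the matching separator and header row.
    reloaded = pd.read_csv(path, sep='\t', header=0)
```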

# Spliceosome Variants <a class = 'anchor' id = 'spliceosomevariants'></a>

Let's take a quick detour to examine variants in components of the major spliceosome. We'll start with the genes listed in the corresponding HGNC gene group: https://www.genenames.org/data/genegroup/#!/group/1518.

In [19]:
spliceosome = pd.read_csv("../annotations/major_spliceosome_genes.txt", sep='\t', header=0)
spliceosome.head(10)

Unnamed: 0,HGNC_ID,Approved_symbol,Approved_name,Status,Locus_type,Previous_symbols,Alias_symbols,Chromosome,NCBI_Gene_ID,Ensembl_gene_ID,Vega_gene_ID,Group_ID,Group_name
0,HGNC:17040,CASC3,CASC3 exon junction complex subunit,Approved,gene with protein product,,"MLN51, BTZ",17q21.1,22794,ENSG00000108349,OTTHUMG00000133323,1238,Exon junction complex
1,HGNC:18683,EIF4A3,eukaryotic translation initiation factor 4A3,Approved,gene with protein product,DDX48,"KIAA0111, EIF4AIII, Fal1",17q25.3,9775,ENSG00000141543,OTTHUMG00000177538,1238,Exon junction complex
2,HGNC:6815,MAGOH,"mago homolog, exon junction complex subunit",Approved,gene with protein product,,"MAGOHA, MAGOH1",1p32.3,4116,ENSG00000162385,OTTHUMG00000008932,1238,Exon junction complex
3,HGNC:25504,MAGOHB,"mago homolog B, exon junction complex subunit",Approved,gene with protein product,,"FLJ10292, MGN2",12p13.2,55110,ENSG00000111196,OTTHUMG00000168407,1238,Exon junction complex
4,HGNC:9905,RBM8A,RNA binding motif protein 8A,Approved,gene with protein product,RBM8,"ZNRP, BOV-1A, BOV-1B, BOV-1C, RBM8B, Y14",1q21.1,9939,ENSG00000265241,OTTHUMG00000013736,1238,Exon junction complex
5,HGNC:20472,LSM1,"LSM1 homolog, mRNA degradation associated",Approved,gene with protein product,,"CASM, YJL124C",8p11.23,27257,ENSG00000175324,OTTHUMG00000164051,1505,LSm proteins
6,HGNC:17562,LSM10,"LSM10, U7 small nuclear RNA associated",Approved,gene with protein product,,MGC15749,1p34.3,84967,ENSG00000181817,OTTHUMG00000008140,1505,LSm proteins
7,HGNC:30860,LSM11,"LSM11, U7 small nuclear RNA associated",Approved,gene with protein product,,FLJ38273,5q33.3,134353,ENSG00000155858,OTTHUMG00000130255,1505,LSm proteins
8,HGNC:26407,LSM12,LSM12 homolog,Approved,gene with protein product,,FLJ30656,17q21.31,124801,ENSG00000161654,OTTHUMG00000181804,1505,LSm proteins
9,HGNC:24489,LSM14A,LSM14A mRNA processing body assembly factor,Approved,gene with protein product,"C19orf13, FAM61A","DKFZP434D1335, RAP55A, RAP55",19q13.11,26065,ENSG00000257103,OTTHUMG00000180491,1505,LSm proteins


Let's get just the gene names.

In [20]:
spliceosome_genes = spliceosome['Approved_symbol']
spliceosome_genes

0         CASC3
1        EIF4A3
2         MAGOH
3        MAGOHB
4         RBM8A
         ...   
241       PRPF8
242     RNU5A-1
243    SNRNP200
244     SNRNP40
245      TXNL4A
Name: Approved_symbol, Length: 246, dtype: object

Now to subset all our variants for just those in spliceosome genes.

In [21]:
spliceosome_variants = data[data['annotation'].isin(spliceosome_genes)]
spliceosome_variants.head(10)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
21533,chr1,25549345,G,C,G,derived,snv,1/1,1/1,0/1,1/1,True,True,True,True,Shared,yes,37.0,5096.0,0.01,0.0,0.0,0.03,0.0,0.0,0.02924,no,,,,,,,,,,,,,,,,no,,0.01,SYF2,0.93809,0.25699,0.42189,2.0202,0.65,0.0,0.0,0.0,0.0,0.0,33,-6,48,-5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient
21534,chr1,25549970,G,C,g,derived,snv,0/0,0/0,0/1,0/0,False,False,True,False,Denisovan,yes,242.0,5096.0,0.05,0.0,0.0,0.17,0.01,0.0,0.173489,no,,,,,,,,,,,,,,,,no,,0.05,SYF2,0.93809,0.25699,0.42189,2.0202,-0.187,0.0,0.0,0.0,0.0,0.0,-48,-30,38,44,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient
21535,chr1,25550063,G,A,G,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,23.0,5096.0,0.0,0.0,0.0,0.02,0.0,0.0,0.018519,no,,,,,,,,,,,,,,,,no,,0.0,SYF2,0.93809,0.25699,0.42189,2.0202,-0.797,0.0,0.0,0.0,0.0,0.0,28,-30,28,-41,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient
21536,chr1,25551415,G,A,G,derived,snv,0/0,0/0,0/1,0/0,False,False,True,False,Denisovan,yes,12.0,5096.0,0.0,0.0,0.0,0.01,0.0,0.0,0.010721,no,,,,,,,,,,,,,,,,no,,0.0,SYF2,0.93809,0.25699,0.42189,2.0202,-0.467,0.0,0.0,0.0,0.0,0.0,-9,-23,0,-31,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient
21537,chr1,25554032,A,G,A,derived,snv,1/1,1/1,0/1,1/1,True,True,True,True,Shared,yes,37.0,5096.0,0.01,0.0,0.0,0.03,0.0,0.0,0.02924,no,,,,,,,,,,,,,,,,no,,0.01,SYF2,0.93809,0.25699,0.42189,2.0202,0.455,0.01,0.0,0.0,0.0,0.01,44,-9,-9,-33,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient
21538,chr1,25554269,T,A,T,derived,snv,0/0,0/0,0/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,SYF2,0.93809,0.25699,0.42189,2.0202,-0.25,0.0,0.0,0.0,0.0,0.0,-47,44,31,44,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
21539,chr1,25555672,C,T,T,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,923.0,5096.0,0.18,0.0,0.01,0.65,0.07,0.0,0.663743,no,,,,,,,,,,,,,,,,no,,0.18,SYF2,0.93809,0.25699,0.42189,2.0202,-0.269,0.02,0.0,0.0,0.0,0.02,-29,-50,6,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,ancient,ancient
21540,chr1,25556806,G,T,G,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,SYF2,0.93809,0.25699,0.42189,2.0202,0.4,0.0,0.0,0.0,0.0,0.0,-4,23,-4,-46,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
21541,chr1,25557165,G,A,g,derived,snv,0/0,0/0,0/1,0/0,False,False,True,False,Denisovan,yes,180.0,5096.0,0.04,0.0,0.0,0.13,0.01,0.0,0.12963,no,,,,,,,,,,,,,,,,no,,0.04,SYF2,0.93809,0.25699,0.42189,2.0202,-0.279,0.0,0.0,0.0,0.02,0.02,-5,-27,-27,-47,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient
21542,chr1,25557681,T,C,C,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,470.0,5096.0,0.09,0.0,0.0,0.33,0.03,0.0,0.340156,no,,,,,,,,,,,,,,,,no,,0.09,SYF2,0.93809,0.25699,0.42189,2.0202,-0.971,0.0,0.0,0.0,0.0,0.0,-1,-11,-4,-15,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient


How many variants did we get?

In [22]:
len(spliceosome_variants.groupby(['chrom','pos']))

4571

How many variants are uniquely archaic (i.e., absent from 1KG)?

In [23]:
spliceosome_variants = spliceosome_variants[spliceosome_variants['present_in_1KG'] == 'no']

In [24]:
len(spliceosome_variants.groupby(['chrom','pos']))

1866

Now let's format the data to run it through Ensembl's Variant Effect Predictor.

In [25]:
spliceosome_variants_for_VEP = spliceosome_variants[['chrom','pos','ref_allele','alt_allele']].copy()
spliceosome_variants_for_VEP['null'] = '.'
spliceosome_variants_for_VEP['chrom'] = spliceosome_variants_for_VEP['chrom'].str.replace('chr', '')
spliceosome_variants_for_VEP = spliceosome_variants_for_VEP[['chrom','pos','null','ref_allele','alt_allele','null','null','null']]
spliceosome_variants_for_VEP.head(10)

Unnamed: 0,chrom,pos,null,ref_allele,alt_allele,null.1,null.2,null.3
21538,1,25554269,.,T,A,.,.,.
21540,1,25556806,.,G,T,.,.,.
24704,1,31734327,.,T,C,.,.,.
24705,1,31735461,.,G,A,.,.,.
24712,1,31741832,.,C,T,.,.,.
24714,1,31743354,.,C,T,.,.,.
24715,1,31743490,.,T,C,.,.,.
24718,1,31745979,.,T,G,.,.,.
24723,1,31751687,.,G,A,.,.,.
24726,1,31754447,.,C,T,.,.,.


Save the file.

In [26]:
spliceosome_variants_for_VEP.to_csv('../spliceosome/all_spliceosome_variants_for_VEP.txt', sep="\t", header = False, index = False)
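The formatting step above builds an eight-column, VCF-like table (chromosome, position, ID, reference allele, alternate allele, and three placeholder fields). The sketch below wraps the same idea in a hypothetical helper, `to_vep_input`. The `drop_duplicates` call is an addition not present in the cell above: it assumes that a variant annotated to more than one gene should be submitted only once.

```python
import pandas as pd

def to_vep_input(variants):
    """Build an eight-column, VCF-like table from the notebook's variant columns."""
    out = variants[['chrom', 'pos', 'ref_allele', 'alt_allele']].drop_duplicates().copy()
    # Strip the 'chr' prefix, as in the cell above.
    out['chrom'] = out['chrom'].str.replace('chr', '', regex=False)
    # Fill the ID, QUAL, FILTER, and INFO fields with '.' placeholders.
    for placeholder in ['id', 'qual', 'filter', 'info']:
        out[placeholder] = '.'
    return out[['chrom', 'pos', 'id', 'ref_allele', 'alt_allele', 'qual', 'filter', 'info']]

# Toy input; the first variant appears twice, as it would with two annotations.
toy = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr1'],
    'pos': [25554269, 25554269, 25556806],
    'ref_allele': ['T', 'T', 'G'],
    'alt_allele': ['A', 'A', 'T'],
    'annotation': ['GENE_A', 'GENE_B', 'GENE_A'],
})
vep_input = to_vep_input(toy)
```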

Run this file through the Ensembl Variant Effect Predictor and save the results to the ../spliceosome/ directory as all_spliceosome_variants.vep.

In [27]:
all_spliceosome_variants = pd.read_csv("../spliceosome/all_spliceosome_variants.vep", sep='\t', header=0)
all_spliceosome_variants.head(10)

Unnamed: 0,#Uploaded_variation,Location,Allele,Consequence,IMPACT,SYMBOL,Gene,Feature_type,Feature,BIOTYPE,EXON,INTRON,HGVSc,HGVSp,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,DISTANCE,STRAND,FLAGS,SYMBOL_SOURCE,HGNC_ID,MANE_SELECT,MANE_PLUS_CLINICAL,TSL,APPRIS,SIFT,PolyPhen,AF,CLIN_SIG,SOMATIC,PHENO,PUBMED,MOTIF_NAME,MOTIF_POS,HIGH_INF_POS,MOTIF_SCORE_CHANGE,TRANSCRIPTION_FACTORS
0,.,1:25554269-25554269,A,intron_variant,MODIFIER,SYF2,ENSG00000117614,Transcript,ENST00000236273.4,protein_coding,-,4/6,-,-,-,-,-,-,-,-,-,-1,-,HGNC,19824,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
1,.,1:25554269-25554269,A,intron_variant,MODIFIER,SYF2,ENSG00000117614,Transcript,ENST00000354361.3,protein_coding,-,3/5,-,-,-,-,-,-,-,-,-,-1,-,HGNC,19824,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
2,.,1:25554269-25554269,A,downstream_gene_variant,MODIFIER,SYF2,ENSG00000117614,Transcript,ENST00000474160.1,processed_transcript,-,-,-,-,-,-,-,-,-,-,1237,-1,-,HGNC,19824,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
3,.,1:25554269-25554269,A,downstream_gene_variant,MODIFIER,SYF2,ENSG00000117614,Transcript,ENST00000476231.1,processed_transcript,-,-,-,-,-,-,-,-,-,-,2531,-1,-,HGNC,19824,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
4,.,1:25556806-25556806,T,intron_variant,MODIFIER,SYF2,ENSG00000117614,Transcript,ENST00000236273.4,protein_coding,-,2/6,-,-,-,-,-,-,-,rs1032515086,-,-1,-,HGNC,19824,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
5,.,1:25556806-25556806,T,intron_variant,MODIFIER,SYF2,ENSG00000117614,Transcript,ENST00000354361.3,protein_coding,-,2/5,-,-,-,-,-,-,-,rs1032515086,-,-1,-,HGNC,19824,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
6,.,1:25556806-25556806,T,"intron_variant,non_coding_transcript_variant",MODIFIER,SYF2,ENSG00000117614,Transcript,ENST00000474160.1,processed_transcript,-,1/1,-,-,-,-,-,-,-,rs1032515086,-,-1,-,HGNC,19824,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
7,.,1:25556806-25556806,T,non_coding_transcript_exon_variant,MODIFIER,SYF2,ENSG00000117614,Transcript,ENST00000476231.1,processed_transcript,2/2,-,-,-,1956,-,-,-,-,rs1032515086,-,-1,-,HGNC,19824,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
8,.,1:31734327-31734327,C,intron_variant,MODIFIER,SNRNP40,ENSG00000060688,Transcript,ENST00000263694.4,protein_coding,-,9/9,-,-,-,-,-,-,-,-,-,-1,-,HGNC,30857,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-
9,.,1:31734327-31734327,C,intron_variant,MODIFIER,SNRNP40,ENSG00000060688,Transcript,ENST00000373720.3,protein_coding,-,3/3,-,-,-,-,-,-,-,-,-,-1,-,HGNC,30857,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-


Subset to non-synonymous variants, defined here as rows whose consequence is exactly 'missense_variant'.

In [28]:
nonsynonymous_all_spliceosome_variants = all_spliceosome_variants[all_spliceosome_variants['Consequence'] == 'missense_variant']
nonsynonymous_all_spliceosome_variants.head(10)

Unnamed: 0,#Uploaded_variation,Location,Allele,Consequence,IMPACT,SYMBOL,Gene,Feature_type,Feature,BIOTYPE,EXON,INTRON,HGVSc,HGVSp,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,DISTANCE,STRAND,FLAGS,SYMBOL_SOURCE,HGNC_ID,MANE_SELECT,MANE_PLUS_CLINICAL,TSL,APPRIS,SIFT,PolyPhen,AF,CLIN_SIG,SOMATIC,PHENO,PUBMED,MOTIF_NAME,MOTIF_POS,HIGH_INF_POS,MOTIF_SCORE_CHANGE,TRANSCRIPTION_FACTORS
582,.,1:150297443-150297443,G,missense_variant,MODERATE,PRPF3,ENSG00000117360,Transcript,ENST00000324862.6,protein_coding,2/16,-,-,-,208,43,15,I/V,Ata/Gta,rs1395340408,-,1,-,HGNC,17348,-,-,-,-,tolerated(0.67),benign(0.001),-,-,-,-,-,-,-,-,-,-
583,.,1:150297443-150297443,G,missense_variant,MODERATE,PRPF3,ENSG00000117360,Transcript,ENST00000414970.2,protein_coding,2/15,-,-,-,131,43,15,I/V,Ata/Gta,rs1395340408,-,1,-,HGNC,17348,-,-,-,-,tolerated(0.75),benign(0),-,-,-,-,-,-,-,-,-,-
1680,.,11:116633557-116633557,G,missense_variant,MODERATE,BUD13,ENSG00000137656,Transcript,ENST00000260210.4,protein_coding,4/10,-,-,-,772,748,250,D/H,Gac/Cac,rs1425200283,-,-1,-,HGNC,28199,-,-,-,-,deleterious(0),benign(0.341),-,-,-,-,-,-,-,-,-,-
2042,.,13:21714974-21714974,G,missense_variant,MODERATE,SAP18,ENSG00000150459,Transcript,ENST00000450573.1,protein_coding,1/3,-,-,-,62,13,5,F/V,Ttc/Gtc,-,-,1,cds_end_NF,HGNC,10530,-,-,-,-,tolerated_low_confidence(1),unknown(0),-,-,-,-,-,-,-,-,-,-
3889,.,15:35219290-35219290,C,missense_variant,MODERATE,AQR,ENSG00000021776,Transcript,ENST00000156471.5,protein_coding,13/35,-,-,-,1290,1064,355,N/S,aAt/aGt,rs767183682,-,-1,-,HGNC,29513,-,-,-,-,tolerated(0.09),benign(0.062),-,-,-,-,-,-,-,-,-,-
4175,.,16:2810331-2810331,T,missense_variant,MODERATE,SRRM2,ENSG00000167978,Transcript,ENST00000301740.8,protein_coding,10/15,-,-,-,1412,863,288,T/I,aCa/aTa,rs747144756,-,1,-,HGNC,16639,-,-,-,-,deleterious_low_confidence(0.04),benign(0.006),-,-,-,-,-,-,-,-,-,-
4179,.,16:2810331-2810331,T,missense_variant,MODERATE,SRRM2,ENSG00000167978,Transcript,ENST00000571378.1,protein_coding,9/10,-,-,-,760,575,192,T/I,aCa/aTa,rs747144756,-,1,cds_end_NF,HGNC,16639,-,-,-,-,tolerated_low_confidence(0.18),benign(0.006),-,-,-,-,-,-,-,-,-,-
4185,.,16:2810331-2810331,T,missense_variant,MODERATE,SRRM2,ENSG00000167978,Transcript,ENST00000575009.1,protein_coding,9/10,-,-,-,830,575,192,T/I,aCa/aTa,rs747144756,-,1,cds_end_NF,HGNC,16639,-,-,-,-,tolerated_low_confidence(0.22),benign(0.006),-,-,-,-,-,-,-,-,-,-
4192,.,16:2810331-2810331,T,missense_variant,MODERATE,SRRM2,ENSG00000167978,Transcript,ENST00000576924.1,protein_coding,10/11,-,-,-,1123,863,288,T/I,aCa/aTa,rs747144756,-,1,cds_end_NF,HGNC,16639,-,-,-,-,tolerated_low_confidence(0.13),benign(0.006),-,-,-,-,-,-,-,-,-,-
4229,.,16:2813610-2813610,C,missense_variant,MODERATE,SRRM2,ENSG00000167978,Transcript,ENST00000301740.8,protein_coding,11/15,-,-,-,3630,3081,1027,Q/H,caA/caC,rs530922815,-,1,-,HGNC,16639,-,-,-,-,deleterious_low_confidence(0.04),benign(0.365),-,-,-,-,-,-,-,-,-,-
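VEP can report several comma-separated consequence terms for one row, as in the 'intron_variant,non_coding_transcript_variant' row shown earlier. An exact match on 'missense_variant' skips rows where that term is combined with another. The sketch below, on invented rows, contrasts the exact match with a per-term test. Whether such compound rows exist in this dataset is not something the output above shows.

```python
import pandas as pd

# Toy VEP-style rows; the consequence terms are standard, the rows are invented.
vep = pd.DataFrame({
    'Location': ['1:100-100', '1:200-200', '1:300-300', '1:400-400'],
    'Consequence': [
        'missense_variant',
        'missense_variant,splice_region_variant',
        'intron_variant',
        'stop_gained',
    ],
})

# Exact match, as in the cell above: keeps only the first row.
exact = vep[vep['Consequence'] == 'missense_variant']

# Split the comma-separated terms and test each one: keeps the first two rows.
has_missense = vep['Consequence'].str.split(',').apply(lambda terms: 'missense_variant' in terms)
any_missense = vep[has_missense]
```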


How many variant-transcript pairs are affected?

In [29]:
len(nonsynonymous_all_spliceosome_variants)

36

How many variants does this represent?

In [30]:
nonsynonymous_all_spliceosome_variants.groupby(by=['Location', 'Allele', 'Gene']).size().reset_index(name='transcript_count')

Unnamed: 0,Location,Allele,Gene,transcript_count
0,11:116633557-116633557,G,ENSG00000137656,1
1,13:21714974-21714974,G,ENSG00000150459,1
2,15:35219290-35219290,C,ENSG00000021776,1
3,16:2810331-2810331,T,ENSG00000167978,4
4,16:2813610-2813610,C,ENSG00000167978,1
5,16:2817499-2817499,G,ENSG00000167978,1
6,16:2819197-2819197,C,ENSG00000167978,1
7,17:1554120-1554120,T,ENSG00000174231,1
8,17:1557188-1557188,A,ENSG00000174231,2
9,17:36963046-36963046,C,ENSG00000108296,2


Now let's look at the potential effects, starting with SIFT designations.

In [31]:
nonsynonymous_all_spliceosome_variants_SIFT = nonsynonymous_all_spliceosome_variants.groupby(by=['Location','SIFT']).size().reset_index(name='SIFT_count')
nonsynonymous_all_spliceosome_variants_SIFT

Unnamed: 0,Location,SIFT,SIFT_count
0,11:116633557-116633557,deleterious(0),1
1,13:21714974-21714974,tolerated_low_confidence(1),1
2,15:35219290-35219290,tolerated(0.09),1
3,16:2810331-2810331,deleterious_low_confidence(0.04),1
4,16:2810331-2810331,tolerated_low_confidence(0.13),1
5,16:2810331-2810331,tolerated_low_confidence(0.18),1
6,16:2810331-2810331,tolerated_low_confidence(0.22),1
7,16:2813610-2813610,deleterious_low_confidence(0.04),1
8,16:2817499-2817499,deleterious_low_confidence(0),1
9,16:2819197-2819197,deleterious_low_confidence(0),1


In [32]:
SIFT_deleterious = nonsynonymous_all_spliceosome_variants_SIFT[nonsynonymous_all_spliceosome_variants_SIFT['SIFT'].str.contains('deleterious')]
SIFT_deleterious

Unnamed: 0,Location,SIFT,SIFT_count
0,11:116633557-116633557,deleterious(0),1
3,16:2810331-2810331,deleterious_low_confidence(0.04),1
7,16:2813610-2813610,deleterious_low_confidence(0.04),1
8,16:2817499-2817499,deleterious_low_confidence(0),1
9,16:2819197-2819197,deleterious_low_confidence(0),1
10,17:1554120-1554120,deleterious_low_confidence(0),1
14,19:36124120-36124120,deleterious_low_confidence(0),5
16,19:54663336-54663336,deleterious(0.02),1


And PolyPhen.

In [33]:
nonsynonymous_all_spliceosome_variants_PolyPhen = nonsynonymous_all_spliceosome_variants.groupby(by=['Location','PolyPhen']).size().reset_index(name='PolyPhen_count')
nonsynonymous_all_spliceosome_variants_PolyPhen

Unnamed: 0,Location,PolyPhen,PolyPhen_count
0,11:116633557-116633557,benign(0.341),1
1,13:21714974-21714974,unknown(0),1
2,15:35219290-35219290,benign(0.062),1
3,16:2810331-2810331,benign(0.006),4
4,16:2813610-2813610,benign(0.365),1
5,16:2817499-2817499,benign(0.007),1
6,16:2819197-2819197,benign(0.106),1
7,17:1554120-1554120,unknown(0),1
8,17:1557188-1557188,benign(0.063),2
9,17:36963046-36963046,benign(0.034),1


In [34]:
PolyPhen_damaging = nonsynonymous_all_spliceosome_variants_PolyPhen[nonsynonymous_all_spliceosome_variants_PolyPhen['PolyPhen'].str.contains('damaging')]
PolyPhen_damaging

Unnamed: 0,Location,PolyPhen,PolyPhen_count
12,19:36124120-36124120,probably_damaging(0.993),1
13,19:36124120-36124120,probably_damaging(0.997),5
14,19:54663336-54663336,possibly_damaging(0.592),1
19,20:37632435-37632435,possibly_damaging(0.454),1
20,20:37632435-37632435,possibly_damaging(0.506),1
21,5:150080190-150080190,possibly_damaging(0.711),1
23,5:176940736-176940736,possibly_damaging(0.451),1


Do SIFT and PolyPhen agree on any positions?

In [35]:
pd.merge(SIFT_deleterious, PolyPhen_damaging, on = ['Location'])

Unnamed: 0,Location,SIFT,SIFT_count,PolyPhen,PolyPhen_count
0,19:36124120-36124120,deleterious_low_confidence(0),5,probably_damaging(0.993),1
1,19:36124120-36124120,deleterious_low_confidence(0),5,probably_damaging(0.997),5
2,19:54663336-54663336,deleterious(0.02),1,possibly_damaging(0.592),1
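Because the merge uses `Location` alone, every SIFT row for a location is paired with every PolyPhen row for that location, so one position can appear more than once in the result. The sketch below reuses a few of the values displayed above to show this, and then lists each agreeing location once with a set intersection.

```python
import pandas as pd

# A subset of the values shown in the outputs above.
sift = pd.DataFrame({
    'Location': ['19:36124120-36124120', '19:54663336-54663336', '11:116633557-116633557'],
    'SIFT': ['deleterious_low_confidence(0)', 'deleterious(0.02)', 'deleterious(0)'],
})
polyphen = pd.DataFrame({
    'Location': ['19:36124120-36124120', '19:36124120-36124120', '19:54663336-54663336'],
    'PolyPhen': ['probably_damaging(0.993)', 'probably_damaging(0.997)', 'possibly_damaging(0.592)'],
})

# Three rows: the first location is paired with both of its PolyPhen rows.
merged = pd.merge(sift, polyphen, on='Location')

# Two unique locations flagged by both tools.
agreeing_locations = sorted(set(sift['Location']) & set(polyphen['Location']))
```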


Now let's save the dataframe of non-synonymous variants for any downstream analyses.

In [36]:
nonsynonymous_all_spliceosome_variants.to_csv('../spliceosome/nonsynonymous_all_spliceosome_variants.txt', sep="\t", header = True, index = False)

# Data Description <a class = 'anchor' id = 'datadescription'></a>

Let's get a handle on the dataset as a whole. Then we will dive into the SAVs.

Count the number of positions.

In [37]:
len(data.groupby(['chrom','pos']))

1567894

Count the number of variants (one row per variant-annotation pair).

In [38]:
len(data)

1607350
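The two counts differ because a position can contribute several rows (multiple annotations or multiple alleles). A toy sketch of the distinction follows, along with two equivalent ways of counting unique positions. The gene names and coordinates are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr1', 'chr2'],
    'pos': [100, 100, 200, 100],
    'annotation': ['GENE_A', 'GENE_B', 'GENE_A', 'GENE_C'],
})

# Rows: one per variant-annotation pair.
n_rows = len(toy)

# Unique positions, counted three equivalent ways.
n_positions_groupby = len(toy.groupby(['chrom', 'pos']))
n_positions_ngroups = toy.groupby(['chrom', 'pos']).ngroups
n_positions_dedup = len(toy[['chrom', 'pos']].drop_duplicates())
```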

## Multiple Annotations <a class = 'anchor' id = 'multipleannotations'></a>

Let's assess how many variants have more than one annotation.

In [39]:
multi_annotation = data.groupby(by=['chrom','pos','ref_allele','altai_gt','chagyrskaya_gt','denisovan_gt','vindija_gt','alt_allele']).size().reset_index(name='annotation_count')
multi_annotation.groupby('annotation_count').size().to_frame('count')

Unnamed: 0_level_0,count
annotation_count,Unnamed: 1_level_1
1,1537451
2,30448
3,1154
4,75
5,36
6,15
7,25
8,22
9,36
10,20


## Multiple Alleles <a class = 'anchor' id = 'multiplealleles'></a>

Let's assess the degree of multi-allelism in the data (after collapsing multiple annotations to a single row per variant).

In [40]:
single_annotation_data = data.drop_duplicates(subset=['chrom','pos','ref_allele','altai_gt','chagyrskaya_gt','denisovan_gt','vindija_gt','alt_allele','variant_type'], keep='first').reset_index()
len(single_annotation_data)

1569556

In [41]:
multi_allelic = single_annotation_data.groupby(['chrom','pos']).size().reset_index(name='allele_count')
multi_allelic.head(10)

Unnamed: 0,chrom,pos,allele_count
0,chr1,861808,1
1,chr1,862072,1
2,chr1,862093,1
3,chr1,862124,1
4,chr1,862383,1
5,chr1,862389,1
6,chr1,863124,1
7,chr1,863843,1
8,chr1,863863,1
9,chr1,863978,1


In [42]:
multi_allelic.groupby('allele_count').size().to_frame('allele count')

Unnamed: 0_level_0,allele count
allele_count,Unnamed: 1_level_1
1,1566233
2,1660
3,1


## Genotypes <a class = 'anchor' id = 'genotypes'></a>

Now let's take a look at the distribution of genotypes across individuals. 

In [43]:
data.groupby(['altai_gt']).size().to_frame('size')

Unnamed: 0_level_0,size
altai_gt,Unnamed: 1_level_1
./.,6226
0/0,591371
0/1,126120
1/0,320
1/1,883313


In [44]:
data.groupby(['chagyrskaya_gt']).size().to_frame('size')

Unnamed: 0_level_0,size
chagyrskaya_gt,Unnamed: 1_level_1
./.,37156
0/0,583787
0/1,97932
1/0,144
1/1,888331


In [45]:
data.groupby(['denisovan_gt']).size().to_frame('size')

Unnamed: 0_level_0,size
denisovan_gt,Unnamed: 1_level_1
./.,21574
0/0,546792
0/1,139522
1/0,252
1/1,899210


In [46]:
data.groupby(['vindija_gt']).size().to_frame('size')

Unnamed: 0_level_0,size
vindija_gt,Unnamed: 1_level_1
./.,29672
0/0,570175
0/1,119718
1/0,200
1/1,887585
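The four genotype tables above can also be produced side by side in one step. This is a sketch on toy genotypes. Only the column names come from this notebook, and genotypes missing from an individual are filled with 0.

```python
import pandas as pd

# Toy genotypes for illustration only.
toy = pd.DataFrame({
    'altai_gt': ['0/0', '1/1', '0/1', './.'],
    'chagyrskaya_gt': ['0/0', '1/1', '1/1', '1/1'],
    'denisovan_gt': ['1/1', '0/0', '0/1', '0/0'],
    'vindija_gt': ['0/0', '1/1', '0/1', '1/1'],
})

genotype_columns = ['altai_gt', 'chagyrskaya_gt', 'denisovan_gt', 'vindija_gt']

# One column of counts per individual, aligned on the genotype labels.
genotype_counts = pd.DataFrame({
    column: toy[column].value_counts() for column in genotype_columns
}).fillna(0).astype(int)
```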


## Annotations <a class = 'anchor' id = 'annotations'></a>

How many annotations are represented in the total data?

In [47]:
len(data.groupby(['annotation']))

17631

What is the distribution of these annotations?

In [48]:
n_annotations = data.groupby(['annotation']).size().to_frame('count')
n_annotations.groupby('count').size().to_frame('n')

Unnamed: 0_level_0,n
count,Unnamed: 1_level_1
1,577
2,546
3,473
4,475
5,427
...,...
3868,1
3956,1
4747,1
5413,1


## Delta Thresholds <a class = 'anchor' id = 'deltathresholds'></a>

Let's subset the data twice: once for variants whose maximum delta is >= 0.2 and once for variants whose maximum delta is >= 0.5.

In [49]:
data_2 = data[data['delta_max']>=0.2]
data_2.head(10)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
38,chr1,864726,T,A,T,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,4859.0,5096.0,0.95,1.0,1.0,0.83,0.99,1.0,0.812865,no,,,,,,,,,,,,,,,,no,,0.95,SAMD11,1.5082,-3.4361,0.89656,0.47484,0.398,0.0,0.0,0.02,0.22,0.22,17,-2,43,-2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,ancient,ancient
79,chr1,875708,C,G,T,derived,snv,1/1,./.,./.,./.,True,False,False,False,Altai,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,SAMD11,1.5082,-3.4361,0.89656,0.47484,-0.847,0.0,0.0,0.29,0.01,0.29,-43,33,0,12,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
145,chr1,909768,A,G,-,.,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,5094.0,5096.0,1.0,1.0,1.0,1.0,1.0,1.0,0.998051,no,,,,,,,,,,,,,,,,no,,1.0,PLEKHN1,1.2593,-1.7988,0.80467,1.0089,-1.259,0.0,0.29,0.13,0.03,0.29,7,26,-24,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,ancient,ancient
170,chr1,956622,C,T,C,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,4.0,5096.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002924,no,,,,,,,,,,,,,,,,no,,0.0,AGRN,0.98257,0.22552,0.31914,6.0143,-0.542,0.0,0.0,0.24,0.02,0.24,-2,50,-2,40,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient
225,chr1,975133,T,A,T,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,4557.0,5096.0,0.89,1.0,0.92,0.72,0.92,0.97,0.72807,no,,,,,,,,,,,,,,,,no,,0.89,AGRN,0.98257,0.22552,0.31914,6.0143,0.364,0.0,0.0,0.22,0.08,0.22,-3,14,-3,14,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,15.0,yes,ancient,ancient
233,chr1,980460,G,A,G,derived,snv,1/1,./.,1/1,1/1,True,False,True,True,Other,yes,4642.0,5096.0,0.91,1.0,0.93,0.77,0.93,0.97,0.781676,no,,,,,,,,,,,,,,,,no,,0.91,AGRN,0.98257,0.22552,0.31914,6.0143,0.561,0.0,0.0,0.03,0.43,0.43,30,-33,30,0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,14.0,yes,ancient,ancient
239,chr1,982099,T,C,T,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,AGRN,0.98257,0.22552,0.31914,6.0143,-0.375,0.0,0.0,0.04,0.48,0.48,-48,-6,16,-6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
345,chr1,1117486,A,G,a,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,1493.0,5096.0,0.29,0.15,0.13,0.54,0.21,0.32,0.547758,no,,,,,,,,,,,,,,,,no,,0.29,TTLL10,0.9539,0.32998,0.77894,1.038,-2.522,0.0,0.0,0.27,0.0,0.27,-1,33,-1,-48,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,5.0,yes,ancient,ancient
392,chr1,1153630,G,A,G,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,SDF4,0.90199,0.55251,0.3872,2.2354,-1.558,0.04,0.0,0.38,0.0,0.38,22,-45,2,23,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
424,chr1,1166887,T,C,T,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,898.0,5096.0,0.18,0.17,0.13,0.26,0.11,0.17,0.269981,no,,,,,,,,,,,,,,,,no,,0.18,SDF4,0.90199,0.55251,0.3872,2.2354,0.26,0.0,0.0,0.52,0.0,0.52,-9,-2,0,-4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,49.0,yes,ancient,ancient


In [50]:
len(data_2)

5950

In [51]:
len(data_2.groupby(['chrom','pos']))

5811

Now let's examine multiple annotations in these data.

In [52]:
SAV_2_multiple_annotation = data_2.groupby(['chrom','pos',]).size().reset_index(name='SAV_multiple_annotation')
SAV_2_multiple_annotation.groupby('SAV_multiple_annotation').size().to_frame('count')

SAV_multiple_annotation,count
1,5679
2,129
3,2
7,1


Now for delta >= 0.5.

In [53]:
data_5 = data[data['delta_max']>=0.5]
data_5.head(10)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
424,chr1,1166887,T,C,T,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,898.0,5096.0,0.18,0.17,0.13,0.26,0.11,0.17,0.269981,no,,,,,,,,,,,,,,,,no,,0.18,SDF4,0.90199,0.55251,0.3872,2.2354,0.26,0.0,0.0,0.52,0.0,0.52,-9,-2,0,-4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,49.0,yes,ancient,ancient
1912,chr1,2524237,A,C,C,ancestral,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,1.0,5096.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,,no,,0.0,MMEL1,0.98394,0.12602,0.76824,1.5689,-0.92,0.01,0.0,0.0,0.56,0.56,-42,10,35,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient
3180,chr1,3445088,C,A,C,derived,snv,1/1,0/0,0/0,0/0,True,False,False,False,Altai,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,MEGF6,0.95847,0.45803,0.73856,2.2728,0.205,0.0,0.0,0.0,0.73,0.73,-4,44,2,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
4268,chr1,5935162,A,T,T,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,NPHP4,1.0229,-0.24249,0.78155,1.6669,-0.557,1.0,0.91,0.0,0.0,1.0,-2,-8,-43,-44,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
4658,chr1,6207798,A,G,A,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,2126.0,5096.0,0.42,0.64,0.23,0.54,0.17,0.39,0.550682,no,,,,,,,,,,,,,,,,no,,0.42,CHD5,0.56491,5.3168,0.089818,8.4428,-0.331,0.0,0.0,0.06,0.61,0.61,-31,36,26,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,ancient,ancient
5480,chr1,6930684,G,A,G,derived,snv,0/0,0/0,0/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,CAMTA1,0.71229,3.2619,0.091344,7.371,0.555,0.64,0.0,0.01,0.0,0.64,2,-9,23,-9,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
7161,chr1,7731131,G,A,G,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,CAMTA1,0.71229,3.2619,0.091344,7.371,0.655,0.0,0.0,0.02,0.71,0.71,48,-5,-34,-5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
7504,chr1,8395370,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,SLC45A1,0.81558,1.4658,0.45953,2.5593,-0.441,0.01,0.0,0.67,0.03,0.67,-31,33,-1,-32,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
10044,chr1,11009679,G,A,G,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,352.0,5096.0,0.07,0.0,0.04,0.2,0.01,0.02,0.207602,no,,,,,,,,,,,,,,,,no,,0.07,C1orf127,0.91367,0.66009,0.7232,1.3822,-0.354,0.0,0.0,0.83,0.0,0.83,2,-44,2,-28,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,yes,low-confidence ancient,low-confidence ancient
14279,chr1,16373634,A,G,a,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,CLCNKB,1.0791,-0.56265,0.93478,0.36979,0.26,0.0,0.0,0.75,0.0,0.75,-1,28,-1,-50,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific


In [54]:
len(data_5)

1049

In [55]:
len(data_5.groupby(['chrom','pos']))

1017

In [56]:
SAV_5_multiple_annotation = data_5.groupby(['chrom','pos',]).size().reset_index(name='SAV_multiple_annotation')
SAV_5_multiple_annotation.groupby('SAV_multiple_annotation').size().to_frame('count')

SAV_multiple_annotation,count
1,986
2,30
3,1


Now let's count the number of SAVs in each class (acceptor gain, acceptor loss, donor gain, and donor loss) at both thresholds.

In [57]:
len(data_2[(data_2['ag_delta']>=0.2)])

1511

In [58]:
len(data_2[(data_2['al_delta']>=0.2)])

1377

In [59]:
len(data_2[(data_2['dg_delta']>=0.2)])

1894

In [60]:
len(data_2[(data_2['dl_delta']>=0.2)])

1647

In [61]:
len(data_5[(data_5['ag_delta']>=0.5)])

180

In [62]:
len(data_5[(data_5['al_delta']>=0.5)])

191

In [63]:
len(data_5[(data_5['dg_delta']>=0.5)])

390

In [64]:
len(data_5[(data_5['dl_delta']>=0.5)])

349
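
The eight counts above can also be produced in one table. The following is a minimal sketch on a made-up toy dataframe, not the notebook's `data`; the helper name `count_by_class` is hypothetical. It assumes `delta_max` is the row-wise maximum of the four deltas, so that counting `ag_delta >= 0.2` in `data_2` is the same as counting it in the unfiltered dataframe.

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe (values are invented).
toy = pd.DataFrame({
    'ag_delta': [0.25, 0.00, 0.60, 0.01],
    'al_delta': [0.00, 0.30, 0.55, 0.00],
    'dg_delta': [0.10, 0.00, 0.00, 0.52],
    'dl_delta': [0.00, 0.21, 0.00, 0.00],
})
delta_cols = ['ag_delta', 'al_delta', 'dg_delta', 'dl_delta']

def count_by_class(df, thresholds=(0.2, 0.5)):
    """Count rows at or above each threshold, per delta class.

    Rows of the result are the delta classes; columns are the thresholds.
    """
    return pd.DataFrame(
        {f'>={t}': (df[delta_cols] >= t).sum() for t in thresholds}
    )

class_counts = count_by_class(toy)
print(class_counts)
```

Applied to the notebook's dataframe, this should give the same numbers as the individual `len(...)` cells, provided the assumption about `delta_max` holds.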

## Variant Distribution <a class = 'anchor' id = 'variantdistribution'></a>

In [65]:
data.groupby(['distribution']).size().to_frame('distribution')

distribution,distribution
Altai,81916
Chagyrskaya,53675
Denisovan,411492
Late Neanderthal,44651
Neanderthal,266965
Other,104455
Shared,573197
Vindija,70999


In [66]:
data_2.groupby(['distribution']).size().to_frame('distribution')

distribution,distribution
Altai,399
Chagyrskaya,218
Denisovan,1492
Late Neanderthal,172
Neanderthal,973
Other,456
Shared,1933
Vindija,307


In [67]:
data_5.groupby(['distribution']).size().to_frame('distribution')

distribution,distribution
Altai,75
Chagyrskaya,35
Denisovan,254
Late Neanderthal,22
Neanderthal,178
Other,84
Shared,328
Vindija,73


## Delta Correlations <a class = 'anchor' id = 'deltacorrelations'></a>

Let's examine the correlations between deltas. Many rows have zeroes for all four deltas, so we'll first exclude those rows because they would bias the correlations.

In [68]:
data_nonzero = data[(data['ag_delta']>0) | (data['al_delta']>0) | (data['dg_delta']>0) | (data['dl_delta']>0)]
data_nonzero.head(10)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
0,chr1,861808,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,3479.0,5096.0,0.68,0.55,0.97,0.34,0.83,0.88,0.31384,no,,,,,,,,,,,,,,,,no,,0.68,SAMD11,1.5082,-3.4361,0.89656,0.47484,-0.683,0.0,0.0,0.01,0.0,0.01,-29,-27,24,-20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,yes,ancient,ancient
33,chr1,863978,G,A,g,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,497.0,5096.0,0.1,0.15,0.01,0.18,0.11,0.03,0.180312,no,,,,,,,,,,,,,,,,no,,0.1,AL645608.1,1.2217,-0.64548,0.73515,0.49579,0.197,0.0,0.0,0.0,0.01,0.01,-15,4,2,-14,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,ancient,ancient
34,chr1,864678,C,T,C,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,SAMD11,1.5082,-3.4361,0.89656,0.47484,0.486,0.0,0.0,0.03,0.0,0.03,-48,26,46,-14,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
35,chr1,864678,C,T,C,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,AL645608.1,1.2217,-0.64548,0.73515,0.49579,0.486,0.01,0.02,0.0,0.0,0.02,-2,44,3,5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
36,chr1,864701,A,G,A,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,10.0,5096.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,,yes,1.0,0.0,SAMD11,1.5082,-3.4361,0.89656,0.47484,-0.423,0.0,0.0,0.08,0.02,0.08,1,3,23,-10,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,introgressed
37,chr1,864701,A,G,A,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,10.0,5096.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,,yes,1.0,0.0,AL645608.1,1.2217,-0.64548,0.73515,0.49579,-0.423,0.04,0.0,0.0,0.0,0.04,21,-18,-28,8,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,introgressed
38,chr1,864726,T,A,T,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,4859.0,5096.0,0.95,1.0,1.0,0.83,0.99,1.0,0.812865,no,,,,,,,,,,,,,,,,no,,0.95,SAMD11,1.5082,-3.4361,0.89656,0.47484,0.398,0.0,0.0,0.02,0.22,0.22,17,-2,43,-2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,ancient,ancient
39,chr1,864726,T,A,T,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,4859.0,5096.0,0.95,1.0,1.0,0.83,0.99,1.0,0.812865,no,,,,,,,,,,,,,,,,no,,0.95,AL645608.1,1.2217,-0.64548,0.73515,0.49579,0.398,0.0,0.03,0.0,0.0,0.03,1,-4,36,-43,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,ancient,ancient
40,chr1,864898,A,G,A,derived,snv,1/1,1/1,0/0,0/1,True,True,False,True,Neanderthal,yes,10.0,5096.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,,yes,1.0,0.0,SAMD11,1.5082,-3.4361,0.89656,0.47484,0.454,0.0,0.0,0.01,0.0,0.01,-1,48,-1,48,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,introgressed
41,chr1,864898,A,G,A,derived,snv,1/1,1/1,0/0,0/1,True,True,False,True,Neanderthal,yes,10.0,5096.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,,yes,1.0,0.0,AL645608.1,1.2217,-0.64548,0.73515,0.49579,0.454,0.03,0.0,0.0,0.0,0.03,-4,1,-4,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,introgressed


In [69]:
len(data_nonzero)

188302

Run correlations.

In [70]:
rho, p = spearmanr(data_nonzero['ag_delta'], data_nonzero['al_delta'])
print(rho,p)

-0.14267746909602813 0.0


In [71]:
rho, p = spearmanr(data_nonzero['ag_delta'], data_nonzero['dg_delta'])
print(rho,p)

-0.2239766968064798 0.0


In [72]:
rho, p = spearmanr(data_nonzero['ag_delta'], data_nonzero['dl_delta'])
print(rho,p)

-0.34023522469107176 0.0


In [73]:
rho, p = spearmanr(data_nonzero['al_delta'], data_nonzero['dg_delta'])
print(rho,p)

-0.33985645661346087 0.0


In [74]:
rho, p = spearmanr(data_nonzero['al_delta'], data_nonzero['dl_delta'])
print(rho,p)

-0.2146624536999448 0.0


In [75]:
rho, p = spearmanr(data_nonzero['dg_delta'], data_nonzero['dl_delta'])
print(rho,p)

-0.2036868319219267 0.0


Repeat for splice-altering variants (delta >= 0.2).

In [76]:
rho, p = spearmanr(data_2['ag_delta'], data_2['al_delta'])
print(rho,p)

0.1500250528757491 2.7088750503541857e-31


In [77]:
rho, p = spearmanr(data_2['ag_delta'], data_2['dg_delta'])
print(rho,p)

-0.32319408269234423 1.0314796151065124e-144


In [78]:
rho, p = spearmanr(data_2['ag_delta'], data_2['dl_delta'])
print(rho,p)

-0.46608147540520983 0.0


In [79]:
rho, p = spearmanr(data_2['al_delta'], data_2['dg_delta'])
print(rho,p)

-0.48795009291846625 0.0


In [80]:
rho, p = spearmanr(data_2['al_delta'], data_2['dl_delta'])
print(rho,p)

-0.27635622346475713 9.11316178310797e-105


In [81]:
rho, p = spearmanr(data_2['dg_delta'], data_2['dl_delta'])
print(rho,p)

-0.02349927032949021 0.06990668106164923


## Multiple Deltas <a class = 'anchor' id = 'multipledeltas'></a>

How many SAVs have only a single non-zero delta?

In [82]:
delta_counts = (data_2[['ag_delta','al_delta','dg_delta','dl_delta']] > 0).sum(1).to_frame('count')
len(delta_counts[delta_counts['count'] == 1])

2845
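
Note that the cell above counts deltas greater than zero, not deltas that reach the SAV threshold. The sketch below, on invented toy values, tabulates both quantities per variant so the two definitions can be compared; the variable names are illustrative only.

```python
import pandas as pd

# Toy stand-in for data_2 (values are invented).
toy = pd.DataFrame({
    'ag_delta': [0.25, 0.00, 0.60, 0.21],
    'al_delta': [0.01, 0.30, 0.55, 0.00],
    'dg_delta': [0.00, 0.00, 0.00, 0.00],
    'dl_delta': [0.00, 0.00, 0.02, 0.00],
})
delta_cols = ['ag_delta', 'al_delta', 'dg_delta', 'dl_delta']

# Number of non-zero deltas per variant (the definition used above).
n_nonzero = (toy[delta_cols] > 0).sum(axis=1)
# Number of deltas per variant that reach the 0.2 threshold.
n_at_threshold = (toy[delta_cols] >= 0.2).sum(axis=1)

# How many variants have 1, 2, 3, or 4 such deltas.
nonzero_table = n_nonzero.value_counts().sort_index()
threshold_table = n_at_threshold.value_counts().sort_index()
print(nonzero_table)
print(threshold_table)
```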

Let's assess the number of variants that result in both a gain and a loss at the acceptor (AG and AL) or at the donor (DG and DL). Presumably, these are strong candidates for alternative splicing.

In [83]:
len(data_nonzero[(data_nonzero['ag_delta'] >= 0.2) & (data_nonzero['al_delta'] >= 0.2)])

119

In [84]:
len(data_nonzero[(data_nonzero['dg_delta'] >= 0.2) & (data_nonzero['dl_delta'] >= 0.2)])

108

Now let's examine when both the donor and acceptor are affected. These events likely represent exon skipping (Jaganathan et al. 2019).

In [85]:
len(data_nonzero[(data_nonzero['ag_delta'] >= 0.2) & (data_nonzero['dg_delta'] >= 0.2)])

119

In [86]:
len(data_nonzero[(data_nonzero['ag_delta'] >= 0.2) & (data_nonzero['dl_delta'] >= 0.2)])

6

In [87]:
len(data_nonzero[(data_nonzero['al_delta'] >= 0.2) & (data_nonzero['dg_delta'] >= 0.2)])

6

In [88]:
len(data_nonzero[(data_nonzero['al_delta'] >= 0.2) & (data_nonzero['dl_delta'] >= 0.2)])

126

Let's save these cases of exon skipping to a dataframe.

In [89]:
exon_skipping_2 = data_nonzero[((data_nonzero['ag_delta'] >= 0.2) & (data_nonzero['dg_delta'] >= 0.2)) | ((data_nonzero['ag_delta'] >= 0.2) & (data_nonzero['dl_delta'] >= 0.2)) | ((data_nonzero['al_delta'] >= 0.2) & (data_nonzero['dg_delta'] >= 0.2)) | ((data_nonzero['al_delta'] >= 0.2) & (data_nonzero['dl_delta'] >= 0.2))] 

In [90]:
len(exon_skipping_2)

252
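
The four-way OR used for `exon_skipping_2` is logically the same as "any acceptor delta at threshold AND any donor delta at threshold". The sketch below checks that equivalence on invented toy values; both function names are hypothetical.

```python
import pandas as pd

# Toy stand-in for data_nonzero (values are invented).
toy = pd.DataFrame({
    'ag_delta': [0.30, 0.00, 0.30, 0.00, 0.70],
    'al_delta': [0.00, 0.25, 0.30, 0.00, 0.00],
    'dg_delta': [0.40, 0.00, 0.00, 0.50, 0.10],
    'dl_delta': [0.00, 0.60, 0.00, 0.50, 0.90],
})

def exon_skipping_mask(df, threshold):
    """True where at least one acceptor delta and one donor delta reach the threshold."""
    acceptor = (df['ag_delta'] >= threshold) | (df['al_delta'] >= threshold)
    donor = (df['dg_delta'] >= threshold) | (df['dl_delta'] >= threshold)
    return acceptor & donor

def four_way_mask(df, t):
    """The explicit four-pair form used in the notebook cells."""
    return (
        ((df['ag_delta'] >= t) & (df['dg_delta'] >= t))
        | ((df['ag_delta'] >= t) & (df['dl_delta'] >= t))
        | ((df['al_delta'] >= t) & (df['dg_delta'] >= t))
        | ((df['al_delta'] >= t) & (df['dl_delta'] >= t))
    )

mask_2 = exon_skipping_mask(toy, 0.2)
mask_5 = exon_skipping_mask(toy, 0.5)
same = bool((mask_2 == four_way_mask(toy, 0.2)).all())
print(mask_2.sum(), mask_5.sum(), same)
```

Because a variant can satisfy more than one pair, the four pairwise counts above (119, 6, 6, and 126) sum to more than the 252 rows in `exon_skipping_2`.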

In [91]:
exon_skipping_5 = data_nonzero[((data_nonzero['ag_delta'] >= 0.5) & (data_nonzero['dg_delta'] >= 0.5)) | ((data_nonzero['ag_delta'] >= 0.5) & (data_nonzero['dl_delta'] >= 0.5)) | ((data_nonzero['al_delta'] >= 0.5) & (data_nonzero['dg_delta'] >= 0.5)) | ((data_nonzero['al_delta'] >= 0.5) & (data_nonzero['dl_delta'] >= 0.5))] 

In [92]:
exon_skipping_5.head(24)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
67925,chr1,98045449,G,C,g,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,49.0,5096.0,0.01,0.0,0.02,0.0,0.01,0.02,0.0,yes,G,C,2.0,0.00318,0.0,0.00576,0.0,0.02386,0.0,0.01943,G,C,,chr1_97954921_98200373,0.00981,yes,1.0,0.01,DPYD,1.0259,-0.21667,0.77223,1.519,-0.797,0.68,0.0,0.77,0.0,0.77,44,-40,1,-31,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,yes,introgressed,introgressed
87181,chr1,162091589,G,T,g,derived,snv,0/0,0/1,0/0,0/0,False,True,False,False,Chagyrskaya,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,NOS1AP,0.67053,2.0372,0.20128,3.689,-0.187,0.51,0.0,0.71,0.0,0.71,-41,50,-2,49,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
325697,chr12,2450548,T,G,T,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,259.0,5096.0,0.05,0.0,0.0,0.18,0.02,0.0,0.185185,no,,,,,,,,,,,,,,,,no,,0.05,CACNA1C,0.49513,6.4654,0.048672,8.935,0.533,0.82,-0.0,0.84,0.0,0.84,-24,2,47,-38,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient
466947,chr14,55039647,A,G,A,derived,snv,0/0,0/1,0/0,0/1,False,True,False,True,Late Neanderthal,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,SAMD4A,0.74574,1.8422,0.084368,5.0596,-0.379,0.8,0.0,0.93,-0.0,0.93,-45,30,-1,44,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
500228,chr14,102798401,A,G,A,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,70.0,5096.0,0.01,0.04,0.0,0.0,0.01,0.02,0.0,yes,A,G,1.0,0.0,0.0,0.01153,0.03968,0.0,0.0,0.02147,A,G,,chr14_102411861_102996413,0.014536,no,,0.01,ZNF839,0.88133,0.92059,0.48075,2.3016,-0.472,0.54,0.0,0.74,0.0,0.74,-44,40,-1,24,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,introgressed,low-confidence ancient
632233,chr17,14028326,C,A,C,derived,snv,0/0,0/1,0/0,1/1,False,True,False,True,Late Neanderthal,yes,84.0,5096.0,0.02,0.02,0.02,0.0,0.01,0.03,0.0,no,,,,,,,,,,,,,,,,no,,0.02,COX10,1.0316,-0.18011,0.29042,2.7283,0.655,0.51,0.0,0.53,0.0,0.53,-35,37,-3,11,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient
758530,chr19,51729564,G,A,G,derived,snv,1/1,0/0,0/0,0/0,True,False,False,False,Altai,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,CD33,1.0277,-0.14109,0.79732,0.69761,0.462,0.22,0.93,0.0,0.54,0.93,12,1,12,48,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
767652,chr2,1640011,C,G,C,derived,snv,0/0,1/1,0/0,0/0,False,True,False,False,Chagyrskaya,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,PXDN,0.79765,2.2079,0.36899,4.6165,0.467,0.71,0.0,0.71,0.0,0.71,17,-1,-50,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
772767,chr2,11589362,C,A,A,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,2635.0,5096.0,0.52,0.61,0.38,0.56,0.57,0.47,0.55848,no,,,,,,,,,,,,,,,,no,,0.52,E2F6,0.78727,0.92156,0.29897,2.3761,-0.31,0.0,0.65,0.28,0.93,0.93,-50,46,32,1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,yes,ancient,ancient
807381,chr2,70068635,G,T,T,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,3485.0,5096.0,0.68,0.67,0.46,0.95,0.69,0.57,0.974659,no,,,,,,,,,,,,,,,,no,,0.68,GMCL1,0.79158,1.2121,0.49361,2.6716,-1.219,0.0,0.66,-0.0,0.77,0.77,6,-19,-49,49,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,21.0,yes,ancient,ancient


In [93]:
len(exon_skipping_5)

24

In [94]:
exon_skipping_2.to_csv('../dataframes/exon_skipping_2.txt', sep="\t", header = True, index = False)
exon_skipping_5.to_csv('../dataframes/exon_skipping_5.txt', sep="\t", header = True, index = False)

Let's check the distribution and allele origin of these exon skipping variants out of curiosity.

In [95]:
exon_skipping_2.groupby(['distribution']).size().to_frame('distribution')

distribution,distribution
Altai,11
Chagyrskaya,8
Denisovan,63
Late Neanderthal,9
Neanderthal,49
Other,15
Shared,88
Vindija,9


In [96]:
exon_skipping_2.groupby(['Vernot_allele_origin']).size().to_frame('origin')

Vernot_allele_origin,origin
ancient,92
archaic-specific,92
introgressed,15
low-confidence ancient,53


In [97]:
exon_skipping_2.groupby(['Browning_allele_origin']).size().to_frame('origin')

Browning_allele_origin,origin
ancient,88
archaic-specific,92
introgressed,24
low-confidence ancient,48


## SAV Genotypes <a class = 'anchor' id = 'SAVgenotypes'></a>

We examined the distribution of genotypes for all variants above; let's repeat this for SAVs.

In [98]:
data_2.groupby(['altai_gt']).size().to_frame('size')

altai_gt,size
./.,18
0/0,2227
0/1,537
1/0,1
1/1,3167


In [99]:
data_5.groupby(['altai_gt']).size().to_frame('size')

altai_gt,size
./.,4
0/0,386
0/1,101
1/1,558


In [100]:
data_2.groupby(['chagyrskaya_gt']).size().to_frame('size')

chagyrskaya_gt,size
./.,266
0/0,2202
0/1,395
1/0,2
1/1,3085


In [101]:
data_5.groupby(['chagyrskaya_gt']).size().to_frame('size')

chagyrskaya_gt,size
./.,69
0/0,394
0/1,60
1/0,1
1/1,525


In [102]:
data_2.groupby(['denisovan_gt']).size().to_frame('size')

denisovan_gt,size
./.,157
0/0,2122
0/1,584
1/0,3
1/1,3084


In [103]:
data_5.groupby(['denisovan_gt']).size().to_frame('size')

denisovan_gt,size
./.,45
0/0,376
0/1,106
1/0,1
1/1,521


In [104]:
data_2.groupby(['vindija_gt']).size().to_frame('size')

vindija_gt,size
./.,163
0/0,2160
0/1,496
1/1,3131


In [105]:
data_5.groupby(['vindija_gt']).size().to_frame('size')

vindija_gt,size
./.,35
0/0,361
0/1,108
1/1,545


## SAV Allele Origin <a class = 'anchor' id = 'SAValleleorigin'></a>

Let's consider the allele origin for our data. Let's start with Vernot.

In [106]:
data.groupby(['Vernot_allele_origin']).size().to_frame('origin')

Vernot_allele_origin,origin
ancient,681416
archaic-specific,570050
introgressed,66209
low-confidence ancient,289675


In [107]:
data_2.groupby(['Vernot_allele_origin']).size().to_frame('origin')

Vernot_allele_origin,origin
ancient,2252
archaic-specific,2343
introgressed,237
low-confidence ancient,1118


In [108]:
data_5.groupby(['Vernot_allele_origin']).size().to_frame('origin')

Vernot_allele_origin,origin
ancient,383
archaic-specific,429
introgressed,56
low-confidence ancient,181


Now for Browning.

In [109]:
data.groupby(['Browning_allele_origin']).size().to_frame('origin')

Browning_allele_origin,origin
ancient,670611
archaic-specific,570050
introgressed,95843
low-confidence ancient,270846


In [110]:
data_2.groupby(['Browning_allele_origin']).size().to_frame('origin')

Browning_allele_origin,origin
ancient,2195
archaic-specific,2343
introgressed,377
low-confidence ancient,1035


In [111]:
data_5.groupby(['Browning_allele_origin']).size().to_frame('origin')

Browning_allele_origin,origin
ancient,376
archaic-specific,429
introgressed,69
low-confidence ancient,175


Let's take a quick peek at the ancient category and add some additional parameters to test the robustness of that category. Presumably, these variants should be shared among the archaics and occur at high frequency in Africans.

In [112]:
len(data_2[(data_2['Vernot_allele_origin'] == 'ancient') & (data_2['distribution'] == 'Shared') & (data_2['1KG_non_ASW_AFR_AF'] >= 0.05)])

1380

In [113]:
1380/2252

0.6127886323268206

In [114]:
len(data_2[(data_2['Browning_allele_origin'] == 'ancient') & (data_2['distribution'] == 'Shared') & (data_2['1KG_non_ASW_AFR_AF'] >= 0.05)])

1380

In [115]:
1380/2195

0.6287015945330297

What about dropping the shared archaic part?

In [116]:
len(data_2[(data_2['Vernot_allele_origin'] == 'ancient') & (data_2['1KG_non_ASW_AFR_AF'] >= 0.05)])

2098

In [117]:
2098/2252

0.9316163410301954

In [118]:
len(data_2[(data_2['Browning_allele_origin'] == 'ancient') & (data_2['1KG_non_ASW_AFR_AF'] >= 0.05)])

2097

In [119]:
2097/2195

0.9553530751708428
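
The count-then-divide steps above can be folded into one step by taking the mean of a boolean mask within the category of interest. This is a sketch on invented toy rows; `fraction_meeting` is a hypothetical helper. Missing allele frequencies compare as `False`, which matches how the `len(...)` filters above treat them.

```python
import pandas as pd

# Toy stand-in for data_2 (values are invented).
toy = pd.DataFrame({
    'Vernot_allele_origin': ['ancient', 'ancient', 'ancient', 'introgressed', 'ancient'],
    'distribution': ['Shared', 'Shared', 'Neanderthal', 'Shared', 'Other'],
    '1KG_non_ASW_AFR_AF': [0.81, 0.02, 0.30, 0.00, None],
})

def fraction_meeting(df, origin_col, origin, condition):
    """Fraction of rows with the given origin for which `condition` is True."""
    return condition[df[origin_col] == origin].mean()

# Criteria mirroring the cells above.
common_in_afr = toy['1KG_non_ASW_AFR_AF'] >= 0.05
shared_and_common = common_in_afr & (toy['distribution'] == 'Shared')

frac_common = fraction_meeting(toy, 'Vernot_allele_origin', 'ancient', common_in_afr)
frac_shared_common = fraction_meeting(toy, 'Vernot_allele_origin', 'ancient', shared_and_common)
print(frac_common, frac_shared_common)
```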

How many introgressed variants may have evolved prior to the archaic-modern split?

In [120]:
len(data_2[(data_2['Vernot_allele_origin'] == 'introgressed') & (data_2['distribution'] == 'Shared')])

66

In [121]:
66/237

0.27848101265822783

In [122]:
len(data_2[(data_2['Browning_allele_origin'] == 'introgressed') & (data_2['distribution'] == 'Shared')])

120

In [123]:
120/377

0.3183023872679045

In [124]:
len(data_2[(data_2['Vernot_allele_origin'] == 'introgressed') & (data_2['1KG_non_ASW_AFR_AF'] >= 0.05)])

1

In [125]:
len(data_2[(data_2['Browning_allele_origin'] == 'introgressed') & (data_2['1KG_non_ASW_AFR_AF'] >= 0.05)])

2

In [126]:
len(data_2[(data_2['Vernot_allele_origin'] == 'introgressed') & (data_2['1KG_non_ASW_AFR_AF'] >= 0.01)])

2

In [127]:
len(data_2[(data_2['Browning_allele_origin'] == 'introgressed') & (data_2['1KG_non_ASW_AFR_AF'] >= 0.01)])

16

## Introgression Set Overlap <a class = 'anchor' id = 'introgressionsetoverlap'></a>

How well do Vernot et al. 2016 and Browning et al. 2018 overlap for this set of variants? Let's look at all variants and then SAVs at delta >= 0.2.

In [128]:
introgressed_vars = data[(data['Vernot_allele_origin'] == 'introgressed') | (data['Browning_allele_origin'] == 'introgressed')]
len(introgressed_vars)

112986

In [129]:
len(introgressed_vars[introgressed_vars['Vernot_allele_origin'] == 'introgressed'])

66209

In [130]:
len(introgressed_vars[introgressed_vars['Browning_allele_origin'] == 'introgressed'])

95843

In [131]:
len(introgressed_vars[(introgressed_vars['Vernot_allele_origin'] == 'introgressed') & (introgressed_vars['Browning_allele_origin'] == 'introgressed')])

49066

In [132]:
introgressed_SAVs = data_2[(data_2['Vernot_allele_origin'] == 'introgressed') | (data_2['Browning_allele_origin'] == 'introgressed')]
len(introgressed_SAVs)

447

In [133]:
len(introgressed_SAVs[introgressed_SAVs['Vernot_allele_origin'] == 'introgressed'])

237

In [134]:
len(introgressed_SAVs[introgressed_SAVs['Browning_allele_origin'] == 'introgressed'])

377

In [135]:
len(introgressed_SAVs[(introgressed_SAVs['Vernot_allele_origin'] == 'introgressed') & (introgressed_SAVs['Browning_allele_origin'] == 'introgressed')])

167
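
One way to summarize the agreement between the two call sets is a 2x2 cross-tabulation plus the fraction of variants called by both out of those called by either. This is a sketch on invented toy rows, and the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for data or data_2 (labels are invented).
toy = pd.DataFrame({
    'Vernot_allele_origin': ['introgressed', 'introgressed', 'ancient',
                             'low-confidence ancient', 'archaic-specific'],
    'Browning_allele_origin': ['introgressed', 'low-confidence ancient', 'ancient',
                               'introgressed', 'archaic-specific'],
})

in_vernot = toy['Vernot_allele_origin'] == 'introgressed'
in_browning = toy['Browning_allele_origin'] == 'introgressed'

# 2x2 table of introgressed vs. other calls in each set.
labels = {True: 'introgressed', False: 'other'}
overlap = pd.crosstab(in_vernot.map(labels).rename('Vernot'),
                      in_browning.map(labels).rename('Browning'))

# Called by both, out of called by either.
both = int((in_vernot & in_browning).sum())
either = int((in_vernot | in_browning).sum())
shared_fraction = both / either
print(overlap)
print(shared_fraction)
```

From the counts above, 49066 of 112986 variants (about 43%) and 167 of 447 SAVs (about 37%) are called introgressed by both sets.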

## SAV Distribution <a class = 'anchor' id = 'SAVdistribution'></a>

Get distribution of archaic-specific variants for Figure 1. 

In [136]:
archaic_specific = data_2[(data_2['Vernot_allele_origin'] == 'archaic-specific')]
archaic_specific.groupby(['distribution']).size().to_frame('distribution')

distribution,distribution
Altai,310
Chagyrskaya,172
Denisovan,956
Late Neanderthal,97
Neanderthal,268
Other,142
Shared,160
Vindija,238


In [137]:
archaic_specific = data_2[(data_2['Browning_allele_origin'] == 'archaic-specific')]
archaic_specific.groupby(['distribution']).size().to_frame('distribution')

distribution,distribution
Altai,310
Chagyrskaya,172
Denisovan,956
Late Neanderthal,97
Neanderthal,268
Other,142
Shared,160
Vindija,238


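Let's use parsimony to categorize the 'Other' distribution category. Each pattern below is labeled by the individuals that carry the variant: D (Denisovan), A (Altai), C (Chagyrskaya), and V (Vindija).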

In [138]:
archaic_specific_other = archaic_specific[(archaic_specific['distribution'] == 'Other')]

DAC

In [139]:
len(archaic_specific_other[(archaic_specific_other['altai_gt_boolean'] == True) & (archaic_specific_other['chagyrskaya_gt_boolean'] == True) & (archaic_specific_other['denisovan_gt_boolean'] == True) & (archaic_specific_other['vindija_gt_boolean'] == False)])

5

DAV

In [140]:
len(archaic_specific_other[(archaic_specific_other['altai_gt_boolean'] == True) & (archaic_specific_other['chagyrskaya_gt_boolean'] == False) & (archaic_specific_other['denisovan_gt_boolean'] == True) & (archaic_specific_other['vindija_gt_boolean'] == True)])

8

DCV

In [141]:
len(archaic_specific_other[(archaic_specific_other['altai_gt_boolean'] == False) & (archaic_specific_other['chagyrskaya_gt_boolean'] == True) & (archaic_specific_other['denisovan_gt_boolean'] == True) & (archaic_specific_other['vindija_gt_boolean'] == True)])

4

DA

In [142]:
len(archaic_specific_other[(archaic_specific_other['altai_gt_boolean'] == True) & (archaic_specific_other['chagyrskaya_gt_boolean'] == False) & (archaic_specific_other['denisovan_gt_boolean'] == True) & (archaic_specific_other['vindija_gt_boolean'] == False)])

21

DC

In [143]:
len(archaic_specific_other[(archaic_specific_other['altai_gt_boolean'] == False) & (archaic_specific_other['chagyrskaya_gt_boolean'] == True) & (archaic_specific_other['denisovan_gt_boolean'] == True) & (archaic_specific_other['vindija_gt_boolean'] == False)])

2

DV

In [144]:
len(archaic_specific_other[(archaic_specific_other['altai_gt_boolean'] == False) & (archaic_specific_other['chagyrskaya_gt_boolean'] == False) & (archaic_specific_other['denisovan_gt_boolean'] == True) & (archaic_specific_other['vindija_gt_boolean'] == True)])

3

AC

In [145]:
len(archaic_specific_other[(archaic_specific_other['altai_gt_boolean'] == True) & (archaic_specific_other['chagyrskaya_gt_boolean'] == True) & (archaic_specific_other['denisovan_gt_boolean'] == False) & (archaic_specific_other['vindija_gt_boolean'] == False)])

43

AV

In [146]:
len(archaic_specific_other[(archaic_specific_other['altai_gt_boolean'] == True) & (archaic_specific_other['chagyrskaya_gt_boolean'] == False) & (archaic_specific_other['denisovan_gt_boolean'] == False) & (archaic_specific_other['vindija_gt_boolean'] == True)])

56

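Every pattern that includes the Denisovan (DAC, DAV, DCV, DA, DC, and DV) is most parsimoniously placed on the archaic common ancestor, and the Neanderthal-only patterns (AC and AV) on the Neanderthal common ancestor. Let's add 43 (5+8+4+21+2+3) to the archaic common ancestor and 99 (43+56) to the Neanderthal common ancestor in Figure 1. Together these account for all 142 'Other' variants.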
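The eight filters above can also be collapsed into a single pass by building a presence code per variant and counting the codes. This is a minimal sketch on a toy dataframe, assuming the same boolean column names as `archaic_specific_other`. The `presence_code` helper and the toy rows are hypothetical. The last lines re-check the parsimony totals using the counts reported above.

```python
import pandas as pd

# Toy stand-in for archaic_specific_other (hypothetical rows).
toy = pd.DataFrame({
    'altai_gt_boolean':       [True,  True,  False, True],
    'chagyrskaya_gt_boolean': [True,  False, True,  True],
    'denisovan_gt_boolean':   [True,  False, True,  False],
    'vindija_gt_boolean':     [False, True,  False, False],
})

# One letter per individual, in the same order as the labels above (D, A, C, V).
letters = {
    'denisovan_gt_boolean': 'D',
    'altai_gt_boolean': 'A',
    'chagyrskaya_gt_boolean': 'C',
    'vindija_gt_boolean': 'V',
}

def presence_code(row):
    """Concatenate the letters of the individuals carrying the variant."""
    return ''.join(letter for col, letter in letters.items() if row[col])

toy['pattern'] = toy.apply(presence_code, axis=1)
pattern_counts = toy['pattern'].value_counts()

# Parsimony: patterns that include the Denisovan map to the archaic common
# ancestor; Neanderthal-only patterns map to the Neanderthal common ancestor.
is_archaic = toy['pattern'].str.contains('D')
print(pattern_counts.to_dict(), int(is_archaic.sum()), int((~is_archaic).sum()))

# Re-check the totals using the counts reported in the cells above.
reported = {'DAC': 5, 'DAV': 8, 'DCV': 4, 'DA': 21, 'DC': 2, 'DV': 3, 'AC': 43, 'AV': 56}
archaic_total = sum(n for code, n in reported.items() if 'D' in code)
neanderthal_total = sum(n for code, n in reported.items() if 'D' not in code)
print(archaic_total, neanderthal_total, archaic_total + neanderthal_total)
```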

# SAV Genes <a class = 'anchor' id = 'SAVgenes'></a>

## Top 20 AG, AL, DG, and DL <a class = 'anchor' id = 'top20'></a>

Now let's take a look at the top 20 variants with the highest deltas for each of the four classes.

In [147]:
data_2_ag = data_2.sort_values(['ag_delta'], ascending = [False])
data_2_ag.head(20)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
79154,chr1,144923837,A,T,,derived,snv,0/1,0/1,./.,0/1,True,True,False,True,Neanderthal,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,PDE4DIP,1.1242,-1.403,0.68045,3.1294,-0.264,1.0,0.97,0.0,0.0,1.0,-2,-16,-16,-29,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
304953,chr11,108810969,C,G,C,derived,snv,0/1,0/0,0/0,1/1,True,False,False,True,Other,yes,2.0,5096.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,,no,,0.0,DDX10,0.96014,0.30359,0.70688,1.8275,-0.175,1.0,0.01,0.0,0.0,1.0,1,4,4,5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient
4268,chr1,5935162,A,T,T,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,NPHP4,1.0229,-0.24249,0.78155,1.6669,-0.557,1.0,0.91,0.0,0.0,1.0,-2,-8,-43,-44,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
122576,chr1,218475625,T,A,A,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,436.0,5096.0,0.09,0.06,0.04,0.23,0.04,0.01,0.230019,no,,,,,,,,,,,,,,,,no,,0.09,RRP15,1.1036,-0.45285,0.62321,1.1702,-0.372,1.0,0.7,0.0,0.0,1.0,2,11,11,15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,yes,ancient,ancient
1485771,chr8,8999187,G,C,C,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,2597.0,5096.0,0.51,0.82,0.31,0.37,0.41,0.66,0.367446,no,,,,,,,,,,,,,,,,no,,0.51,PPP1R3B,1.2476,-1.1086,0.3835,1.5978,-0.216,0.99,0.95,0.0,0.0,0.99,-1,-9,-45,-22,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,7.0,yes,ancient,ancient
745753,chr19,35506729,G,A,G,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,1292.0,5096.0,0.25,0.22,0.34,0.27,0.3,0.15,0.276803,no,,,,,,,,,,,,,,,,no,,0.25,GRAMD1A,0.83225,1.2905,0.31602,4.0652,-1.373,0.99,0.01,0.0,0.0,0.99,2,-1,2,-1,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,20.0,yes,ancient,ancient
1562235,chr8,142505038,T,C,C,ancestral,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,MROH5,1.067,-0.19014,0.7872,0.38495,-0.215,0.99,0.82,0.0,0.0,0.99,-1,-9,-11,-12,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
1371237,chr7,4215403,A,G,A,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,174.0,5096.0,0.03,0.03,0.03,0.0,0.01,0.1,0.0,yes,A,G,1.0,0.00637,0.0,0.00865,0.02976,0.0338,0.0,0.10327,A,G,,chr7_4179789_4232380,0.035096,no,,0.03,SDK1,1.1614,-2.1004,0.29493,6.6984,0.533,0.99,0.03,0.0,0.0,0.99,1,4,23,14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,yes,introgressed,low-confidence ancient
1296870,chr6,53529252,C,T,C,derived,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,265.0,5096.0,0.05,0.09,0.08,0.0,0.06,0.05,0.0,no,,,,,,,,,,,,,,,,yes,1.0,0.05,KLHL31,0.88648,0.76527,0.56321,1.8683,0.563,0.99,0.04,0.0,0.0,0.99,-2,35,-1,35,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,ancient,introgressed
713683,chr18,59805532,T,C,C,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,PIGN,0.94399,0.41579,0.78236,1.4055,-0.186,0.99,0.17,0.0,0.0,0.99,-1,-2,0,-2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific


In [148]:
data_2_al = data_2.sort_values(['al_delta'], ascending = [False])
data_2_al.head(20)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
613949,chr16,85105387,A,T,A,derived,snv,0/0,./.,./.,0/1,False,False,False,True,Vindija,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,KIAA0513,1.0504,-0.28542,0.24148,3.5037,0.533,0.86,1.0,0.0,0.0,1.0,6,2,-40,31,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
1355450,chr6,160482533,A,C,A,derived,snv,0/1,0/0,0/1,0/0,True,False,True,False,Other,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,IGF2R,0.84588,2.0655,0.14162,8.7149,0.533,0.02,1.0,0.0,0.0,1.0,19,2,44,17,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
1087590,chr4,71339723,G,A,G,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,1060.0,5096.0,0.21,0.2,0.13,0.28,0.14,0.25,0.290448,no,,,,,,,,,,,,,,,,no,,0.21,MUC7,1.0655,-0.32621,1.7896,-0.77348,0.563,0.86,1.0,0.0,0.0,1.0,2,1,4,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,yes,ancient,ancient
1016305,chr3,14965507,A,G,G,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,4927.0,5096.0,0.97,1.0,0.92,1.0,0.96,0.94,1.0,no,,,,,,,,,,,,,,,,no,,0.97,FGD5,0.98381,0.16707,0.070985,6.4623,-0.684,0.49,0.99,0.0,0.0,0.99,8,2,0,15,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,27.0,yes,ancient,ancient
344092,chr12,27543028,G,A,G,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,31.0,5096.0,0.01,0.0,0.0,0.02,0.0,0.0,0.023392,no,,,,,,,,,,,,,,,,no,,0.01,ARNTL2,0.93937,0.38887,0.60443,2.1085,0.655,0.79,0.99,0.0,0.0,0.99,13,1,3,42,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient
801370,chr2,55200335,C,T,C,derived,snv,0/1,0/0,0/1,0/0,True,False,True,False,Other,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,RTN4,1.1329,-1.1176,0.22024,4.3548,0.585,0.03,0.99,0.0,0.0,0.99,-9,-1,-9,-27,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
1190907,chr5,82648943,G,A,A,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,1907.0,5096.0,0.37,0.7,0.14,0.48,0.33,0.17,0.488304,no,,,,,,,,,,,,,,,,no,,0.37,XRCC4,0.93418,0.29835,0.67533,1.2682,-2.023,0.77,0.99,0.0,0.0,0.99,7,1,32,7,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,49.0,yes,ancient,ancient
678030,chr18,633297,G,A,G,derived,snv,0/1,0/0,0/0,0/0,True,False,False,False,Altai,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,CLUL1,0.88393,0.63877,0.88322,0.48852,0.655,0.05,0.99,0.0,0.0,0.99,16,1,-47,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
1088866,chr4,72994301,G,T,G,derived,snv,0/0,0/0,0/0,0/1,False,False,False,True,Vindija,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,NPFFR2,1.1956,-1.1655,0.71099,1.1452,0.557,0.32,0.99,0.0,0.0,0.99,17,1,17,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
274559,chr11,61725616,A,C,A,derived,snv,0/1,./.,./.,0/0,True,False,False,False,Altai,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,BEST1,0.90296,0.61695,1.0059,-0.023276,0.164,0.3,0.99,0.0,0.0,0.99,45,2,-44,16,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific


In [149]:
data_2_dg = data_2.sort_values(['dg_delta'], ascending = [False])
data_2_dg.head(20)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
302165,chr11,104761100,T,C,C,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,CASP12,0.72169,1.2593,0.60318,1.3392,-0.282,0.0,0.0,1.0,0.0,1.0,46,-42,1,-3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
650160,chr17,45485046,A,G,-,.,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,760.0,5096.0,0.15,0.33,0.08,0.07,0.11,0.17,0.05848,no,,,,,,,,,,,,,,,,no,,0.15,EFCAB13,0.87046,1.0015,1.0685,-0.41674,-0.364,0.0,0.0,0.99,0.16,0.99,-27,28,-1,30,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,7.0,yes,ancient,ancient
951722,chr20,61467445,C,T,C,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,541.0,5096.0,0.11,0.01,0.05,0.32,0.03,0.03,0.32846,no,,,,,,,,,,,,,,,,no,,0.11,COL9A3,1.0508,-0.38224,0.54108,3.0041,0.313,0.0,0.0,0.99,0.1,0.99,-39,-6,-2,-31,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,ancient,ancient
1406324,chr7,56088811,T,C,T,derived,snv,0/1,0/1,0/1,0/1,True,True,True,True,Shared,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,PSPH,0.80477,0.77719,0.60223,1.1634,-1.431,0.0,0.0,0.99,0.62,0.99,1,26,1,-45,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
19257,chr1,22174518,G,T,G,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,176.0,5096.0,0.03,0.0,0.05,0.0,0.02,0.11,0.0,yes,G,T,1.0,0.00637,0.0,0.0245,0.0,0.04871,0.0,0.10838,G,T,,chr1_22145179_22179036,0.036318,yes,1.0,0.03,HSPG2,0.93887,1.1367,0.34121,9.2302,0.655,0.0,0.01,0.98,0.0,0.98,-14,38,3,-49,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,4.0,yes,introgressed,introgressed
1037455,chr3,46785355,T,C,C,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,1519.0,5096.0,0.3,0.12,0.15,0.65,0.2,0.24,0.648148,no,,,,,,,,,,,,,,,,no,,0.3,PRSS45,1.0552,-0.22203,0.84629,0.37926,-1.142,0.0,0.0,0.98,0.18,0.98,43,20,1,20,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,11.0,yes,ancient,ancient
622968,chr17,3588807,C,A,A,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,929.0,5096.0,0.18,0.0,0.0,0.67,0.04,0.0,0.685185,no,,,,,,,,,,,,,,,,no,,0.18,P2RX5,1.1321,-0.74214,0.87957,0.54528,-0.539,0.0,0.0,0.98,0.01,0.98,2,1,2,32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,20.0,yes,low-confidence ancient,low-confidence ancient
622966,chr17,3588807,C,A,A,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,929.0,5096.0,0.18,0.0,0.0,0.67,0.04,0.0,0.685185,no,,,,,,,,,,,,,,,,no,,0.18,P2RX5-TAX1BP3,,,,,-0.539,0.0,0.0,0.98,0.01,0.98,2,1,2,32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,20.0,yes,low-confidence ancient,low-confidence ancient
116082,chr1,210025062,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,DIEXF,0.9199,0.57543,0.44777,3.0589,0.533,0.0,0.0,0.97,0.0,0.97,-5,7,-5,7,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
642460,chr17,32690090,G,A,G,derived,snv,1/1,0/0,0/0,0/0,True,False,False,False,Altai,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,CCL1,0.90113,0.25103,0.89594,0.17645,-0.902,0.0,0.0,0.97,0.05,0.97,50,-25,2,15,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific


In [150]:
data_2_dl = data_2.sort_values(['dl_delta'], ascending = [False])
data_2_dl.head(20)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
555992,chr15,99514264,C,A,C,derived,snv,0/0,0/0,0/0,0/1,False,False,False,True,Vindija,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,PGPEP1L,1.2809,-1.0945,1.6829,-1.5426,0.65,0.0,0.0,0.01,1.0,1.0,-3,49,-3,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
620548,chr17,1386153,A,C,A,derived,snv,0/1,0/1,./.,./.,True,True,False,False,Other,yes,886.0,5096.0,0.17,0.13,0.24,0.17,0.17,0.16,0.154971,no,,,,,,,,,,,,,,,,no,,0.17,MYO1C,1.0699,-0.63663,0.51054,3.7014,0.334,0.0,0.0,0.92,1.0,1.0,11,19,46,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,ancient,ancient
555840,chr15,99442852,T,A,T,derived,snv,./.,0/0,0/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,IGF1R,0.73351,2.7286,0.18874,5.9942,0.46,0.04,0.0,0.0,1.0,1.0,-20,-31,-20,-2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
946023,chr20,54963211,C,G,C,derived,snv,0/0,./.,0/1,./.,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,AURKA,0.68585,1.595,0.15147,3.4993,0.585,0.0,0.26,0.41,1.0,1.0,-49,47,-49,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
1506261,chr8,33361265,C,T,C,derived,snv,0/0,0/0,0/0,0/1,False,False,False,True,Vindija,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,TTI2,1.0494,-0.28891,0.6573,1.4123,0.65,0.0,0.01,0.02,1.0,1.0,-39,36,-4,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
579166,chr16,20638576,A,T,T,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,3083.0,5096.0,0.6,0.85,0.32,0.78,0.34,0.59,0.812865,no,,,,,,,,,,,,,,,,no,,0.6,ACSM1,0.96843,0.20392,0.7786,1.2082,-0.257,0.0,0.0,0.0,0.99,0.99,47,2,-25,2,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,16.0,yes,ancient,ancient
452958,chr14,24568446,G,A,G,derived,snv,0/0,0/0,0/0,0/1,False,False,False,True,Vindija,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,PCK2,1.0283,-0.20271,1.0462,-0.24419,0.563,0.0,0.0,0.07,0.99,0.99,10,-1,10,-1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
1376872,chr7,12393974,A,C,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,4333.0,5096.0,0.85,0.81,0.9,0.81,0.91,0.86,0.803119,no,,,,,,,,,,,,,,,,no,,0.85,VWDE,1.0157,-0.15587,0.90845,0.71763,0.533,0.0,0.0,0.01,0.99,0.99,-26,34,47,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,4.0,yes,ancient,ancient
1183982,chr5,74062762,C,G,C,derived,snv,0/0,0/0,1/1,./.,False,False,True,False,Denisovan,yes,1.0,5096.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,,no,,0.0,GFM2,0.88906,0.79658,0.77687,1.3476,0.462,0.0,0.0,0.14,0.99,0.99,45,0,37,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient
758443,chr19,51535130,A,G,A,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,2725.0,5096.0,0.53,0.64,0.56,0.36,0.56,0.62,0.359649,no,,,,,,,,,,,,,,,,no,,0.53,KLK12,1.0742,-0.3296,0.75715,0.73148,0.528,0.0,0.0,0.18,0.99,0.99,-19,28,-26,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,7.0,yes,ancient,ancient
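
The four sort-and-head cells above follow the same pattern, so they can be written as one loop over the delta columns. This is a minimal sketch using `DataFrame.nlargest` on a toy dataframe. The toy values and gene names are hypothetical, and only the delta column names are taken from `data_2`.

```python
import pandas as pd

# Toy stand-in for data_2 with only the columns needed here (hypothetical values).
toy = pd.DataFrame({
    'annotation': ['G1', 'G2', 'G3', 'G4'],
    'ag_delta': [0.90, 0.10, 0.00, 0.30],
    'al_delta': [0.00, 0.80, 0.20, 0.10],
    'dg_delta': [0.05, 0.00, 0.70, 0.60],
    'dl_delta': [0.00, 0.25, 0.10, 0.95],
})

delta_cols = ['ag_delta', 'al_delta', 'dg_delta', 'dl_delta']
top_n = 2  # the notebook uses 20

# nlargest(n, col) returns the n rows with the largest values of col,
# sorted in descending order.
top_by_class = {col: toy.nlargest(top_n, col) for col in delta_cols}

for col, top in top_by_class.items():
    print(col, top['annotation'].tolist())
```

Rows tied at the cutoff may be selected or ordered differently than with `sort_values(...).head(...)`, so the two approaches are only guaranteed to agree when the deltas at the boundary are distinct.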


## N Genes <a class = 'anchor' id = 'ngenes'></a>

Now let's get a list of genes for all variants with delta >= 0.2.

In [151]:
all_genes_2 = data_2['annotation'].to_frame('gene')
all_genes_2 = all_genes_2.sort_values(['gene'], ascending = [True])
all_genes_2.head(10)

Unnamed: 0,gene
765423,A1BG
11625,AADACL4
1141655,AADAT
890176,AAMP
890173,AAMP
932716,AAR2
1291650,AARS2
643751,AATF
674279,AATK
568627,ABAT


In [152]:
len(all_genes_2)

5950

How many unique genes does this list contain?

In [153]:
len(all_genes_2.drop_duplicates())

4242

Save this gene list.

In [154]:
all_genes_2.to_csv('../genes/all_genes_2.txt', sep="\t", header=False, index=False)
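
Note that `drop_duplicates()` returns a new dataframe and its result is not assigned above, so the file written here has one row per SAV annotation rather than one row per unique gene. If a deduplicated list is wanted downstream (an assumption about how the file is used), a minimal sketch looks like this. The toy gene list and the output path are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for all_genes_2 (hypothetical gene list with one repeat).
toy_genes = pd.DataFrame({'gene': ['AAMP', 'A1BG', 'AAMP', 'AAR2']})

# drop_duplicates() returns a new frame, so assign the result before saving.
unique_genes = toy_genes.drop_duplicates().sort_values('gene')

# Hypothetical output location; the notebook itself writes to '../genes/'.
out_path = os.path.join(tempfile.mkdtemp(), 'unique_genes_example.txt')
unique_genes.to_csv(out_path, sep='\t', header=False, index=False)

# Read the file back to confirm that each gene appears once.
with open(out_path) as handle:
    written = handle.read().split()
print(written)
```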

Repeat for delta >= 0.5.

In [155]:
all_genes_5 = data_5['annotation'].to_frame('gene')
all_genes_5 = all_genes_5.sort_values(['gene'], ascending = [True])
all_genes_5.head(10)

Unnamed: 0,gene
890176,AAMP
932716,AAR2
643751,AATF
663821,ABCA5
723463,ABCA7
1463600,ABCB8
576673,ABCC1
576747,ABCC1
585863,ABCC12
576972,ABCC6


In [156]:
len(all_genes_5)

1049

In [157]:
len(all_genes_5.drop_duplicates())

962

In [158]:
all_genes_5.to_csv('../genes/all_genes_5.txt', sep="\t", header=False, index=False)

## Distribution of N Genes <a class = 'anchor' id = 'ngenedistribution'></a>

Let's get the distribution of N SAVs per gene.

In [159]:
SAVs_per_gene = data_2.groupby(by=['annotation']).size().reset_index(name='N_SAVs')
SAVs_per_gene.groupby('N_SAVs').size().to_frame('N')

N_SAVs,N
1,3111
2,769
3,246
4,71
5,19
6,14
7,6
8,1
9,2
10,1
11,2


In [160]:
SAVs_per_gene[SAVs_per_gene['N_SAVs']==7]

Unnamed: 0,annotation,N_SAVs
682,CDH13,7
686,CDH23,7
907,CSMD1,7
1059,DNAH17,7
1662,HLA-DPA1,7
1942,LAMA5,7


In [161]:
SAVs_per_gene[SAVs_per_gene['N_SAVs']==8]

Unnamed: 0,annotation,N_SAVs
108,ADARB2,8


In [162]:
SAVs_per_gene[SAVs_per_gene['N_SAVs']==9]

Unnamed: 0,annotation,N_SAVs
3198,SDK1,9
4093,WWOX,9


In [163]:
SAVs_per_gene[SAVs_per_gene['N_SAVs']==10]

Unnamed: 0,annotation,N_SAVs
1663,HLA-DPB1,10


In [164]:
SAVs_per_gene[SAVs_per_gene['N_SAVs']==11]

Unnamed: 0,annotation,N_SAVs
829,CNTNAP2,11
3508,SSPO,11
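
The per-gene counts in this section can be cross-checked against the gene list lengths from the previous section. The sketch below uses only numbers reported above: the number of genes with each SAV count, including the two genes with 11 SAVs. The total number of genes should equal the 4242 unique genes, and the total number of SAVs should equal the 5950 rows in `all_genes_2`.

```python
# Number of genes carrying exactly k SAVs, as reported in this section
# (the two genes with 11 SAVs come from the last lookup above).
genes_by_n_savs = {1: 3111, 2: 769, 3: 246, 4: 71, 5: 19, 6: 14,
                   7: 6, 8: 1, 9: 2, 10: 1, 11: 2}

# Total genes: one per entry across all SAV counts.
n_genes = sum(genes_by_n_savs.values())

# Total SAVs: each gene contributes k SAVs.
n_savs = sum(k * n for k, n in genes_by_n_savs.items())

# SAVs that fall in genes with two or more SAVs.
n_savs_multi = n_savs - genes_by_n_savs[1]

print(n_genes, n_savs, n_savs_multi)
```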


Let's take a closer look at the genes with multiple SAVs. 

How often do SAVs in genes with >= 2 SAVs share a distribution or allele origin with another SAV in the same gene?

In [165]:
multiple_SAVs_per_gene = data_2[data_2.duplicated(subset=['annotation','distribution'])]
multiple_SAVs_per_gene.head(10)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
239,chr1,982099,T,C,T,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,AGRN,0.98257,0.22552,0.31914,6.0143,-0.375,0.0,0.0,0.04,0.48,0.48,-48,-6,16,-6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
1205,chr1,1872053,T,C,T,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,no,,,,,,,,,,no,,,,,,,,,,,,,,,,yes,1.0,,CFAP74,,,,,-0.669,0.22,0.0,0.0,0.0,0.22,49,-2,49,-2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
1920,chr1,2526280,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,MMEL1,0.98394,0.12602,0.76824,1.5689,-0.242,0.0,0.01,0.03,0.25,0.25,-37,39,-37,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
3387,chr1,3520127,T,C,C,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,4811.0,5096.0,0.94,0.98,0.97,0.86,0.99,0.95,0.85575,no,,,,,,,,,,,,,,,,no,,0.94,MEGF6,0.95847,0.45803,0.73856,2.2728,-1.183,0.0,0.0,0.0,0.24,0.24,-28,47,1,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,ancient,ancient
7161,chr1,7731131,G,A,G,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,CAMTA1,0.71229,3.2619,0.091344,7.371,0.655,0.0,0.0,0.02,0.71,0.71,48,-5,-34,-5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
15433,chr1,17715723,G,C,G,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,PADI6,,,,,0.561,0.0,0.0,0.0,0.3,0.3,3,-5,16,-5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
22169,chr1,26359246,T,A,A,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,4456.0,5096.0,0.87,0.93,0.89,0.81,0.93,0.83,0.810916,no,,,,,,,,,,,,,,,,no,,0.87,EXTL1,0.87054,0.9038,0.75162,1.2166,-0.527,0.0,0.0,0.04,0.61,0.61,-50,-2,10,-2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,yes,ancient,ancient
22347,chr1,26584637,G,A,G,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,1.0,5096.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,yes,G,A,1.0,0.0,0.0,0.0,0.0,0.00099,0.0,0.0,A,A,,chr1_26583921_26646198,0.000198,yes,1.0,0.0,CEP85,0.96911,0.22411,0.33885,3.938,0.563,0.0,0.26,0.0,0.0,0.26,3,1,-4,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,introgressed,introgressed
34345,chr1,46522577,A,C,C,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,3241.0,5096.0,0.64,0.68,0.73,0.59,0.62,0.56,0.583821,no,,,,,,,,,,,,,,,,no,,0.64,PIK3R3,0.99209,0.044445,0.68427,1.5417,-2.533,0.0,0.0,0.43,0.04,0.43,22,-4,0,-4,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,19.0,yes,ancient,ancient
34678,chr1,46877284,T,G,G,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,42.0,5096.0,0.01,0.02,0.0,0.01,0.0,0.02,0.004873,no,,,,,,,,,,,,,,,,yes,1.0,0.01,FAAH,0.89767,0.63763,0.72544,1.3359,-0.121,0.28,0.0,0.03,0.0,0.28,0,-49,39,-24,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,introgressed


In [166]:
len(multiple_SAVs_per_gene)

591

In [167]:
591/2839

0.20817189151109547

In [168]:
multiple_SAVs_per_gene = data_2[data_2.duplicated(subset=['annotation','Vernot_allele_origin'])]
multiple_SAVs_per_gene.head(10)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
233,chr1,980460,G,A,G,derived,snv,1/1,./.,1/1,1/1,True,False,True,True,Other,yes,4642.0,5096.0,0.91,1.0,0.93,0.77,0.93,0.97,0.781676,no,,,,,,,,,,,,,,,,no,,0.91,AGRN,0.98257,0.22552,0.31914,6.0143,0.561,0.0,0.0,0.03,0.43,0.43,30,-33,30,0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,14.0,yes,ancient,ancient
494,chr1,1226233,G,A,g,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,211.0,5096.0,0.04,0.0,0.0,0.15,0.01,0.0,0.160819,no,,,,,,,,,,,,,,,,no,,0.04,SCNN1D,1.1005,-0.78105,1.2373,-1.3261,0.299,0.4,0.01,0.0,0.0,0.4,-10,42,41,-44,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,yes,low-confidence ancient,low-confidence ancient
1205,chr1,1872053,T,C,T,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,no,,,,,,,,,,no,,,,,,,,,,,,,,,,yes,1.0,,CFAP74,,,,,-0.669,0.22,0.0,0.0,0.0,0.22,49,-2,49,-2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
3314,chr1,3498077,C,T,C,derived,snv,0/0,1/1,0/0,0/0,False,True,False,False,Chagyrskaya,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,MEGF6,0.95847,0.45803,0.73856,2.2728,-1.579,0.0,0.0,0.3,0.0,0.3,-2,-44,-1,-44,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
3387,chr1,3520127,T,C,C,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,4811.0,5096.0,0.94,0.98,0.97,0.86,0.99,0.95,0.85575,no,,,,,,,,,,,,,,,,no,,0.94,MEGF6,0.95847,0.45803,0.73856,2.2728,-1.183,0.0,0.0,0.0,0.24,0.24,-28,47,1,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,ancient,ancient
4268,chr1,5935162,A,T,T,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,NPHP4,1.0229,-0.24249,0.78155,1.6669,-0.557,1.0,0.91,0.0,0.0,1.0,-2,-8,-43,-44,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
4486,chr1,6104558,T,A,T,derived,snv,0/0,0/0,0/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,KCNAB2,0.55002,2.5963,0.18315,3.955,0.377,0.0,0.0,0.25,0.0,0.25,-4,32,-4,-36,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
6787,chr1,7568548,C,T,T,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,CAMTA1,0.71229,3.2619,0.091344,7.371,-1.546,0.0,0.32,0.0,0.21,0.32,-2,-42,-6,36,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
7161,chr1,7731131,G,A,G,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,CAMTA1,0.71229,3.2619,0.091344,7.371,0.655,0.0,0.0,0.02,0.71,0.71,48,-5,-34,-5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
9162,chr1,10231046,G,T,G,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,UBE4B,0.62481,3.726,0.074661,7.0171,0.591,0.35,0.0,0.01,0.0,0.35,12,36,30,-4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific


In [169]:
len(multiple_SAVs_per_gene)

824

In [170]:
824/2839

0.29024304332511447

In [171]:
data_2.groupby(by=['Vernot_introgressed','Vernot_allele_origin']).size().reset_index(name='count')

Unnamed: 0,Vernot_introgressed,Vernot_allele_origin,count
0,no,ancient,2252
1,no,archaic-specific,2321
2,no,low-confidence ancient,1116
3,yes,archaic-specific,22
4,yes,introgressed,237
5,yes,low-confidence ancient,2


In [172]:
data_2.groupby(by=['present_in_1KG','Vernot_allele_origin']).size().reset_index(name='count')

Unnamed: 0,present_in_1KG,Vernot_allele_origin,count
0,no,archaic-specific,2343
1,no,low-confidence ancient,82
2,yes,ancient,2252
3,yes,introgressed,237
4,yes,low-confidence ancient,1036


In [173]:
data_2.groupby(by=['Browning_introgressed','Browning_allele_origin']).size().reset_index(name='count')

Unnamed: 0,Browning_introgressed,Browning_allele_origin,count
0,no,ancient,2195
1,no,archaic-specific,2331
2,no,low-confidence ancient,1032
3,yes,archaic-specific,12
4,yes,introgressed,377
5,yes,low-confidence ancient,3


In [174]:
data_2.groupby(by=['present_in_1KG','Browning_allele_origin']).size().reset_index(name='count')

Unnamed: 0,present_in_1KG,Browning_allele_origin,count
0,no,archaic-specific,2343
1,no,low-confidence ancient,82
2,yes,ancient,2195
3,yes,introgressed,377
4,yes,low-confidence ancient,953


## N Genes by Origin <a class = 'anchor' id = 'ngenesbyorigin'></a>

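Let's get gene lists by allele origin, starting with archaic-specific variants. Our current filtering has resulted in complete agreement between Vernot and Browning for archaic-specific variants (2,343 variants under either classification), so one archaic-specific list is enough. We only need separate Vernot and Browning lists for ancient and introgressed alleles.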

In [175]:
archaic_specific_genes_2 = data_2.loc[(data_2['Vernot_allele_origin']=='archaic-specific'), 'annotation'].to_frame('gene')
archaic_specific_genes_2 = archaic_specific_genes_2.sort_values(['gene'], ascending = [True])
archaic_specific_genes_2.to_csv('../genes/archaic_specific_genes_2.txt', sep="\t", header=False, index=False)

In [176]:
archaic_specific_genes_5 = data_5.loc[(data_5['Vernot_allele_origin']=='archaic-specific'), 'annotation'].to_frame('gene')
archaic_specific_genes_5 = archaic_specific_genes_5.sort_values(['gene'], ascending = [True])
archaic_specific_genes_5.to_csv('../genes/archaic_specific_genes_5.txt', sep="\t", header=False, index=False)
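
The two cells above, and most of the cells that follow in this section, repeat the same filter and sort steps before writing to file. As a minimal sketch, that pattern could be wrapped in a small helper. The function name `genes_by_origin` and the toy dataframe are hypothetical and only for illustration; the column names `annotation` and `Vernot_allele_origin` come from the dataframe above.

```python
import pandas as pd

def genes_by_origin(df, origin_col, origin):
    """Return a sorted one-column DataFrame of genes (repeats kept) for one allele-origin class."""
    genes = df.loc[df[origin_col] == origin, 'annotation'].to_frame('gene')
    return genes.sort_values('gene')

# Toy stand-in for data_2 (hypothetical rows; the real dataframe has many more columns).
toy = pd.DataFrame({
    'annotation': ['MEGF6', 'AGRN', 'AGRN', 'CEP85'],
    'Vernot_allele_origin': ['archaic-specific', 'archaic-specific',
                             'archaic-specific', 'introgressed'],
})

archaic = genes_by_origin(toy, 'Vernot_allele_origin', 'archaic-specific')
print(archaic['gene'].tolist())   # ['AGRN', 'AGRN', 'MEGF6']
print(archaic['gene'].nunique())  # 2
```

Repeats are kept on purpose, mirroring the files written above, so that genes with multiple variants still appear once per variant.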

How many unique genes are contained in the archaic-specific gene lists?

In [177]:
len(archaic_specific_genes_2.drop_duplicates())

2012

In [178]:
len(archaic_specific_genes_5.drop_duplicates())

411

And for ancient and introgressed alleles.

In [179]:
Vernot_ancient_genes_2 = data_2.loc[(data_2['Vernot_allele_origin']=='ancient'), 'annotation'].to_frame('gene')
Vernot_ancient_genes_2 = Vernot_ancient_genes_2.sort_values(['gene'], ascending = [True])
Vernot_ancient_genes_2.to_csv('../genes/Vernot_ancient_genes_2.txt', sep="\t", header=False, index=False)

In [180]:
Vernot_ancient_genes_5 = data_5.loc[(data_5['Vernot_allele_origin']=='ancient'), 'annotation'].to_frame('gene')
Vernot_ancient_genes_5 = Vernot_ancient_genes_5.sort_values(['gene'], ascending = [True])
Vernot_ancient_genes_5.to_csv('../genes/Vernot_ancient_genes_5.txt', sep="\t", header=False, index=False)

In [181]:
Browning_ancient_genes_2 = data_2.loc[(data_2['Browning_allele_origin']=='ancient'), 'annotation'].to_frame('gene')
Browning_ancient_genes_2 = Browning_ancient_genes_2.sort_values(['gene'], ascending = [True])
Browning_ancient_genes_2.to_csv('../genes/Browning_ancient_genes_2.txt', sep="\t", header=False, index=False)

In [182]:
Browning_ancient_genes_5 = data_5.loc[(data_5['Browning_allele_origin']=='ancient'), 'annotation'].to_frame('gene')
Browning_ancient_genes_5 = Browning_ancient_genes_5.sort_values(['gene'], ascending = [True])
Browning_ancient_genes_5.to_csv('../genes/Browning_ancient_genes_5.txt', sep="\t", header=False, index=False)

Check the number of ancient genes.

In [183]:
len(Vernot_ancient_genes_2.drop_duplicates())

1896

In [184]:
len(Vernot_ancient_genes_5.drop_duplicates())

370

In [185]:
len(Browning_ancient_genes_2.drop_duplicates())

1856

In [186]:
len(Browning_ancient_genes_5.drop_duplicates())

363

In [187]:
Vernot_introgressed_genes_2 = data_2.loc[(data_2['Vernot_allele_origin']=='introgressed'), 'annotation'].to_frame('gene')
Vernot_introgressed_genes_2 = Vernot_introgressed_genes_2.sort_values(['gene'], ascending = [True])
Vernot_introgressed_genes_2.to_csv('../genes/Vernot_introgressed_genes_2.txt', sep="\t", header=False, index=False)

In [188]:
Vernot_introgressed_genes_5 = data_5.loc[(data_5['Vernot_allele_origin']=='introgressed'), 'annotation'].to_frame('gene')
Vernot_introgressed_genes_5 = Vernot_introgressed_genes_5.sort_values(['gene'], ascending = [True])
Vernot_introgressed_genes_5.to_csv('../genes/Vernot_introgressed_genes_5.txt', sep="\t", header=False, index=False)

In [189]:
Browning_introgressed_genes_2 = data_2.loc[(data_2['Browning_allele_origin']=='introgressed'), 'annotation'].to_frame('gene')
Browning_introgressed_genes_2 = Browning_introgressed_genes_2.sort_values(['gene'], ascending = [True])
Browning_introgressed_genes_2.to_csv('../genes/Browning_introgressed_genes_2.txt', sep="\t", header=False, index=False)

In [190]:
Browning_introgressed_genes_5 = data_5.loc[(data_5['Browning_allele_origin']=='introgressed'), 'annotation'].to_frame('gene')
Browning_introgressed_genes_5 = Browning_introgressed_genes_5.sort_values(['gene'], ascending = [True])
Browning_introgressed_genes_5.to_csv('../genes/Browning_introgressed_genes_5.txt', sep="\t", header=False, index=False)

Check the number of introgressed genes.

In [191]:
len(Vernot_introgressed_genes_2.drop_duplicates())

232

In [192]:
len(Vernot_introgressed_genes_5.drop_duplicates())

56

In [193]:
len(Browning_introgressed_genes_2.drop_duplicates())

353

In [194]:
len(Browning_introgressed_genes_5.drop_duplicates())

65
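
The unique gene counts above can also be collected into a single table, which makes the Vernot and Browning classifications easier to compare side by side. This is a sketch on a toy dataframe with hypothetical rows; only the column names are taken from the dataframe used in this notebook.

```python
import pandas as pd

# Toy stand-in for data_2 (hypothetical rows).
toy = pd.DataFrame({
    'annotation': ['AGRN', 'AGRN', 'MEGF6', 'CEP85', 'FAAH', 'FAAH'],
    'Vernot_allele_origin': ['ancient', 'archaic-specific', 'ancient',
                             'introgressed', 'introgressed', 'introgressed'],
    'Browning_allele_origin': ['ancient', 'archaic-specific', 'ancient',
                               'introgressed', 'introgressed', 'ancient'],
})

# Count unique genes per origin class under each classification.
summary = pd.DataFrame({
    'Vernot': toy.groupby('Vernot_allele_origin')['annotation'].nunique(),
    'Browning': toy.groupby('Browning_allele_origin')['annotation'].nunique(),
})
print(summary)
```

Note that a gene can be counted under more than one origin class if it carries variants of different origins, so the rows do not need to sum to the total number of genes.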

## N Genes by Distribution <a class = 'anchor' id = 'ngenesbydistribution'></a>

In [195]:
altai_genes = data_2.loc[(data_2['altai_gt_boolean']==True), 'annotation'].to_frame('gene')
chagyrskaya_genes = data_2.loc[(data_2['chagyrskaya_gt_boolean']==True), 'annotation'].to_frame('gene')
denisovan_genes = data_2.loc[(data_2['denisovan_gt_boolean']==True), 'annotation'].to_frame('gene')
vindija_genes = data_2.loc[(data_2['vindija_gt_boolean']==True), 'annotation'].to_frame('gene')

In [196]:
altai_genes = altai_genes.drop_duplicates()
chagyrskaya_genes = chagyrskaya_genes.drop_duplicates()
denisovan_genes = denisovan_genes.drop_duplicates()
vindija_genes = vindija_genes.drop_duplicates()

In [197]:
len(altai_genes)

2914

In [198]:
len(chagyrskaya_genes)

2755

In [199]:
len(denisovan_genes)

2910

In [200]:
len(vindija_genes)

2889
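
The same per-individual counts can be produced in one pass by looping over the four genotype boolean columns. This is a sketch on a toy dataframe with hypothetical rows; the `*_gt_boolean` column names follow the dataframe above.

```python
import pandas as pd

# Toy stand-in for data_2 (hypothetical rows).
toy = pd.DataFrame({
    'annotation': ['AGRN', 'AGRN', 'MEGF6', 'CAMTA1'],
    'altai_gt_boolean': [True, True, False, False],
    'chagyrskaya_gt_boolean': [False, True, True, False],
    'denisovan_gt_boolean': [True, False, False, True],
    'vindija_gt_boolean': [True, True, True, False],
})

individuals = ['altai', 'chagyrskaya', 'denisovan', 'vindija']

# Number of unique genes with at least one variant present in each individual.
n_genes = {
    name: toy.loc[toy[f'{name}_gt_boolean'], 'annotation'].nunique()
    for name in individuals
}
print(n_genes)
```

One difference to keep in mind: `nunique()` ignores missing values, whereas `drop_duplicates()` keeps a single missing entry as its own row, so the two approaches can differ by one if any variant lacks an annotation.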

## Archaic-Specific Genes <a class = 'anchor' id = 'archaicspecificgenes'></a>

Now let's dig into the various archaic-specific distributions.

In [201]:
archaic_specific_altai_genes_2 = data_2.loc[(data_2['distribution']=='Altai') & (data_2['Vernot_allele_origin']=='archaic-specific'), 'annotation'].to_frame('gene')
archaic_specific_altai_genes_2 = archaic_specific_altai_genes_2.sort_values(['gene'], ascending = [True])

archaic_specific_chagyrskaya_genes_2 = data_2.loc[(data_2['distribution']=='Chagyrskaya') & (data_2['Vernot_allele_origin']=='archaic-specific'), 'annotation'].to_frame('gene')
archaic_specific_chagyrskaya_genes_2 = archaic_specific_chagyrskaya_genes_2.sort_values(['gene'], ascending = [True])

archaic_specific_denisovan_genes_2 = data_2.loc[(data_2['distribution']=='Denisovan') & (data_2['Vernot_allele_origin']=='archaic-specific'), 'annotation'].to_frame('gene')
archaic_specific_denisovan_genes_2 = archaic_specific_denisovan_genes_2.sort_values(['gene'], ascending = [True])

archaic_specific_vindija_genes_2 = data_2.loc[(data_2['distribution']=='Vindija') & (data_2['Vernot_allele_origin']=='archaic-specific'), 'annotation'].to_frame('gene')
archaic_specific_vindija_genes_2 = archaic_specific_vindija_genes_2.sort_values(['gene'], ascending = [True])

In [202]:
archaic_specific_late_neanderthal_genes_2 = data_2.loc[(data_2['distribution']=='Late Neanderthal') & (data_2['Vernot_allele_origin']=='archaic-specific'), 'annotation'].to_frame('gene')
archaic_specific_late_neanderthal_genes_2 = archaic_specific_late_neanderthal_genes_2.sort_values(['gene'], ascending = [True])

archaic_specific_neanderthal_genes_2 = data_2.loc[(data_2['distribution']=='Neanderthal') & (data_2['Vernot_allele_origin']=='archaic-specific'), 'annotation'].to_frame('gene')
archaic_specific_neanderthal_genes_2 = archaic_specific_neanderthal_genes_2.sort_values(['gene'], ascending = [True])

archaic_specific_shared_genes_2 = data_2.loc[(data_2['distribution']=='Shared') & (data_2['Vernot_allele_origin']=='archaic-specific'), 'annotation'].to_frame('gene')
archaic_specific_shared_genes_2 = archaic_specific_shared_genes_2.sort_values(['gene'], ascending = [True])

In [203]:
archaic_specific_altai_genes_2.to_csv('../genes/archaic_specific_altai_genes_2.txt', sep="\t", header=False, index=False)
archaic_specific_chagyrskaya_genes_2.to_csv('../genes/archaic_specific_chagyrskaya_genes_2.txt', sep="\t", header=False, index=False)
archaic_specific_denisovan_genes_2.to_csv('../genes/archaic_specific_denisovan_genes_2.txt', sep="\t", header=False, index=False)
archaic_specific_vindija_genes_2.to_csv('../genes/archaic_specific_vindija_genes_2.txt', sep="\t", header=False, index=False)
archaic_specific_late_neanderthal_genes_2.to_csv('../genes/archaic_specific_late_neanderthal_genes_2.txt', sep="\t", header=False, index=False)
archaic_specific_neanderthal_genes_2.to_csv('../genes/archaic_specific_neanderthal_genes_2.txt', sep="\t", header=False, index=False)
archaic_specific_shared_genes_2.to_csv('../genes/archaic_specific_shared_genes_2.txt', sep="\t", header=False, index=False)

How many unique genes are contained in the lineage-specific gene lists?

In [204]:
archaic_specific_altai_genes_2_no_dups = archaic_specific_altai_genes_2.drop_duplicates()
len(archaic_specific_altai_genes_2_no_dups)

292

In [205]:
archaic_specific_chagyrskaya_genes_2_no_dups = archaic_specific_chagyrskaya_genes_2.drop_duplicates()
len(archaic_specific_chagyrskaya_genes_2_no_dups)

167

In [206]:
archaic_specific_denisovan_genes_2_no_dups = archaic_specific_denisovan_genes_2.drop_duplicates()
len(archaic_specific_denisovan_genes_2_no_dups)

886

In [207]:
archaic_specific_vindija_genes_2_no_dups = archaic_specific_vindija_genes_2.drop_duplicates()
len(archaic_specific_vindija_genes_2_no_dups)

234

In [208]:
archaic_specific_late_neanderthal_genes_2_no_dups = archaic_specific_late_neanderthal_genes_2.drop_duplicates()
len(archaic_specific_late_neanderthal_genes_2_no_dups)

96

In [209]:
archaic_specific_neanderthal_genes_2_no_dups = archaic_specific_neanderthal_genes_2.drop_duplicates()
len(archaic_specific_neanderthal_genes_2_no_dups)

257

In [210]:
archaic_specific_shared_genes_2_no_dups = archaic_specific_shared_genes_2.drop_duplicates()
len(archaic_specific_shared_genes_2_no_dups)

152

Repeat for delta >= 0.5.

In [211]:
archaic_specific_altai_genes_5 = data_5.loc[(data_5['distribution']=='Altai') & (data_5['Vernot_allele_origin']=='archaic-specific'), 'annotation'].to_frame('gene')
archaic_specific_altai_genes_5 = archaic_specific_altai_genes_5.sort_values(['gene'], ascending = [True])

archaic_specific_chagyrskaya_genes_5 = data_5.loc[(data_5['distribution']=='Chagyrskaya') & (data_5['Vernot_allele_origin']=='archaic-specific'), 'annotation'].to_frame('gene')
archaic_specific_chagyrskaya_genes_5 = archaic_specific_chagyrskaya_genes_5.sort_values(['gene'], ascending = [True])

archaic_specific_denisovan_genes_5 = data_5.loc[(data_5['distribution']=='Denisovan') & (data_5['Vernot_allele_origin']=='archaic-specific'), 'annotation'].to_frame('gene')
archaic_specific_denisovan_genes_5 = archaic_specific_denisovan_genes_5.sort_values(['gene'], ascending = [True])

archaic_specific_vindija_genes_5 = data_5.loc[(data_5['distribution']=='Vindija') & (data_5['Vernot_allele_origin']=='archaic-specific'), 'annotation'].to_frame('gene')
archaic_specific_vindija_genes_5 = archaic_specific_vindija_genes_5.sort_values(['gene'], ascending = [True])

In [212]:
archaic_specific_late_neanderthal_genes_5 = data_5.loc[(data_5['distribution']=='Late Neanderthal') & (data_5['Vernot_allele_origin']=='archaic-specific'), 'annotation'].to_frame('gene')
archaic_specific_late_neanderthal_genes_5 = archaic_specific_late_neanderthal_genes_5.sort_values(['gene'], ascending = [True])

archaic_specific_neanderthal_genes_5 = data_5.loc[(data_5['distribution']=='Neanderthal') & (data_5['Vernot_allele_origin']=='archaic-specific'), 'annotation'].to_frame('gene')
archaic_specific_neanderthal_genes_5 = archaic_specific_neanderthal_genes_5.sort_values(['gene'], ascending = [True])

archaic_specific_shared_genes_5 = data_5.loc[(data_5['distribution']=='Shared') & (data_5['Vernot_allele_origin']=='archaic-specific'), 'annotation'].to_frame('gene')
archaic_specific_shared_genes_5 = archaic_specific_shared_genes_5.sort_values(['gene'], ascending = [True])

In [213]:
archaic_specific_altai_genes_5.to_csv('../genes/archaic_specific_altai_genes_5.txt', sep='\t', header=False, index=False)
archaic_specific_chagyrskaya_genes_5.to_csv('../genes/archaic_specific_chagyrskaya_genes_5.txt', sep='\t', header=False, index=False)
archaic_specific_denisovan_genes_5.to_csv('../genes/archaic_specific_denisovan_genes_5.txt', sep='\t', header=False, index=False)
archaic_specific_vindija_genes_5.to_csv('../genes/archaic_specific_vindija_genes_5.txt', sep='\t', header=False, index=False)
archaic_specific_late_neanderthal_genes_5.to_csv('../genes/archaic_specific_late_neanderthal_genes_5.txt', sep='\t', header=False, index=False)
archaic_specific_neanderthal_genes_5.to_csv('../genes/archaic_specific_neanderthal_genes_5.txt', sep='\t', header=False, index=False)
archaic_specific_shared_genes_5.to_csv('../genes/archaic_specific_shared_genes_5.txt', sep='\t', header=False, index=False)

In [214]:
len(archaic_specific_altai_genes_5.drop_duplicates())

52

In [215]:
len(archaic_specific_chagyrskaya_genes_5.drop_duplicates())

27

In [216]:
len(archaic_specific_denisovan_genes_5.drop_duplicates())

168

In [217]:
len(archaic_specific_vindija_genes_5.drop_duplicates())

57

In [218]:
len(archaic_specific_late_neanderthal_genes_5.drop_duplicates())

10

In [219]:
len(archaic_specific_neanderthal_genes_5.drop_duplicates())

42

In [220]:
len(archaic_specific_shared_genes_5.drop_duplicates())

32
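
The per-lineage counts at both thresholds can be gathered into one table with a grouped count. The sketch below assumes that `data_2` and `data_5` correspond to `delta_max >= 0.2` and `delta_max >= 0.5`, respectively; the helper name `genes_per_distribution` and the toy rows are hypothetical.

```python
import pandas as pd

# Toy stand-in for the full dataframe (hypothetical rows).
toy = pd.DataFrame({
    'annotation': ['AGRN', 'MEGF6', 'MEGF6', 'CAMTA1', 'UBE4B', 'NPHP4'],
    'distribution': ['Shared', 'Chagyrskaya', 'Chagyrskaya',
                     'Denisovan', 'Denisovan', 'Shared'],
    'Vernot_allele_origin': ['archaic-specific'] * 5 + ['ancient'],
    'delta_max': [0.48, 0.30, 0.55, 0.71, 0.35, 1.0],
})

def genes_per_distribution(df, min_delta):
    """Unique archaic-specific genes per distribution at a delta threshold."""
    keep = (df['Vernot_allele_origin'] == 'archaic-specific') & (df['delta_max'] >= min_delta)
    return df[keep].groupby('distribution')['annotation'].nunique()

counts = pd.DataFrame({
    'delta>=0.2': genes_per_distribution(toy, 0.2),
    'delta>=0.5': genes_per_distribution(toy, 0.5),
}).fillna(0).astype(int)  # distributions absent at a threshold become 0
print(counts)
```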

## Gene Overlap <a class = 'anchor' id = 'geneoverlap'></a>

Do genes overlap in the archaic-specific sets?

In [221]:
neanderthal_altai_genes = pd.merge(archaic_specific_neanderthal_genes_2_no_dups, archaic_specific_altai_genes_2_no_dups)
len(neanderthal_altai_genes)

5

In [222]:
neanderthal_chagyrskaya_genes = pd.merge(archaic_specific_neanderthal_genes_2_no_dups, archaic_specific_chagyrskaya_genes_2_no_dups)
len(neanderthal_chagyrskaya_genes)

3

In [223]:
neanderthal_denisovan_genes = pd.merge(archaic_specific_neanderthal_genes_2_no_dups, archaic_specific_denisovan_genes_2_no_dups)
len(neanderthal_denisovan_genes)

26

In [224]:
neanderthal_vindija_genes = pd.merge(archaic_specific_neanderthal_genes_2_no_dups, archaic_specific_vindija_genes_2_no_dups)
len(neanderthal_vindija_genes)

4

In [225]:
altai_chagyrskaya_genes = pd.merge(archaic_specific_altai_genes_2_no_dups, archaic_specific_chagyrskaya_genes_2_no_dups)
len(altai_chagyrskaya_genes)

9

In [226]:
altai_denisovan_genes = pd.merge(archaic_specific_altai_genes_2_no_dups, archaic_specific_denisovan_genes_2_no_dups)
len(altai_denisovan_genes)

34

In [227]:
altai_vindija_genes = pd.merge(archaic_specific_altai_genes_2_no_dups, archaic_specific_vindija_genes_2_no_dups)
len(altai_vindija_genes)

4

In [228]:
chagyrskaya_denisovan_genes = pd.merge(archaic_specific_chagyrskaya_genes_2_no_dups, archaic_specific_denisovan_genes_2_no_dups)
len(chagyrskaya_denisovan_genes)

21

In [229]:
chagyrskaya_vindija_genes = pd.merge(archaic_specific_chagyrskaya_genes_2_no_dups, archaic_specific_vindija_genes_2_no_dups)
len(chagyrskaya_vindija_genes)

6

In [230]:
denisovan_vindija_genes = pd.merge(archaic_specific_denisovan_genes_2_no_dups, archaic_specific_vindija_genes_2_no_dups)
len(denisovan_vindija_genes)

20
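
The pairwise comparisons above can also be written as a loop over all pairs of gene sets. This sketch uses Python sets and `itertools.combinations` instead of `pd.merge`; the gene sets shown are hypothetical toy values, not the real lists.

```python
from itertools import combinations

# Toy gene sets (hypothetical); in the notebook these would come from the
# deduplicated 'gene' columns, e.g. set(archaic_specific_altai_genes_2_no_dups['gene']).
gene_sets = {
    'altai': {'AGRN', 'MEGF6', 'CAMTA1'},
    'chagyrskaya': {'MEGF6', 'UBE4B'},
    'denisovan': {'CAMTA1', 'UBE4B', 'KCNAB2'},
    'vindija': {'AGRN'},
}

# Size of the intersection for every unordered pair of lineages.
overlaps = {
    (a, b): len(gene_sets[a] & gene_sets[b])
    for a, b in combinations(gene_sets, 2)
}
for pair, n in overlaps.items():
    print(pair, n)
```

For deduplicated single-column frames, the set intersection size matches the row count of the inner merge, and the loop guarantees that no pair is skipped.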

# Comparisons <a class = 'anchor' id = 'comparisons'></a>

## SAV Genes and DR Genes <a class = 'anchor' id = 'drgenes'></a>

Let's take a second to compare genes with SAVs to the divergently regulated (DR) genes from Colbran et al. 2019.

In [231]:
altai_genes = data_2.loc[(data_2['altai_gt_boolean']==True), 'annotation'].to_frame('gene')
denisovan_genes = data_2.loc[(data_2['denisovan_gt_boolean']==True), 'annotation'].to_frame('gene')
vindija_genes = data_2.loc[(data_2['vindija_gt_boolean']==True), 'annotation'].to_frame('gene')

In [232]:
altai_genes = altai_genes.drop_duplicates()
denisovan_genes = denisovan_genes.drop_duplicates()
vindija_genes = vindija_genes.drop_duplicates()

In [233]:
len(altai_genes)

2914

In [234]:
len(denisovan_genes)

2910

In [235]:
len(vindija_genes)

2889

In [236]:
altai_DR_genes = pd.read_csv('../DR_genes/altai_DR_genes.txt', sep = '\t', names = ['gene'])
denisovan_DR_genes = pd.read_csv('../DR_genes/denisovan_DR_genes.txt', sep = '\t', names = ['gene'])
vindija_DR_genes = pd.read_csv('../DR_genes/vindija_DR_genes.txt', sep = '\t', names = ['gene'])

In [237]:
len(altai_DR_genes)

1419

In [238]:
len(denisovan_DR_genes)

1171

In [239]:
len(vindija_DR_genes)

1484

In [240]:
altai_DR_overlap = pd.merge(altai_genes, altai_DR_genes)
len(altai_DR_overlap)

213

In [241]:
denisovan_DR_overlap = pd.merge(denisovan_genes, denisovan_DR_genes)
len(denisovan_DR_overlap)

154

In [242]:
vindija_DR_overlap = pd.merge(vindija_genes, vindija_DR_genes)
len(vindija_DR_overlap)

222

In [243]:
altai_DR_overlap.to_csv('../DR_genes/altai_DR_overlap.txt', sep='\t', header=False, index=False)
denisovan_DR_overlap.to_csv('../DR_genes/denisovan_DR_overlap.txt', sep='\t', header=False, index=False)
vindija_DR_overlap.to_csv('../DR_genes/vindija_DR_overlap.txt', sep='\t', header=False, index=False)
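
One caveat when counting overlaps with `pd.merge`: an inner merge returns one row per matching pair, so a gene listed twice in either input is counted twice. The SAV gene lists above are deduplicated, but the DR gene files are read as-is. The sketch below uses hypothetical toy data to show the effect; whether the real DR files contain repeats is not something I have checked here.

```python
import pandas as pd

sav_genes = pd.DataFrame({'gene': ['AGRN', 'MEGF6', 'CAMTA1']})
# Toy DR list with a deliberate repeat of MEGF6 (hypothetical).
dr_genes = pd.DataFrame({'gene': ['MEGF6', 'MEGF6', 'PER3', 'AGRN']})

naive = pd.merge(sav_genes, dr_genes, on='gene', how='inner')
dedup = pd.merge(sav_genes, dr_genes.drop_duplicates(), on='gene', how='inner')

print(len(naive))  # 3, MEGF6 is counted twice
print(len(dedup))  # 2
```

A quick `dr_genes['gene'].is_unique` check on each DR file would confirm whether the overlap counts need this extra step.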

## SAV Genes and Circadian Genes <a class = 'anchor' id = 'circadiangenes'></a>

Let's also check the overlap between genes involved in circadian biology (Velázquez-Arcelay et al., in prep) and our genes with SAVs.

In [244]:
archaic_specific_genes_2_no_dups = archaic_specific_genes_2.drop_duplicates()
Vernot_introgressed_genes_2_no_dups = Vernot_introgressed_genes_2.drop_duplicates()

In [245]:
circadian_genes = pd.read_csv('../circadian_genes/circadian_genes.txt', sep = '\t', header = 0)
# Keep only the gene-name column and rename it to match our other gene lists.
circadian_genes = circadian_genes[['GeneName']].rename(columns={'GeneName': 'gene'})

In [246]:
circadian_archaic_specific_overlap = pd.merge(archaic_specific_genes_2_no_dups, circadian_genes)
len(circadian_archaic_specific_overlap)

28

In [247]:
circadian_introgressed_overlap = pd.merge(Vernot_introgressed_genes_2_no_dups, circadian_genes)
len(circadian_introgressed_overlap)

3

In [248]:
circadian_archaic_specific_overlap.to_csv('../circadian_genes/circadian_genes_archaic_specific_overlap.txt', sep='\t', header=False, index=False)
circadian_introgressed_overlap.to_csv('../circadian_genes/circadian_genes_introgressed_overlap.txt', sep='\t', header=False, index=False)

Let's also subset the main dataframe for any variants that occur in circadian genes.

In [249]:
circadian_genes.head(5)

Unnamed: 0,gene
0,PER3
1,RERE
2,DNAJC16
3,ECE1
4,HTR1D


In [250]:
circadian_mask = data['annotation'].isin(circadian_genes['gene'])
circadian_data = data[circadian_mask]
circadian_data.head(10)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
7314,chr1,7845695,T,C,C,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,3690.0,5096.0,0.72,0.76,0.66,0.78,0.76,0.65,0.785575,no,,,,,,,,,,,,,,,,no,,0.72,PER3,1.0116,-0.10855,0.63671,2.5313,-0.215,0.0,0.0,0.0,0.02,0.02,36,19,-49,19,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,16.0,yes,ancient,ancient
7315,chr1,7846527,A,C,A,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,2249.0,5096.0,0.44,0.45,0.37,0.62,0.26,0.38,0.639376,no,,,,,,,,,,,,,,,,no,,0.44,PER3,1.0116,-0.10855,0.63671,2.5313,-0.379,0.0,0.0,0.0,0.0,0.0,13,-7,-12,-7,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,20.0,yes,ancient,ancient
7316,chr1,7849200,A,T,A,derived,snv,1/1,./.,1/1,1/1,True,False,True,True,Other,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,PER3,1.0116,-0.10855,0.63671,2.5313,0.172,0.0,0.0,0.0,0.0,0.0,22,-20,-2,-27,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
7317,chr1,7853294,A,G,A,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,PER3,1.0116,-0.10855,0.63671,2.5313,-0.371,0.0,0.0,0.0,0.0,0.0,-47,17,-47,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
7318,chr1,7853884,G,A,A,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,PER3,1.0116,-0.10855,0.63671,2.5313,-2.062,0.0,0.0,0.0,0.0,0.0,-30,47,45,-48,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
7319,chr1,7855814,A,G,A,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,2757.0,5096.0,0.54,0.8,0.37,0.62,0.46,0.39,0.638402,no,,,,,,,,,,,,,,,,no,,0.54,PER3,1.0116,-0.10855,0.63671,2.5313,0.482,0.02,0.0,0.0,0.0,0.02,1,40,1,-36,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,18.0,yes,ancient,ancient
7320,chr1,7856346,T,C,T,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,1584.0,5096.0,0.31,0.19,0.38,0.27,0.29,0.43,0.264133,no,,,,,,,,,,,,,,,,no,,0.31,PER3,1.0116,-0.10855,0.63671,2.5313,0.358,0.0,0.0,0.0,0.0,0.0,-28,29,8,-42,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,13.0,yes,ancient,ancient
7321,chr1,7857156,T,A,T,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,PER3,1.0116,-0.10855,0.63671,2.5313,-0.372,0.0,0.0,0.0,0.0,0.0,-12,15,17,-6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
7322,chr1,7858042,G,A,G,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,PER3,1.0116,-0.10855,0.63671,2.5313,-0.14,0.0,0.0,0.0,0.0,0.0,2,33,2,-39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
7323,chr1,7858925,T,C,C,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,4973.0,5096.0,0.98,1.0,0.95,0.98,0.97,0.98,0.978558,no,,,,,,,,,,,,,,,,no,,0.98,PER3,1.0116,-0.10855,0.63671,2.5313,-0.726,0.0,0.0,0.0,0.0,0.0,-18,13,46,-2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,ancient,ancient


In [251]:
len(circadian_data)

31775

In [252]:
circadian_data.to_csv('../circadian_genes/all_circadian_variants.txt', sep='\t', header=True, index=False)
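
As a small illustration of the subsetting step, the sketch below applies the same `isin` mask to a toy dataframe and then tallies variants per circadian gene. All rows are hypothetical; only the column names `annotation`, `delta_max`, and `gene` follow the notebook.

```python
import pandas as pd

# Toy variant table and toy circadian gene list (hypothetical values).
variants = pd.DataFrame({
    'annotation': ['PER3', 'PER3', 'AGRN', 'RERE'],
    'delta_max': [0.02, 0.0, 0.43, 0.25],
})
circadian = pd.DataFrame({'gene': ['PER3', 'RERE', 'ECE1']})

# True for variants whose annotated gene is in the circadian list.
mask = variants['annotation'].isin(circadian['gene'])
subset = variants[mask].copy()  # copy so later edits do not touch `variants`

print(len(subset))  # 3
print(subset['annotation'].value_counts().to_dict())  # {'PER3': 2, 'RERE': 1}
```

Because the mask is built from the full dataframe rather than the SAV subsets, the result includes variants at any delta, which is why the count above is much larger than the gene-level overlaps.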

# Gene Enrichment <a class = 'anchor' id = 'geneenrichment'></a>

Let's subset data for our gene enrichment analyses. 

In [253]:
enrichment_data = data[['delta_max','annotation','distribution','Vernot_allele_origin','Browning_allele_origin']]

In [254]:
enrichment_data.to_csv('archaic_data_for_enrichment.txt', sep='\t', header=True, index=False)

We have plenty of GWAS and HPO terms. Let's make life a bit easier by categorizing them into systems starting with GWAS. Write a function to generate a new enrichment summary file with an added column that lists the system per trait.

In [255]:
def map_GWAS_systems():
    GWAS_mapping = pd.read_csv('../gene_enrichment/GWAS_terms_and_systems.txt', sep = '\t', header = 0)
    GWAS_system = pd.Series(GWAS_mapping['system'].values, index = GWAS_mapping['GWAS_term']).to_dict()
    setnames = ['altai_2','archaic_specific_2','Browning_introgressed_2','chagyrskaya_2','denisovan_2','neanderthal_2','shared_2','Vernot_introgressed_2','vindija_2']

    # the loop variable is called set_name so that it does not shadow the built-in set
    for set_name in setnames:
        dataframe = pd.read_csv(f'../gene_enrichment/enrichment/{set_name}_GWAS_enrichment.txt', sep = '\t', header = 0)
        dataframe['system'] = dataframe['label'].map(GWAS_system)
        dataframe.to_csv(f'../gene_enrichment/enrichment/{set_name}_GWAS_enrichment_with_system.txt', sep='\t', header=True, index=False)

Run the function.

In [256]:
map_GWAS_systems()

Repeat with HPO.

In [257]:
def map_HPO_systems():
    HPO_mapping = pd.read_csv('../gene_enrichment/HPO_terms_and_systems.txt', sep = '\t', header = 0)
    HPO_system = pd.Series(HPO_mapping['system'].values, index = HPO_mapping['HPO_term']).to_dict()
    setnames = ['altai_2','archaic_specific_2','Browning_introgressed_2','chagyrskaya_2','denisovan_2','neanderthal_2','shared_2','Vernot_introgressed_2','vindija_2']

    # the loop variable is called set_name so that it does not shadow the built-in set
    for set_name in setnames:
        dataframe = pd.read_csv(f'../gene_enrichment/enrichment/{set_name}_HPO_enrichment.txt', sep = '\t', header = 0)
        dataframe['system'] = dataframe['label'].map(HPO_system)
        dataframe.to_csv(f'../gene_enrichment/enrichment/{set_name}_HPO_enrichment_with_system.txt', sep='\t', header=True, index=False)

In [258]:
map_HPO_systems()
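
The two functions above differ only in the ontology name, so they could be folded into one. Here is a minimal sketch of that idea on made-up tables. It assumes the mapping table has a `system` column and a `<ontology>_term` column, as in the files read above; `add_system_column` and the toy dataframes are hypothetical names used only for illustration.

```python
import pandas as pd

def add_system_column(enrichment, mapping, ontology):
    """Return a copy of an enrichment table with a 'system' column added.

    `mapping` needs a '<ontology>_term' column and a 'system' column.
    """
    term_to_system = mapping.set_index(f'{ontology}_term')['system'].to_dict()
    out = enrichment.copy()
    # terms that are absent from the mapping become NaN instead of raising an error
    out['system'] = out['label'].map(term_to_system)
    return out

# toy tables (made up) standing in for the files read from disk
toy_mapping = pd.DataFrame({'GWAS_term': ['height', 'asthma'],
                            'system': ['skeletal', 'respiratory']})
toy_enrichment = pd.DataFrame({'label': ['asthma', 'height', 'unmapped trait'],
                               'p_value': [0.001, 0.2, 0.5]})

toy_with_system = add_system_column(toy_enrichment, toy_mapping, 'GWAS')
print(toy_with_system)
```

Counting the NaN values in the new column is a quick way to spot traits that are missing from the mapping file.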

Now it's time to get FDR-corrected p-value thresholds for our gene enrichment analyses. Let's write a function to start.

In [259]:
fdr_table = []

In [260]:
def reportFDRcorrectedPthreshold(set_name, ontology, q_value_threshold, resolution=0.0001, minStart=0):
    # empirical null p-values and the observed enrichment p-values for this set and ontology
    fdr_empiric = pd.read_csv(f'../gene_enrichment/empiric_FDR/{set_name}_{ontology}_empiric_FDR.txt', sep = '\t', header = None, index_col = 0)
    obs = pd.read_csv(f'../gene_enrichment/enrichment/{set_name}_{ontology}_enrichment.txt', sep = '\t')

    fdr_threshold = []
    # scan candidate p-value thresholds from minStart up to 0.05
    for i in np.arange(minStart,0.05,resolution):

        observed_positive = sum(obs['p_value'] <= i)
        # count null p-values at or below the threshold per column, then average across columns
        average_false_positive = (fdr_empiric <= i).sum().mean()
        # q is inf or NaN (with a warning) when no observed p-value passes the threshold
        q = average_false_positive/observed_positive
        fdr_threshold.append([set_name, ontology, q_value_threshold, i, observed_positive, average_false_positive, q])

        # stop at the first threshold whose finite q exceeds the target
        if (q != np.inf) & (q > q_value_threshold):
            break

    # keep the entry just before the last one recorded
    threshold = fdr_threshold[-2]
    fdr_table.append(threshold)
    #fdr_threshold = pd.DataFrame(fdr_threshold, columns = ['pval_threshold','obsPos','avgFalsePos','q'])
    #return fdr_threshold.tail(2).head(1)
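
To make the calculation inside the loop concrete, here is a minimal sketch on made-up p-values. It is not the function above: `empirical_q` and the toy arrays are hypothetical, the layout of the null table (rows are terms, columns are permutations) is an assumption, and the sketch returns `inf` explicitly when nothing passes the threshold rather than dividing by zero.

```python
import numpy as np

def empirical_q(observed_p, null_p, p_threshold):
    """Empirical q at one p-value threshold.

    observed_p: 1-D array of observed p-values, one per term.
    null_p: 2-D array of null p-values; rows = terms and
            columns = permutations is an assumption for this sketch.
    """
    observed_positive = int(np.sum(observed_p <= p_threshold))
    # false positives per permutation, averaged over permutations
    average_false_positive = float(np.mean(np.sum(null_p <= p_threshold, axis=0)))
    if observed_positive == 0:
        # nothing passes the threshold, so report inf without dividing by zero
        return observed_positive, average_false_positive, np.inf
    return observed_positive, average_false_positive, average_false_positive / observed_positive

# made-up p-values: three terms, two permutations
toy_obs = np.array([0.001, 0.02, 0.5])
toy_null = np.array([[0.001, 0.9],
                     [0.3, 0.4],
                     [0.6, 0.7]])

for p in (0.0001, 0.01, 0.05):
    print(p, empirical_q(toy_obs, toy_null, p))
```

At a threshold of 0.01, one observed term passes and the permutations contribute 0.5 false positives on average, so q is 0.5.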

Now to generate all the combinations for which to get a corrected p-value.

In [261]:
combinations = [(set_name,ontology,q_value_threshold) for set_name in ['altai_2','archaic_specific_2','Browning_introgressed_2','chagyrskaya_2','denisovan_2','neanderthal_2','shared_2','Vernot_introgressed_2','vindija_2'] for ontology in ['GWAS','HPO'] for q_value_threshold in [0.05,0.1]]

And run!

In [262]:
[reportFDRcorrectedPthreshold(set_name, ontology, q_value_threshold) for set_name, ontology, q_value_threshold in combinations]

  q = average_false_positive/observed_positive

[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None]

In [263]:
fdr_table = pd.DataFrame(fdr_table, columns = ['set', 'ontology', 'q_value_threshold', 'p_value_threshold','observed_positive','average_false_positive','q'])
fdr_table

Unnamed: 0,set,ontology,q_value_threshold,p_value_threshold,observed_positive,average_false_positive,q
0,altai_2,GWAS,0.05,0.0029,15,0.542,0.036133
1,altai_2,GWAS,0.1,0.0089,25,2.493,0.09972
2,altai_2,HPO,0.05,0.0011,0,0.228,inf
3,altai_2,HPO,0.1,0.0011,0,0.228,inf
4,archaic_specific_2,GWAS,0.05,0.0019,10,0.336,0.0336
5,archaic_specific_2,GWAS,0.1,0.0029,11,0.732,0.066545
6,archaic_specific_2,HPO,0.05,0.0069,68,3.157,0.046426
7,archaic_specific_2,HPO,0.1,0.0199,119,11.679,0.098143
8,Browning_introgressed_2,GWAS,0.05,0.0009,4,0.0,0.0
9,Browning_introgressed_2,GWAS,0.1,0.0019,4,0.223,0.05575


# Gene Characteristics <a class = 'anchor' id = 'genecharacteristics'></a>

## N Exons, CDS Length, and Gene Length <a class = 'anchor' id = 'physical'></a>

Do genes with more exons or longer genes have more SAVs? Let's load the annotation data first.

In [264]:
annotation_header = ['name','chrom','strand','tx_start','tx_end','exon_start','exon_end']
annotations = pd.read_csv("../annotations/grch37_exon_annotations.txt", sep='\t', skiprows=1, names=annotation_header)
annotations.head(10)

Unnamed: 0,name,chrom,strand,tx_start,tx_end,exon_start,exon_end
0,OR4F5,chr1,+,69090,70008,"69090,","70008,"
1,AL627309.1,chr1,-,134900,139379,"134900,137620,","135802,139379,"
2,OR4F29,chr1,+,367639,368634,"367639,","368634,"
3,OR4F16,chr1,-,621095,622034,"621095,","622034,"
4,AL669831.1,chr1,-,738531,739137,"738531,738787,739120,","738618,738812,739137,"
5,AL645608.2,chr1,+,818042,819983,"818042,819495,819960,","818058,819513,819983,"
6,SAMD11,chr1,+,861117,879955,"861117,861301,865534,866418,871151,874419,8746...","861180,861393,865716,866469,871276,874509,8748..."
7,AL645608.1,chr1,-,861263,866445,"861263,863254,865555,865665,865989,866425,","861406,863261,865660,865719,865996,866445,"
8,NOC2L,chr1,-,879583,894670,"879583,880436,880897,881552,881781,883510,8838...","880180,880526,881033,881666,881925,883612,8839..."
9,KLHL17,chr1,+,895966,901095,"895966,896672,897008,897205,897734,898083,8984...","896180,896932,897130,897427,897851,898297,8986..."


Get the length.

In [265]:
len(annotations)

20274

Now let's count the number of exons. Each exon start and end is followed by a comma, so counting the delimiter in either of the last two columns gives the number of exons.

In [266]:
annotations['n_exons'] = annotations['exon_start'].str.count(',')
annotations.head(10)

Unnamed: 0,name,chrom,strand,tx_start,tx_end,exon_start,exon_end,n_exons
0,OR4F5,chr1,+,69090,70008,"69090,","70008,",1
1,AL627309.1,chr1,-,134900,139379,"134900,137620,","135802,139379,",2
2,OR4F29,chr1,+,367639,368634,"367639,","368634,",1
3,OR4F16,chr1,-,621095,622034,"621095,","622034,",1
4,AL669831.1,chr1,-,738531,739137,"738531,738787,739120,","738618,738812,739137,",3
5,AL645608.2,chr1,+,818042,819983,"818042,819495,819960,","818058,819513,819983,",3
6,SAMD11,chr1,+,861117,879955,"861117,861301,865534,866418,871151,874419,8746...","861180,861393,865716,866469,871276,874509,8748...",14
7,AL645608.1,chr1,-,861263,866445,"861263,863254,865555,865665,865989,866425,","861406,863261,865660,865719,865996,866445,",6
8,NOC2L,chr1,-,879583,894670,"879583,880436,880897,881552,881781,883510,8838...","880180,880526,881033,881666,881925,883612,8839...",19
9,KLHL17,chr1,+,895966,901095,"895966,896672,897008,897205,897734,898083,8984...","896180,896932,897130,897427,897851,898297,8986...",12


Now the total length of the gene from the start of the first exon to the end of the last exon.

In [267]:
annotations['gene_length'] = annotations['tx_end'] - annotations['tx_start']

Now let's get the total exon length. Let's extract information as lists to make this easier and then bring everything back together. First, get a list of gene names copied by the number of exons present in that gene.

In [268]:
gene_names = annotations['name'].loc[annotations.index.repeat(annotations['n_exons'])].tolist()

Let's also get a list of chromosome names.

In [269]:
chroms = annotations['chrom'].loc[annotations.index.repeat(annotations['n_exons'])].tolist()

Then grab a list of exon starts and stops.

In [270]:
exon_starts = annotations['exon_start'].str.split(',').sum()
exon_stops = annotations['exon_end'].str.split(',').sum()

The trailing commas are creating blanks in the list so let's get rid of those.

In [271]:
exon_starts = list(filter(None, exon_starts))
exon_stops = list(filter(None, exon_stops))
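
As an aside, the same per-exon lists can be built without concatenating lists via `sum()` and filtering the blanks afterwards. Below is a minimal sketch on a made-up two-gene table; the gene names and coordinates are invented, and the approach (strip the trailing comma, split, then `explode`) is an alternative for illustration rather than what the cells above do.

```python
import pandas as pd

# toy annotation table (made up) with trailing commas, as in the real file
toy = pd.DataFrame({'name': ['geneA', 'geneB'],
                    'exon_start': ['100,300,', '500,'],
                    'exon_end': ['200,400,', '650,']})

# strip the trailing comma first so that splitting leaves no empty strings
starts = toy['exon_start'].str.rstrip(',').str.split(',')
ends = toy['exon_end'].str.rstrip(',').str.split(',')

per_exon = pd.DataFrame({
    # repeat each gene name once per exon
    'name': toy['name'].repeat(starts.str.len()).to_numpy(),
    # explode turns each list element into its own row
    'exon_start': starts.explode().astype(int).to_numpy(),
    'exon_stop': ends.explode().astype(int).to_numpy(),
})
print(per_exon)
```

This yields one row per exon with integer coordinates, so the later `astype(int)` step would not be needed.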

Create a new dataframe and bring everything together.

In [272]:
exon_length = pd.DataFrame()
exon_length['name'] = gene_names
exon_length['exon_start'] = exon_starts
exon_length['exon_stop'] = exon_stops

Make sure it looks the way it should!

In [273]:
exon_length.head(10)

Unnamed: 0,name,exon_start,exon_stop
0,OR4F5,69090,70008
1,AL627309.1,134900,135802
2,AL627309.1,137620,139379
3,OR4F29,367639,368634
4,OR4F16,621095,622034
5,AL669831.1,738531,738618
6,AL669831.1,738787,738812
7,AL669831.1,739120,739137
8,AL645608.2,818042,818058
9,AL645608.2,819495,819513


Quick sidebar! This information would be very useful as annotation data. Let's construct a dataframe including the chromosome and save the result. We will use BED format so we need to subtract 1 from the exon start. Then we shall carry on!

In [274]:
hg19_exons = pd.DataFrame()
hg19_exons['chrom'] = chroms
hg19_exons['exon_start'] = exon_length['exon_start'].astype(int)-1
hg19_exons['exon_stop'] = exon_stops
hg19_exons['gene'] = gene_names

In [275]:
hg19_exons.head(10)

Unnamed: 0,chrom,exon_start,exon_stop,gene
0,chr1,69089,70008,OR4F5
1,chr1,134899,135802,AL627309.1
2,chr1,137619,139379,AL627309.1
3,chr1,367638,368634,OR4F29
4,chr1,621094,622034,OR4F16
5,chr1,738530,738618,AL669831.1
6,chr1,738786,738812,AL669831.1
7,chr1,739119,739137,AL669831.1
8,chr1,818041,818058,AL645608.2
9,chr1,819494,819513,AL645608.2


In [276]:
hg19_exons.to_csv('../annotations/hg19_exons.bed', sep="\t", header = False, index = False)

Now change the data type for the starts and stops and create a length column.

In [277]:
exon_length['exon_stop'] = exon_length['exon_stop'].astype(int)
exon_length['exon_start'] = exon_length['exon_start'].astype(int)
exon_length['exon_length'] = exon_length['exon_stop'] - exon_length['exon_start']

Check to see that it worked.

In [278]:
exon_length.head(10)

Unnamed: 0,name,exon_start,exon_stop,exon_length
0,OR4F5,69090,70008,918
1,AL627309.1,134900,135802,902
2,AL627309.1,137620,139379,1759
3,OR4F29,367639,368634,995
4,OR4F16,621095,622034,939
5,AL669831.1,738531,738618,87
6,AL669831.1,738787,738812,25
7,AL669831.1,739120,739137,17
8,AL645608.2,818042,818058,16
9,AL645608.2,819495,819513,18


Now let's group by the gene to get the total exon length.

In [279]:
exon_length = exon_length.groupby('name')['exon_length'].sum().to_frame('exon_length')

Create a new dataframe.

In [280]:
n_exons_length = annotations[['name','n_exons','gene_length']]
n_exons_length.head(10)

Unnamed: 0,name,n_exons,gene_length
0,OR4F5,1,918
1,AL627309.1,2,4479
2,OR4F29,1,995
3,OR4F16,1,939
4,AL669831.1,3,606
5,AL645608.2,3,1941
6,SAMD11,14,18838
7,AL645608.1,6,5182
8,NOC2L,19,15087
9,KLHL17,12,5129


In [281]:
n_exons_length = pd.merge(n_exons_length, exon_length['exon_length'], on='name')

Let's take a peek at the distributions of these variables.

In [282]:
n_exons_hist = n_exons_length.groupby('n_exons').size().to_frame('count')
n_exons_hist

Unnamed: 0_level_0,count
n_exons,Unnamed: 1_level_1
1,1597
2,1525
3,1482
4,1604
5,1517
6,1389
7,1152
8,1088
9,1024
10,858


In [283]:
gene_length_hist = n_exons_length.groupby('gene_length').size().to_frame('count')
gene_length_hist

Unnamed: 0_level_0,count
gene_length,Unnamed: 1_level_1
9,2
11,1
13,1
16,3
17,6
...,...
1987243,1
2056874,1
2057828,1
2092292,1


In [284]:
exon_length_hist = n_exons_length.groupby('exon_length').size().to_frame('count')
exon_length_hist

Unnamed: 0_level_0,count
exon_length,Unnamed: 1_level_1
9,2
11,1
13,1
16,3
17,6
...,...
30355,1
33679,1
34526,1
43816,1


Now to actually test their relationship to variant count at our two thresholds.

In [285]:
n_variants = data.rename(columns={'annotation':'name'})
n_variants = n_variants.groupby('name').size().to_frame('n_variants')
variant_characterstics = pd.merge(n_exons_length, n_variants, on='name', how='left')

In [286]:
n_splice_variants_2 = data_2.rename(columns={'annotation':'name'})
n_splice_variants_2 = n_splice_variants_2.groupby('name').size().to_frame('variant_count_2')
variant_characterstics = pd.merge(variant_characterstics, n_splice_variants_2, on='name', how='left')

In [287]:
n_splice_variants_5 = data_5.rename(columns={'annotation':'name'})
n_splice_variants_5 = n_splice_variants_5.groupby('name').size().to_frame('variant_count_5')
variant_characterstics = pd.merge(variant_characterstics, n_splice_variants_5, on='name', how='left')

In [288]:
variant_characterstics = variant_characterstics.fillna(0)
variant_characterstics.head(20)

Unnamed: 0,name,n_exons,gene_length,exon_length,n_variants,variant_count_2,variant_count_5
0,OR4F5,1,918,918,0.0,0.0,0.0
1,AL627309.1,2,4479,2661,0.0,0.0,0.0
2,OR4F29,1,995,995,0.0,0.0,0.0
3,OR4F16,1,939,939,0.0,0.0,0.0
4,AL669831.1,3,606,129,0.0,0.0,0.0
5,AL645608.2,3,1941,57,0.0,0.0,0.0
6,SAMD11,14,18838,2551,53.0,2.0,0.0
7,AL645608.1,6,5182,336,17.0,0.0,0.0
8,NOC2L,19,15087,2790,37.0,0.0,0.0
9,KLHL17,12,5129,2560,7.0,0.0,0.0


In [289]:
len(variant_characterstics)

20274

Save this dataframe for plotting.

In [290]:
variant_characterstics.to_csv('../dataframes/variant_characteristics.txt', sep="\t", header = True, index = False)

Subset to genes with at least one variant.

In [291]:
variant_characterstics = variant_characterstics[variant_characterstics['n_variants'] > 0]

In [292]:
len(variant_characterstics)

17631

Let's run some correlations on the characteristics.

In [293]:
rho, p = spearmanr(variant_characterstics['n_exons'], variant_characterstics['variant_count_2'])
print(rho,p)

0.3160848388955107 0.0


In [294]:
rho, p = spearmanr(variant_characterstics['exon_length'], variant_characterstics['variant_count_2'])
print(rho,p)

0.1851170800689867 1.0667367197057646e-135


In [295]:
rho, p = spearmanr(variant_characterstics['gene_length'], variant_characterstics['variant_count_2'])
print(rho,p)

0.2919566369123417 0.0


In [296]:
rho, p = spearmanr(variant_characterstics['n_exons'], variant_characterstics['variant_count_5'])
print(rho,p)

0.13299027146082146 2.2009095588908162e-70


In [297]:
rho, p = spearmanr(variant_characterstics['exon_length'], variant_characterstics['variant_count_5'])
print(rho,p)

0.0624799065089255 1.009030502488749e-16


In [298]:
rho, p = spearmanr(variant_characterstics['gene_length'], variant_characterstics['variant_count_5'])
print(rho,p)

0.1017201812950686 8.985235346295254e-42
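
The six cells above repeat the same call with different column pairs. A minimal sketch of running them in one loop and collecting the results in a table follows; the toy dataframe is made up, and `correlations` is a hypothetical name.

```python
from itertools import product

import pandas as pd
from scipy.stats import spearmanr

# toy gene table (made up) with the same kind of columns as above
toy = pd.DataFrame({'n_exons': [1, 3, 5, 8, 12],
                    'gene_length': [900, 4000, 2500, 30000, 18000],
                    'variant_count_2': [0, 1, 1, 4, 6]})

features = ['n_exons', 'gene_length']
counts = ['variant_count_2']

rows = []
for feature, count in product(features, counts):
    # one Spearman test per (characteristic, variant count) pair
    rho, p = spearmanr(toy[feature], toy[count])
    rows.append({'feature': feature, 'count': count, 'rho': rho, 'p': p})

correlations = pd.DataFrame(rows)
print(correlations)
```

Keeping the results in one dataframe also makes them easy to save alongside the other tables.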


## N Isoforms <a class = 'anchor' id = 'isoforms'></a>

Is the number of known isoforms associated with the number of SAVs?

In [299]:
isoforms_header = ['name','isoform_count']
isoforms = pd.read_csv('../annotations/GENCODE_Release_40_hg38_N_isoforms.txt', sep = '\t', skiprows = 1, names = isoforms_header)
isoforms.head(10)

Unnamed: 0,name,isoform_count
0,MIR1302-2HG,2
1,FAM138A,2
2,OR4F5,1
3,ENSG00000238009,5
4,ENSG00000239945,1
5,ENSG00000239906,1
6,ENSG00000241860,6
7,ENSG00000241599,1
8,ENSG00000286448,1
9,ENSG00000236601,3


Merge with the dataframe above, storing the result in a new dataframe.

In [300]:
n_isoforms = pd.merge(variant_characterstics, isoforms, on = ['name'])
n_isoforms.head(10)

Unnamed: 0,name,n_exons,gene_length,exon_length,n_variants,variant_count_2,variant_count_5,isoform_count
0,SAMD11,14,18838,2551,53.0,2.0,0.0,15
1,NOC2L,19,15087,2790,37.0,0.0,0.0,6
2,KLHL17,12,5129,2560,7.0,0.0,0.0,4
3,PLEKHN1,16,8612,2404,21.0,1.0,0.0,5
4,PERM1,3,5825,3340,10.0,0.0,0.0,4
5,HES4,4,1148,899,2.0,0.0,0.0,4
6,ISG15,2,1118,711,7.0,0.0,0.0,3
7,AGRN,36,35994,7323,85.0,4.0,0.0,10
8,RNF223,2,3342,1902,3.0,0.0,0.0,1
9,C1orf159,10,34276,1841,52.0,0.0,0.0,19


In [301]:
len(n_isoforms)

16151

In [302]:
rho, p = spearmanr(n_isoforms['isoform_count'], n_isoforms['variant_count_2'])
print(rho,p)

0.1767337559003496 1.8670449850386394e-113


In [303]:
rho, p = spearmanr(n_isoforms['isoform_count'], n_isoforms['variant_count_5'])
print(rho,p)

0.07692978918607588 1.2368500958019458e-22


In [304]:
rho, p = spearmanr(n_isoforms['n_exons'], n_isoforms['isoform_count'])
print(rho,p)

0.5639685614316065 0.0


In [305]:
rho, p = spearmanr(n_isoforms['gene_length'], n_isoforms['variant_count_5'])
print(rho,p)

0.09741069465365773 2.3604050238515602e-35


Run a partial correlation to control for the number of exons, which seems to be the strongest covariate.

In [306]:
partial_corr(data = n_isoforms, x = 'isoform_count', y = 'variant_count_2', covar=['n_exons'], method = 'spearman')

Unnamed: 0,n,r,CI95%,p-val
spearman,16151,-0.003008,"[-0.02, 0.01]",0.702242


In [307]:
partial_corr(data = n_isoforms, x = 'isoform_count', y = 'variant_count_5', covar=['n_exons'], method = 'spearman')

Unnamed: 0,n,r,CI95%,p-val
spearman,16151,0.000817,"[-0.01, 0.02]",0.917351
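
To see why a raw correlation can vanish once a covariate is controlled for, here is a minimal sketch of a first-order Spearman partial correlation on simulated data. It uses the textbook formula on rank correlations and is not necessarily how the `partial_corr` call above computes its result internally; `partial_spearman` and the simulated columns are hypothetical.

```python
import numpy as np
import pandas as pd

def partial_spearman(df, x, y, covar):
    """First-order Spearman partial correlation of x and y given covar."""
    # the Pearson correlation of the ranks is the Spearman correlation
    r = df[[x, y, covar]].rank().corr()
    r_xy, r_xz, r_yz = r.loc[x, y], r.loc[x, covar], r.loc[y, covar]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# simulated data: x and y are related only through z
rng = np.random.default_rng(1)
z = rng.normal(size=2000)
toy = pd.DataFrame({'z': z,
                    'x': z + rng.normal(size=2000),
                    'y': z + rng.normal(size=2000)})

raw_rho = toy['x'].rank().corr(toy['y'].rank())
partial_rho = partial_spearman(toy, 'x', 'y', 'z')
print(raw_rho, partial_rho)
```

In this simulation the raw correlation is clearly positive while the partial correlation sits near zero, which mirrors the pattern seen for isoform count once the number of exons is accounted for.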


## Constraint and Conservation <a class = 'anchor' id = 'constraintconservation'></a>

Let's check for differences in constraint and conservation starting with missense observed over expected. Start with Vernot.

In [308]:
Vernot_ancient_mis_oe = data_2.loc[data_2['Vernot_allele_origin'] == 'ancient', ['annotation', 'mis_oe']].drop_duplicates().dropna()
Vernot_archaic_specific_mis_oe = data_2.loc[data_2['Vernot_allele_origin'] == 'archaic-specific', ['annotation', 'mis_oe']].drop_duplicates().dropna()
Vernot_introgressed_mis_oe = data_2.loc[data_2['Vernot_allele_origin'] == 'introgressed', ['annotation', 'mis_oe']].drop_duplicates().dropna()
non_SA_mis_oe = data.loc[~data['annotation'].isin(data_2['annotation']), ['annotation', 'mis_oe']].drop_duplicates().dropna()

In [309]:
len(Vernot_ancient_mis_oe)

1817

In [310]:
len(Vernot_archaic_specific_mis_oe)

1931

In [311]:
len(Vernot_introgressed_mis_oe)

228

In [312]:
len(non_SA_mis_oe)

12708

In [313]:
kruskal(Vernot_ancient_mis_oe['mis_oe'], Vernot_archaic_specific_mis_oe['mis_oe'], Vernot_introgressed_mis_oe['mis_oe'], non_SA_mis_oe['mis_oe'])

KruskalResult(statistic=18.885779825656478, pvalue=0.00028867568579468293)

Now for missense z-score.

In [314]:
Vernot_ancient_mis_z = data_2.loc[data_2['Vernot_allele_origin'] == 'ancient', ['annotation', 'mis_z']].drop_duplicates().dropna()
Vernot_archaic_specific_mis_z = data_2.loc[data_2['Vernot_allele_origin'] == 'archaic-specific', ['annotation', 'mis_z']].drop_duplicates().dropna()
Vernot_introgressed_mis_z = data_2.loc[data_2['Vernot_allele_origin'] == 'introgressed', ['annotation', 'mis_z']].drop_duplicates().dropna()
non_SA_mis_z = data.loc[~data['annotation'].isin(data_2['annotation']), ['annotation', 'mis_z']].drop_duplicates().dropna()

In [315]:
len(Vernot_ancient_mis_z)

1817

In [316]:
len(Vernot_archaic_specific_mis_z)

1931

In [317]:
len(Vernot_introgressed_mis_z)

228

In [318]:
len(non_SA_mis_z)

12708

In [319]:
kruskal(Vernot_ancient_mis_z['mis_z'], Vernot_archaic_specific_mis_z['mis_z'], Vernot_introgressed_mis_z['mis_z'], non_SA_mis_z['mis_z'])

KruskalResult(statistic=8.069206499946368, pvalue=0.04460286080335209)

Now for LoF observed/expected.

In [320]:
Vernot_ancient_lof_oe = data_2.loc[data_2['Vernot_allele_origin'] == 'ancient', ['annotation', 'lof_oe']].drop_duplicates().dropna()
Vernot_archaic_specific_lof_oe = data_2.loc[data_2['Vernot_allele_origin'] == 'archaic-specific', ['annotation', 'lof_oe']].drop_duplicates().dropna()
Vernot_introgressed_lof_oe = data_2.loc[data_2['Vernot_allele_origin'] == 'introgressed', ['annotation', 'lof_oe']].drop_duplicates().dropna()
non_SA_lof_oe = data.loc[~data['annotation'].isin(data_2['annotation']), ['annotation', 'lof_oe']].drop_duplicates().dropna()

In [321]:
len(Vernot_ancient_lof_oe)

1811

In [322]:
len(Vernot_archaic_specific_lof_oe)

1922

In [323]:
len(Vernot_introgressed_lof_oe)

228

In [324]:
len(non_SA_lof_oe)

12484

In [325]:
kruskal(Vernot_ancient_lof_oe['lof_oe'], Vernot_archaic_specific_lof_oe['lof_oe'], Vernot_introgressed_lof_oe['lof_oe'], non_SA_lof_oe['lof_oe'])

KruskalResult(statistic=1.6972263017314375, pvalue=0.6375506282903953)

Now for LoF z-scores.

In [326]:
Vernot_ancient_lof_z = data_2.loc[data_2['Vernot_allele_origin'] == 'ancient', ['annotation', 'lof_z']].drop_duplicates().dropna()
Vernot_archaic_specific_lof_z = data_2.loc[data_2['Vernot_allele_origin'] == 'archaic-specific', ['annotation', 'lof_z']].drop_duplicates().dropna()
Vernot_introgressed_lof_z = data_2.loc[data_2['Vernot_allele_origin'] == 'introgressed', ['annotation', 'lof_z']].drop_duplicates().dropna()
non_SA_lof_z = data.loc[~data['annotation'].isin(data_2['annotation']), ['annotation', 'lof_z']].drop_duplicates().dropna()

In [327]:
len(Vernot_ancient_lof_z)

1811

In [328]:
len(Vernot_archaic_specific_lof_z)

1922

In [329]:
len(Vernot_introgressed_lof_z)

228

In [330]:
len(non_SA_lof_z)

12484

In [331]:
kruskal(Vernot_ancient_lof_z['lof_z'], Vernot_archaic_specific_lof_z['lof_z'], Vernot_introgressed_lof_z['lof_z'], non_SA_lof_z['lof_z'])

KruskalResult(statistic=322.37404245052227, pvalue=1.4283254818689705e-69)

And phyloP.

In [332]:
Vernot_ancient_phyloP = data_2.loc[data_2['Vernot_allele_origin'] == 'ancient', 'phyloP'].dropna()
Vernot_archaic_specific_phyloP = data_2.loc[data_2['Vernot_allele_origin'] == 'archaic-specific', 'phyloP'].dropna()
Vernot_introgressed_phyloP = data_2.loc[data_2['Vernot_allele_origin'] == 'introgressed', 'phyloP'].dropna()
non_SA_phyloP = data.loc[data['delta_max'] < 0.2, 'phyloP'].dropna()

In [333]:
len(Vernot_ancient_phyloP)

2251

In [334]:
len(Vernot_archaic_specific_phyloP)

2343

In [335]:
len(Vernot_introgressed_phyloP)

237

In [336]:
len(non_SA_phyloP)

1600192

In [337]:
kruskal(Vernot_ancient_phyloP, Vernot_archaic_specific_phyloP, Vernot_introgressed_phyloP, non_SA_phyloP)

KruskalResult(statistic=826.8618502507255, pvalue=6.462661238420673e-179)
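
The gene-level tests above all follow the same recipe: subset by allele origin, keep one row per gene, drop missing values, and compare against genes without SAVs. A minimal sketch of that recipe as a single helper is shown below on made-up data; `kruskal_by_origin` and the toy dataframes are hypothetical, and the column names simply mirror the ones used above.

```python
import pandas as pd
from scipy.stats import kruskal

def kruskal_by_origin(sav, background, origin_col, metric):
    """Kruskal-Wallis test of a gene-level metric across allele-origin
    groups plus a background group of genes without SAVs."""
    origins = ('ancient', 'archaic-specific', 'introgressed')
    groups = [
        sav.loc[sav[origin_col] == origin, ['annotation', metric]]
           .drop_duplicates().dropna()[metric]
        for origin in origins
    ]
    # background genes are those never seen in the SAV table
    non_sav = (background.loc[~background['annotation'].isin(sav['annotation']),
                              ['annotation', metric]]
               .drop_duplicates().dropna()[metric])
    return groups, non_sav, kruskal(*groups, non_sav)

# toy data (made up): g1 carries two SAVs and should be counted once
toy_sav = pd.DataFrame({
    'annotation': ['g1', 'g1', 'g2', 'g3', 'g4', 'g5', 'g6', 'g7', 'g8', 'g9'],
    'Vernot_allele_origin': ['ancient'] * 4 + ['archaic-specific'] * 3 + ['introgressed'] * 3,
    'mis_oe': [0.50, 0.50, 0.60, 0.70, 0.90, 1.00, 1.10, 0.80, 0.85, 0.95],
})
toy_background = pd.concat([
    toy_sav[['annotation', 'mis_oe']],
    pd.DataFrame({'annotation': ['g10', 'g11', 'g12'], 'mis_oe': [1.20, 1.30, 1.40]}),
], ignore_index=True)

groups, non_sav, result = kruskal_by_origin(toy_sav, toy_background, 'Vernot_allele_origin', 'mis_oe')
print([len(g) for g in groups], len(non_sav), result)
```

Returning the groups alongside the test result keeps the group sizes available for reporting, as done with the `len()` cells above.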

Repeat for Browning.

In [338]:
Browning_ancient_mis_oe = data_2.loc[data_2['Browning_allele_origin'] == 'ancient', ['annotation', 'mis_oe']].drop_duplicates().dropna()
Browning_archaic_specific_mis_oe = data_2.loc[data_2['Browning_allele_origin'] == 'archaic-specific', ['annotation', 'mis_oe']].drop_duplicates().dropna()
Browning_introgressed_mis_oe = data_2.loc[data_2['Browning_allele_origin'] == 'introgressed', ['annotation', 'mis_oe']].drop_duplicates().dropna()
non_SA_mis_oe = data.loc[~data['annotation'].isin(data_2['annotation']), ['annotation', 'mis_oe']].drop_duplicates().dropna()

In [339]:
len(Browning_ancient_mis_oe)

1780

In [340]:
len(Browning_archaic_specific_mis_oe)

1931

In [341]:
len(Browning_introgressed_mis_oe)

344

In [342]:
len(non_SA_mis_oe)

12708

In [343]:
kruskal(Browning_ancient_mis_oe['mis_oe'], Browning_archaic_specific_mis_oe['mis_oe'], Browning_introgressed_mis_oe['mis_oe'], non_SA_mis_oe['mis_oe'])

KruskalResult(statistic=21.310076146551417, pvalue=9.076442557005215e-05)

Now for missense z-score.

In [344]:
Browning_ancient_mis_z = data_2.loc[data_2['Browning_allele_origin'] == 'ancient', ['annotation', 'mis_z']].drop_duplicates().dropna()
Browning_archaic_specific_mis_z = data_2.loc[data_2['Browning_allele_origin'] == 'archaic-specific', ['annotation', 'mis_z']].drop_duplicates().dropna()
Browning_introgressed_mis_z = data_2.loc[data_2['Browning_allele_origin'] == 'introgressed', ['annotation', 'mis_z']].drop_duplicates().dropna()
non_SA_mis_z = data.loc[~data['annotation'].isin(data_2['annotation']), ['annotation', 'mis_z']].drop_duplicates().dropna()

In [345]:
len(Browning_ancient_mis_z)

1780

In [346]:
len(Browning_archaic_specific_mis_z)

1931

In [347]:
len(Browning_introgressed_mis_z)

344

In [348]:
len(non_SA_mis_z)

12708

In [349]:
kruskal(Browning_ancient_mis_z['mis_z'], Browning_archaic_specific_mis_z['mis_z'], Browning_introgressed_mis_z['mis_z'], non_SA_mis_z['mis_z'])

KruskalResult(statistic=6.729237191405935, pvalue=0.08104737537983615)

Now for LoF observed/expected.

In [350]:
Browning_ancient_lof_oe = data_2.loc[data_2['Browning_allele_origin'] == 'ancient', ['annotation', 'lof_oe']].drop_duplicates().dropna()
Browning_archaic_specific_lof_oe = data_2.loc[data_2['Browning_allele_origin'] == 'archaic-specific', ['annotation', 'lof_oe']].drop_duplicates().dropna()
Browning_introgressed_lof_oe = data_2.loc[data_2['Browning_allele_origin'] == 'introgressed', ['annotation', 'lof_oe']].drop_duplicates().dropna()
non_SA_lof_oe = data.loc[~data['annotation'].isin(data_2['annotation']), ['annotation', 'lof_oe']].drop_duplicates().dropna()

In [351]:
len(Browning_ancient_lof_oe)

1774

In [352]:
len(Browning_archaic_specific_lof_oe)

1922

In [353]:
len(Browning_introgressed_lof_oe)

344

In [354]:
len(non_SA_lof_oe)

12484

In [355]:
kruskal(Browning_ancient_lof_oe['lof_oe'], Browning_archaic_specific_lof_oe['lof_oe'], Browning_introgressed_lof_oe['lof_oe'], non_SA_lof_oe['lof_oe'])

KruskalResult(statistic=0.67329249199344, pvalue=0.879467110458323)

Now for LoF z-scores.

In [356]:
Browning_ancient_lof_z = data_2.loc[data_2['Browning_allele_origin'] == 'ancient', ['annotation', 'lof_z']].drop_duplicates().dropna()
Browning_archaic_specific_lof_z = data_2.loc[data_2['Browning_allele_origin'] == 'archaic-specific', ['annotation', 'lof_z']].drop_duplicates().dropna()
Browning_introgressed_lof_z = data_2.loc[data_2['Browning_allele_origin'] == 'introgressed', ['annotation', 'lof_z']].drop_duplicates().dropna()
non_SA_lof_z = data.loc[~data['annotation'].isin(data_2['annotation']), ['annotation', 'lof_z']].drop_duplicates().dropna()

In [357]:
len(Browning_ancient_lof_z)

1774

In [358]:
len(Browning_archaic_specific_lof_z)

1922

In [359]:
len(Browning_introgressed_lof_z)

344

In [360]:
len(non_SA_lof_z)

12484

In [361]:
kruskal(Browning_ancient_lof_z['lof_z'], Browning_archaic_specific_lof_z['lof_z'], Browning_introgressed_lof_z['lof_z'], non_SA_lof_z['lof_z'])

KruskalResult(statistic=314.5907082509769, pvalue=6.913256408935334e-68)

And phyloP.

In [362]:
Browning_ancient_phyloP = data_2.loc[data_2['Browning_allele_origin'] == 'ancient', 'phyloP'].dropna()
Browning_archaic_specific_phyloP = data_2.loc[data_2['Browning_allele_origin'] == 'archaic-specific', 'phyloP'].dropna()
Browning_introgressed_phyloP = data_2.loc[data_2['Browning_allele_origin'] == 'introgressed', 'phyloP'].dropna()
non_SA_phyloP = data.loc[data['delta_max'] < 0.2, 'phyloP'].dropna()

In [363]:
len(Browning_ancient_phyloP)

2194

In [364]:
len(Browning_archaic_specific_phyloP)

2343

In [365]:
len(Browning_introgressed_phyloP)

377

In [366]:
len(non_SA_phyloP)

1600192

In [367]:
kruskal(Browning_ancient_phyloP, Browning_archaic_specific_phyloP, Browning_introgressed_phyloP, non_SA_phyloP)

KruskalResult(statistic=765.2680739882495, pvalue=1.4742203520270613e-165)

# Gene Expression <a class = 'anchor' id = 'geneexpression'></a>

## Gene-Level <a class = 'anchor' id = 'genelevel'></a>

In which tissues are archaic splice variants that occur in modern humans expressed? We have data on expression specificity metrics, kindly provided by Mary Lauren Benton. That dataframe contains Ensembl IDs, so first we need to build a dictionary to get the GENCODE annotation.

In [368]:
genes = []
ids = []

# collect the Ensembl gene ID and the gene name from the attributes of each GFF3 record
with open("../annotations/gencode.v24lift37.basic.annotation.gff3") as f:
    for line in f:
        for x in line.split(';'):
            if x.startswith("gene_id="):
                # use re here because there are multiple delimiters ('=' and '.');
                # element 1 is the Ensembl ID without its version suffix
                b = re.split(r'=|\.', x)
                ids.append(b[1])
            elif x.startswith("gene_name="):
                c = x.split("=")
                genes.append(c[1])

ensembl_id_to_gene_id = dict(zip(ids, genes))

Load in Mary Lauren's data.

In [369]:
expression = pd.read_csv("../gene_expression/gene-specificity_GTEx_TPM_alltissues.tsv", sep='\t', header=0)
expression.head(10)

Unnamed: 0,Gene,adipose tissue,adrenal gland,amygdala,basal ganglia,breast,cerebellum,cerebral cortex,"cervix, uterine",colon,endometrium,esophagus,fallopian tube,heart muscle,hippocampal formation,hypothalamus,kidney,liver,lung,midbrain,ovary,pancreas,pituitary gland,prostate,salivary gland,skeletal muscle,skin,small intestine,spinal cord,spleen,stomach,testis,thyroid gland,urinary bladder,vagina,entropy,rel_entropy,tau
0,ENSG00000000003,27.4,15.5,7.3,7.7,32.0,2.5,5.6,30.6,32.6,22.8,35.5,27.7,4.5,7.0,9.9,16.2,23.8,12.0,6.5,50.3,7.6,36.5,17.5,30.7,2.0,8.8,16.2,6.4,7.8,8.8,49.5,18.7,37.6,23.8,0.06833,0.037445,0.639135
1,ENSG00000000005,20.7,0.0,0.1,0.0,11.9,0.0,0.1,0.2,0.6,0.1,0.0,0.3,0.1,0.1,0.1,0.5,0.0,0.0,0.4,1.3,0.0,0.0,0.4,0.4,1.3,3.1,0.5,0.0,0.1,0.0,0.1,0.2,0.1,0.2,0.534839,0.404403,0.974235
2,ENSG00000000419,33.6,37.7,12.4,15.6,34.0,24.9,22.4,35.8,31.5,39.2,33.8,36.5,21.6,14.6,17.4,20.5,17.3,31.5,14.1,33.2,18.2,28.4,29.9,28.0,26.2,30.5,28.6,16.7,34.0,24.7,42.7,35.2,32.7,34.7,0.013113,0.006362,0.364559
3,ENSG00000000457,5.4,4.5,1.5,1.7,6.7,5.2,2.3,7.9,5.4,6.5,5.7,6.3,2.0,1.7,2.1,3.4,3.8,5.1,1.7,5.9,2.5,4.5,6.8,6.7,3.6,6.7,5.4,2.5,5.7,3.8,6.3,7.4,9.1,6.6,0.028105,0.010124,0.48951
4,ENSG00000000460,0.7,0.5,0.2,0.2,0.7,1.5,0.2,0.9,0.6,0.7,0.7,0.5,0.2,0.2,0.3,0.3,0.5,0.5,0.3,0.6,0.2,0.5,0.7,0.6,0.1,0.6,0.6,0.5,0.9,0.4,4.9,0.8,1.0,0.7,0.104682,0.051871,0.984539
5,ENSG00000000938,19.9,4.8,2.7,3.2,13.7,2.4,3.8,9.1,5.0,8.6,3.8,18.7,3.7,3.2,3.7,4.4,4.2,86.1,4.9,2.4,1.7,2.7,6.2,5.8,1.3,3.7,8.2,6.5,124.3,4.1,3.2,7.6,7.1,6.7,0.268393,0.194874,0.933421
6,ENSG00000000971,122.4,27.3,5.0,7.5,54.3,3.5,5.5,30.9,26.3,12.5,30.7,27.5,163.3,5.8,6.0,19.1,286.4,73.2,10.1,166.9,4.3,17.5,24.2,107.1,9.9,71.9,41.9,11.3,9.0,22.1,11.5,117.8,97.4,26.6,0.170275,0.163749,0.855013
7,ENSG00000001036,27.4,35.1,7.1,7.8,24.8,5.9,6.3,32.5,23.3,26.6,11.1,27.7,14.0,7.9,10.7,20.7,15.2,27.3,10.5,26.8,12.7,16.8,16.3,21.7,3.0,11.5,23.9,17.6,22.4,18.8,11.1,26.6,24.9,20.7,0.031989,0.019567,0.497885
8,ENSG00000001084,11.0,9.9,7.6,9.8,9.1,5.7,8.6,9.6,13.9,8.4,18.0,10.5,3.1,8.2,8.1,6.1,12.6,7.0,9.6,6.0,3.7,5.5,12.1,10.7,3.7,8.6,15.6,10.9,15.8,11.0,5.2,12.4,44.5,14.3,0.043669,0.033442,0.787334
9,ENSG00000001167,9.9,7.4,6.0,5.9,10.9,11.2,6.3,12.4,7.2,10.9,8.7,11.3,4.7,6.4,6.0,5.7,4.5,12.7,6.8,14.1,4.3,9.9,9.6,8.3,4.5,8.6,8.9,10.6,10.4,6.1,42.3,17.6,11.2,9.9,0.041457,0.015504,0.793037


Now map our dictionary.

In [370]:
expression['annotation'] = expression['Gene'].map(ensembl_id_to_gene_id)
expression.head(10)

Unnamed: 0,Gene,adipose tissue,adrenal gland,amygdala,basal ganglia,breast,cerebellum,cerebral cortex,"cervix, uterine",colon,endometrium,esophagus,fallopian tube,heart muscle,hippocampal formation,hypothalamus,kidney,liver,lung,midbrain,ovary,pancreas,pituitary gland,prostate,salivary gland,skeletal muscle,skin,small intestine,spinal cord,spleen,stomach,testis,thyroid gland,urinary bladder,vagina,entropy,rel_entropy,tau,annotation
0,ENSG00000000003,27.4,15.5,7.3,7.7,32.0,2.5,5.6,30.6,32.6,22.8,35.5,27.7,4.5,7.0,9.9,16.2,23.8,12.0,6.5,50.3,7.6,36.5,17.5,30.7,2.0,8.8,16.2,6.4,7.8,8.8,49.5,18.7,37.6,23.8,0.06833,0.037445,0.639135,TSPAN6
1,ENSG00000000005,20.7,0.0,0.1,0.0,11.9,0.0,0.1,0.2,0.6,0.1,0.0,0.3,0.1,0.1,0.1,0.5,0.0,0.0,0.4,1.3,0.0,0.0,0.4,0.4,1.3,3.1,0.5,0.0,0.1,0.0,0.1,0.2,0.1,0.2,0.534839,0.404403,0.974235,TNMD
2,ENSG00000000419,33.6,37.7,12.4,15.6,34.0,24.9,22.4,35.8,31.5,39.2,33.8,36.5,21.6,14.6,17.4,20.5,17.3,31.5,14.1,33.2,18.2,28.4,29.9,28.0,26.2,30.5,28.6,16.7,34.0,24.7,42.7,35.2,32.7,34.7,0.013113,0.006362,0.364559,DPM1
3,ENSG00000000457,5.4,4.5,1.5,1.7,6.7,5.2,2.3,7.9,5.4,6.5,5.7,6.3,2.0,1.7,2.1,3.4,3.8,5.1,1.7,5.9,2.5,4.5,6.8,6.7,3.6,6.7,5.4,2.5,5.7,3.8,6.3,7.4,9.1,6.6,0.028105,0.010124,0.48951,SCYL3
4,ENSG00000000460,0.7,0.5,0.2,0.2,0.7,1.5,0.2,0.9,0.6,0.7,0.7,0.5,0.2,0.2,0.3,0.3,0.5,0.5,0.3,0.6,0.2,0.5,0.7,0.6,0.1,0.6,0.6,0.5,0.9,0.4,4.9,0.8,1.0,0.7,0.104682,0.051871,0.984539,C1orf112
5,ENSG00000000938,19.9,4.8,2.7,3.2,13.7,2.4,3.8,9.1,5.0,8.6,3.8,18.7,3.7,3.2,3.7,4.4,4.2,86.1,4.9,2.4,1.7,2.7,6.2,5.8,1.3,3.7,8.2,6.5,124.3,4.1,3.2,7.6,7.1,6.7,0.268393,0.194874,0.933421,FGR
6,ENSG00000000971,122.4,27.3,5.0,7.5,54.3,3.5,5.5,30.9,26.3,12.5,30.7,27.5,163.3,5.8,6.0,19.1,286.4,73.2,10.1,166.9,4.3,17.5,24.2,107.1,9.9,71.9,41.9,11.3,9.0,22.1,11.5,117.8,97.4,26.6,0.170275,0.163749,0.855013,CFH
7,ENSG00000001036,27.4,35.1,7.1,7.8,24.8,5.9,6.3,32.5,23.3,26.6,11.1,27.7,14.0,7.9,10.7,20.7,15.2,27.3,10.5,26.8,12.7,16.8,16.3,21.7,3.0,11.5,23.9,17.6,22.4,18.8,11.1,26.6,24.9,20.7,0.031989,0.019567,0.497885,FUCA2
8,ENSG00000001084,11.0,9.9,7.6,9.8,9.1,5.7,8.6,9.6,13.9,8.4,18.0,10.5,3.1,8.2,8.1,6.1,12.6,7.0,9.6,6.0,3.7,5.5,12.1,10.7,3.7,8.6,15.6,10.9,15.8,11.0,5.2,12.4,44.5,14.3,0.043669,0.033442,0.787334,GCLC
9,ENSG00000001167,9.9,7.4,6.0,5.9,10.9,11.2,6.3,12.4,7.2,10.9,8.7,11.3,4.7,6.4,6.0,5.7,4.5,12.7,6.8,14.1,4.3,9.9,9.6,8.3,4.5,8.6,8.9,10.6,10.4,6.1,42.3,17.6,11.2,9.9,0.041457,0.015504,0.793037,NFYA
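
As a quick illustration of what the mapping step does, here is a minimal sketch on toy data. The `toy_lookup` dictionary and `toy_expression` dataframe are hypothetical stand-ins for `ensembl_id_to_gene_id` and `expression`, and the third Ensembl ID is made up. The point is that `Series.map` leaves IDs missing from the dictionary as NaN rather than raising an error, so it can be worth counting them.

```python
import pandas as pd

# Toy lookup table (hypothetical); the real dictionary is built earlier in the notebook.
toy_lookup = {"ENSG00000000003": "TSPAN6", "ENSG00000000005": "TNMD"}

# Toy expression table with one ID that has no entry in the lookup.
toy_expression = pd.DataFrame(
    {"Gene": ["ENSG00000000003", "ENSG00000000005", "ENSG_NOT_IN_LOOKUP"]}
)
toy_expression["annotation"] = toy_expression["Gene"].map(toy_lookup)

# IDs without a dictionary entry come back as NaN.
n_unmapped = int(toy_expression["annotation"].isna().sum())
print(toy_expression)
print("unmapped IDs:", n_unmapped)
```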


In [371]:
len(expression)

18392

Let's reorder the columns.

In [372]:
expression = expression[['Gene','annotation','entropy','rel_entropy','tau','adipose tissue','adrenal gland','amygdala','basal ganglia','breast','cerebellum','cerebral cortex','cervix, uterine','colon','endometrium','esophagus','fallopian tube','heart muscle','hippocampal formation','hypothalamus','kidney','liver','lung','midbrain','ovary','pancreas','pituitary gland','prostate','salivary gland','skeletal muscle','skin','small intestine','spinal cord','spleen','stomach','testis','thyroid gland','urinary bladder','vagina']]
expression.head(10)

Unnamed: 0,Gene,annotation,entropy,rel_entropy,tau,adipose tissue,adrenal gland,amygdala,basal ganglia,breast,cerebellum,cerebral cortex,"cervix, uterine",colon,endometrium,esophagus,fallopian tube,heart muscle,hippocampal formation,hypothalamus,kidney,liver,lung,midbrain,ovary,pancreas,pituitary gland,prostate,salivary gland,skeletal muscle,skin,small intestine,spinal cord,spleen,stomach,testis,thyroid gland,urinary bladder,vagina
0,ENSG00000000003,TSPAN6,0.06833,0.037445,0.639135,27.4,15.5,7.3,7.7,32.0,2.5,5.6,30.6,32.6,22.8,35.5,27.7,4.5,7.0,9.9,16.2,23.8,12.0,6.5,50.3,7.6,36.5,17.5,30.7,2.0,8.8,16.2,6.4,7.8,8.8,49.5,18.7,37.6,23.8
1,ENSG00000000005,TNMD,0.534839,0.404403,0.974235,20.7,0.0,0.1,0.0,11.9,0.0,0.1,0.2,0.6,0.1,0.0,0.3,0.1,0.1,0.1,0.5,0.0,0.0,0.4,1.3,0.0,0.0,0.4,0.4,1.3,3.1,0.5,0.0,0.1,0.0,0.1,0.2,0.1,0.2
2,ENSG00000000419,DPM1,0.013113,0.006362,0.364559,33.6,37.7,12.4,15.6,34.0,24.9,22.4,35.8,31.5,39.2,33.8,36.5,21.6,14.6,17.4,20.5,17.3,31.5,14.1,33.2,18.2,28.4,29.9,28.0,26.2,30.5,28.6,16.7,34.0,24.7,42.7,35.2,32.7,34.7
3,ENSG00000000457,SCYL3,0.028105,0.010124,0.48951,5.4,4.5,1.5,1.7,6.7,5.2,2.3,7.9,5.4,6.5,5.7,6.3,2.0,1.7,2.1,3.4,3.8,5.1,1.7,5.9,2.5,4.5,6.8,6.7,3.6,6.7,5.4,2.5,5.7,3.8,6.3,7.4,9.1,6.6
4,ENSG00000000460,C1orf112,0.104682,0.051871,0.984539,0.7,0.5,0.2,0.2,0.7,1.5,0.2,0.9,0.6,0.7,0.7,0.5,0.2,0.2,0.3,0.3,0.5,0.5,0.3,0.6,0.2,0.5,0.7,0.6,0.1,0.6,0.6,0.5,0.9,0.4,4.9,0.8,1.0,0.7
5,ENSG00000000938,FGR,0.268393,0.194874,0.933421,19.9,4.8,2.7,3.2,13.7,2.4,3.8,9.1,5.0,8.6,3.8,18.7,3.7,3.2,3.7,4.4,4.2,86.1,4.9,2.4,1.7,2.7,6.2,5.8,1.3,3.7,8.2,6.5,124.3,4.1,3.2,7.6,7.1,6.7
6,ENSG00000000971,CFH,0.170275,0.163749,0.855013,122.4,27.3,5.0,7.5,54.3,3.5,5.5,30.9,26.3,12.5,30.7,27.5,163.3,5.8,6.0,19.1,286.4,73.2,10.1,166.9,4.3,17.5,24.2,107.1,9.9,71.9,41.9,11.3,9.0,22.1,11.5,117.8,97.4,26.6
7,ENSG00000001036,FUCA2,0.031989,0.019567,0.497885,27.4,35.1,7.1,7.8,24.8,5.9,6.3,32.5,23.3,26.6,11.1,27.7,14.0,7.9,10.7,20.7,15.2,27.3,10.5,26.8,12.7,16.8,16.3,21.7,3.0,11.5,23.9,17.6,22.4,18.8,11.1,26.6,24.9,20.7
8,ENSG00000001084,GCLC,0.043669,0.033442,0.787334,11.0,9.9,7.6,9.8,9.1,5.7,8.6,9.6,13.9,8.4,18.0,10.5,3.1,8.2,8.1,6.1,12.6,7.0,9.6,6.0,3.7,5.5,12.1,10.7,3.7,8.6,15.6,10.9,15.8,11.0,5.2,12.4,44.5,14.3
9,ENSG00000001167,NFYA,0.041457,0.015504,0.793037,9.9,7.4,6.0,5.9,10.9,11.2,6.3,12.4,7.2,10.9,8.7,11.3,4.7,6.4,6.0,5.7,4.5,12.7,6.8,14.1,4.3,9.9,9.6,8.3,4.5,8.6,8.9,10.6,10.4,6.1,42.3,17.6,11.2,9.9


And save the dataframe while we're at it.

In [373]:
expression.to_csv('../gene_expression/GTEX_gene_expression_with_gene_ids.txt', sep="\t", header=True, index=False)

Let's examine how the tissue specificity of gene expression (relative entropy) relates to the number of SAVs per gene. Remember our N splice variants dataframe from earlier?

In [374]:
n_splice_variants_2.reset_index(inplace=True)
n_splice_variants_2.head(10)

Unnamed: 0,name,variant_count_2
0,A1BG,1
1,AADACL4,1
2,AADAT,1
3,AAMP,2
4,AAR2,1
5,AARS2,1
6,AATF,1
7,AATK,1
8,ABAT,1
9,ABC7-42404400C24.1,1


In [375]:
entropy_N_SAVs = n_splice_variants_2.rename(columns={'name':'annotation'})
entropy_N_SAVs = pd.merge(expression[['annotation','rel_entropy']], entropy_N_SAVs, on='annotation', how='left').dropna()
entropy_N_SAVs.head(10)

Unnamed: 0,annotation,rel_entropy,variant_count_2
4,C1orf112,0.051871,1.0
8,GCLC,0.033442,1.0
15,CFTR,0.462603,1.0
16,ANKIB1,0.010808,1.0
21,LAP3,0.020556,1.0
24,AOC1,0.483678,1.0
26,HECW1,0.30024,1.0
27,MAD1L1,0.010812,1.0
28,LASP1,0.013782,1.0
30,TMEM176A,0.180257,1.0


In [376]:
entropy_bins = [-np.inf, 0.1, 0.5, np.inf]
entropy_N_SAVs['rel_entropy_bin'] = pd.cut(entropy_N_SAVs['rel_entropy'], entropy_bins, labels=['low', 'medium', 'high'])
entropy_N_SAVs.head(10)

Unnamed: 0,annotation,rel_entropy,variant_count_2,rel_entropy_bin
4,C1orf112,0.051871,1.0,low
8,GCLC,0.033442,1.0,low
15,CFTR,0.462603,1.0,medium
16,ANKIB1,0.010808,1.0,low
21,LAP3,0.020556,1.0,low
24,AOC1,0.483678,1.0,medium
26,HECW1,0.30024,1.0,medium
27,MAD1L1,0.010812,1.0,low
28,LASP1,0.013782,1.0,low
30,TMEM176A,0.180257,1.0,medium
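
The binning above relies on `pd.cut` producing right-closed intervals, so a value sitting exactly on an edge goes into the lower bin. A small sketch on toy values (the `toy_rel_entropy` series is hypothetical, apart from two values borrowed from the table above) shows where the edges 0.1 and 0.5 end up.

```python
import numpy as np
import pandas as pd

# Same bin edges and labels as in the notebook cell above.
entropy_bins = [-np.inf, 0.1, 0.5, np.inf]

# Toy values, including two that sit exactly on a bin edge.
toy_rel_entropy = pd.Series([0.051871, 0.1, 0.462603, 0.5, 0.75])
toy_bins = pd.cut(toy_rel_entropy, entropy_bins, labels=["low", "medium", "high"])

# Intervals are right-closed: 0.1 is 'low' and 0.5 is 'medium'.
bin_labels = toy_bins.astype(str).tolist()
print(bin_labels)
```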


In [377]:
len(entropy_N_SAVs)

4061

In [378]:
low_entropy = entropy_N_SAVs.loc[entropy_N_SAVs['rel_entropy_bin'] == 'low', 'variant_count_2']
med_entropy = entropy_N_SAVs.loc[entropy_N_SAVs['rel_entropy_bin'] == 'medium', 'variant_count_2']
high_entropy = entropy_N_SAVs.loc[entropy_N_SAVs['rel_entropy_bin'] == 'high', 'variant_count_2']

In [379]:
len(low_entropy)

2434

In [380]:
len(med_entropy)

1268

In [381]:
len(high_entropy)

359

In [382]:
kruskal(low_entropy, med_entropy, high_entropy)

KruskalResult(statistic=1.8914270893492804, pvalue=0.3884023295875893)
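
The Kruskal-Wallis test asks whether at least one group tends to have larger values than the others, so a p-value of about 0.39 gives no evidence that SAV counts differ among the entropy bins. For contrast, here is a sketch on toy groups that are fully separated; the three lists are made up and only illustrate how to read the result object.

```python
from scipy.stats import kruskal

# Toy groups with no overlap at all, so the test should flag a difference.
low = [1, 2, 3, 4, 5]
medium = [6, 7, 8, 9, 10]
high = [11, 12, 13, 14, 15]

result = kruskal(low, medium, high)

# Compare the p-value with a conventional 0.05 cutoff (an assumption here).
differs = result.pvalue < 0.05
print(result.statistic, result.pvalue, differs)
```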

Now let's look at the relationship between gene expression and delta max.

In [383]:
entropy_delta = data_2[['annotation','delta_max']]
entropy_delta

Unnamed: 0,annotation,delta_max
38,SAMD11,0.22
79,SAMD11,0.29
145,PLEKHN1,0.29
170,AGRN,0.24
225,AGRN,0.22
...,...,...
1638703,EHMT1,0.33
1638715,EHMT1,0.31
1638752,CACNA1B,0.28
1639101,CACNA1B,0.25


Map on the relative entropy values.

In [384]:
entropy_delta = pd.merge(entropy_delta, expression[['annotation','rel_entropy']], on='annotation', how='left')
entropy_delta = entropy_delta.dropna(axis = 0)
entropy_delta.head(10)

Unnamed: 0,annotation,delta_max,rel_entropy
0,SAMD11,0.22,0.08947
1,SAMD11,0.29,0.08947
2,PLEKHN1,0.29,0.29562
3,AGRN,0.24,0.036305
4,AGRN,0.22,0.036305
5,AGRN,0.43,0.036305
6,AGRN,0.48,0.036305
7,TTLL10,0.27,0.411293
8,SDF4,0.38,0.008625
9,SDF4,0.52,0.008625


In [385]:
len(entropy_delta)

5700

In [386]:
entropy_delta['rel_entropy_bin'] = pd.cut(entropy_delta['rel_entropy'], entropy_bins, labels=['low', 'medium', 'high'])
entropy_delta.head(10)

Unnamed: 0,annotation,delta_max,rel_entropy,rel_entropy_bin
0,SAMD11,0.22,0.08947,low
1,SAMD11,0.29,0.08947,low
2,PLEKHN1,0.29,0.29562,medium
3,AGRN,0.24,0.036305,low
4,AGRN,0.22,0.036305,low
5,AGRN,0.43,0.036305,low
6,AGRN,0.48,0.036305,low
7,TTLL10,0.27,0.411293,medium
8,SDF4,0.38,0.008625,low
9,SDF4,0.52,0.008625,low


In [387]:
low_entropy = entropy_delta.loc[entropy_delta['rel_entropy_bin'] == 'low', 'delta_max']
med_entropy = entropy_delta.loc[entropy_delta['rel_entropy_bin'] == 'medium', 'delta_max']
high_entropy = entropy_delta.loc[entropy_delta['rel_entropy_bin'] == 'high', 'delta_max']

In [388]:
len(low_entropy)

3384

In [389]:
len(med_entropy)

1826

In [390]:
len(high_entropy)

490

In [391]:
kruskal(low_entropy, med_entropy, high_entropy)

KruskalResult(statistic=6.599476430123113, pvalue=0.03689282412288473)

## sQTLs <a class = 'anchor' id = 'sqtls'></a>

Now let's examine overlap between our variants and sQTLs from GTEx.

In [392]:
data.groupby('sQTL').size().to_frame('N')

Unnamed: 0_level_0,N
sQTL,Unnamed: 1_level_1
no,1359027
yes,248323


In [393]:
non_sQTLs = data[data['N_GTEx_tissues'].isnull()]
sQTLs = data[~data['N_GTEx_tissues'].isnull()]

In [394]:
len(non_sQTLs)

1359027

In [395]:
len(sQTLs)

248323

In [396]:
sQTLs['delta_max'].min()

0.0

In [397]:
sQTLs['delta_max'].max()

1.0

In [398]:
sQTLs['delta_max'].mean() - non_sQTLs['delta_max'].mean()

0.0025026787676935133

In [399]:
mannwhitneyu(non_sQTLs['delta_max'], sQTLs['delta_max'])

MannwhitneyuResult(statistic=160822988988.5, pvalue=0.0)
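
With more than 1.6 million variants, a mean difference of about 0.0025 is enough for the p-value to be reported as 0.0, so an effect size may be more informative than the p-value alone. One option is the probability of superiority, P(X > Y) + 0.5 * P(X = Y), which equals the Mann-Whitney U of the first sample divided by the product of the sample sizes. The sketch below is not part of the original analysis; the function name and toy values are hypothetical.

```python
from bisect import bisect_left, bisect_right


def prob_superiority(x, y):
    """Estimate P(X > Y) + 0.5 * P(X == Y) from two samples."""
    y_sorted = sorted(y)
    wins = 0.0
    for value in x:
        below = bisect_left(y_sorted, value)          # y values strictly smaller
        ties = bisect_right(y_sorted, value) - below  # y values equal to this x
        wins += below + 0.5 * ties
    return wins / (len(x) * len(y))


# Toy delta max values (made up) for sQTLs and non-sQTLs.
toy_sqtl_delta = [0.02, 0.05, 0.30, 0.41]
toy_non_sqtl_delta = [0.01, 0.02, 0.03, 0.25]

effect = prob_superiority(toy_sqtl_delta, toy_non_sqtl_delta)
print(effect)
```

A value near 0.5 means the two groups are nearly indistinguishable in rank, whatever the p-value says.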

Let's save the sQTL variants, including their delta max values, just in case.

In [400]:
sQTLs.to_csv('sQTLs.txt', sep="\t", header=True, index=False)

Now let's look at SAV sQTLs.

In [401]:
data_2.groupby('sQTL').size().to_frame('N')

Unnamed: 0_level_0,N
sQTL,Unnamed: 1_level_1
no,4569
yes,1381


In [402]:
data_2.groupby('N_GTEx_tissues').size().reset_index(name='N')

Unnamed: 0,N_GTEx_tissues,N
0,1.0,297
1,2.0,130
2,3.0,93
3,4.0,76
4,5.0,57
5,6.0,39
6,7.0,36
7,8.0,45
8,9.0,32
9,10.0,25


In [403]:
non_SAV_sQTLs = data_2[data_2['N_GTEx_tissues'].isnull()]
SAV_sQTLs = data_2[~data_2['N_GTEx_tissues'].isnull()]

In [404]:
len(non_SAV_sQTLs)

4569

In [405]:
len(SAV_sQTLs)

1381

What does their distribution look like?

In [406]:
SAV_sQTLs.groupby(['Vernot_allele_origin']).size().to_frame('N')

Unnamed: 0_level_0,N
Vernot_allele_origin,Unnamed: 1_level_1
ancient,1145
introgressed,50
low-confidence ancient,186


In [407]:
SAV_sQTLs.groupby(['Browning_allele_origin']).size().to_frame('N')

Unnamed: 0_level_0,N
Browning_allele_origin,Unnamed: 1_level_1
ancient,1113
introgressed,92
low-confidence ancient,176


Check the tissue distribution for tissue-specific variants.

In [408]:
tissues = data_2[['Adipose_Subcutaneous','Adipose_Visceral_Omentum','Adrenal_Gland','Artery_Aorta','Artery_Coronary','Artery_Tibial','Brain_Amygdala','Brain_Anterior_cingulate_cortex_BA24','Brain_Caudate_basal_ganglia','Brain_Cerebellar_Hemisphere','Brain_Cerebellum','Brain_Cortex','Brain_Frontal_Cortex_BA9','Brain_Hippocampus','Brain_Hypothalamus','Brain_Nucleus_accumbens_basal_ganglia','Brain_Putamen_basal_ganglia','Brain_Spinal_cord_cervical_c-1','Brain_Substantia_nigra','Breast_Mammary_Tissue','Cells_Cultured_fibroblasts','Cells_EBV-transformed_lymphocytes','Colon_Sigmoid','Colon_Transverse','Esophagus_Gastroesophageal_Junction','Esophagus_Mucosa','Esophagus_Muscularis','Heart_Atrial_Appendage','Heart_Left_Ventricle','Kidney_Cortex','Liver','Lung','Minor_Salivary_Gland','Muscle_Skeletal','Nerve_Tibial','Ovary','Pancreas','Pituitary','Prostate','Skin_Not_Sun_Exposed_Suprapubic','Skin_Sun_Exposed_Lower_leg','Small_Intestine_Terminal_Ileum','Spleen','Stomach','Testis','Thyroid','Uterus','Vagina','Whole_Blood','N_GTEx_tissues']]
tissues.head(10)

Unnamed: 0,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues
38,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
79,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
145,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
170,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
225,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,15.0
233,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,14.0
239,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
345,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,5.0
392,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
424,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,49.0


In [409]:
tissues[tissues['N_GTEx_tissues'] < 3].sum().to_frame('N_sQTLs')

Unnamed: 0,N_sQTLs
Adipose_Subcutaneous,14.0
Adipose_Visceral_Omentum,12.0
Adrenal_Gland,1.0
Artery_Aorta,11.0
Artery_Coronary,1.0
Artery_Tibial,29.0
Brain_Amygdala,1.0
Brain_Anterior_cingulate_cortex_BA24,2.0
Brain_Caudate_basal_ganglia,2.0
Brain_Cerebellar_Hemisphere,5.0


In [410]:
data_2[data_2['N_GTEx_tissues'] > 40].groupby('Vernot_allele_origin').size().to_frame('N_sQTLs')

Unnamed: 0_level_0,N_sQTLs
Vernot_allele_origin,Unnamed: 1_level_1
ancient,107
low-confidence ancient,5


In [411]:
data_2[data_2['N_GTEx_tissues'] > 40].groupby('Browning_allele_origin').size().to_frame('N_sQTLs')

Unnamed: 0_level_0,N_sQTLs
Browning_allele_origin,Unnamed: 1_level_1
ancient,107
low-confidence ancient,5


In [412]:
data_2[data_2['Vernot_allele_origin'] == 'ancient'].groupby('N_GTEx_tissues').size().to_frame('N')

Unnamed: 0_level_0,N
N_GTEx_tissues,Unnamed: 1_level_1
1.0,225
2.0,98
3.0,76
4.0,60
5.0,48
6.0,30
7.0,32
8.0,39
9.0,25
10.0,21


In [413]:
data_2[data_2['Browning_allele_origin'] == 'ancient'].groupby('N_GTEx_tissues').size().to_frame('N')

Unnamed: 0_level_0,N
N_GTEx_tissues,Unnamed: 1_level_1
1.0,220
2.0,98
3.0,75
4.0,56
5.0,47
6.0,30
7.0,31
8.0,37
9.0,25
10.0,21


In [414]:
data_2[data_2['N_GTEx_tissues'] > 40].groupby('distribution').size().to_frame('N_sQTLs')

Unnamed: 0_level_0,N_sQTLs
distribution,Unnamed: 1_level_1
Altai,2
Chagyrskaya,3
Denisovan,8
Late Neanderthal,2
Neanderthal,9
Other,14
Shared,74


In [415]:
data_2[data_2['Vernot_allele_origin'] == 'introgressed'].groupby('N_GTEx_tissues').size().to_frame('N')

Unnamed: 0_level_0,N
N_GTEx_tissues,Unnamed: 1_level_1
1.0,12
2.0,10
3.0,1
4.0,2
5.0,4
6.0,2
7.0,1
9.0,2
15.0,1
16.0,2


In [416]:
data_2[data_2['Browning_allele_origin'] == 'introgressed'].groupby('N_GTEx_tissues').size().to_frame('N')

Unnamed: 0_level_0,N
N_GTEx_tissues,Unnamed: 1_level_1
1.0,20
2.0,12
3.0,2
4.0,7
5.0,6
6.0,4
7.0,3
8.0,2
9.0,2
11.0,4


Do ancient and introgressed sQTLs differ in the average number of tissues? First let's look at Vernot.

In [417]:
Vernot_ancient_SAV_sQTLs = SAV_sQTLs[SAV_sQTLs['Vernot_allele_origin'] == 'ancient']
Vernot_introgressed_SAV_sQTLs = SAV_sQTLs[SAV_sQTLs['Vernot_allele_origin'] == 'introgressed']

In [418]:
len(Vernot_ancient_SAV_sQTLs)

1145

In [419]:
len(Vernot_introgressed_SAV_sQTLs)

50

In [420]:
Vernot_ancient_SAV_sQTLs['N_GTEx_tissues'].mean() - Vernot_introgressed_SAV_sQTLs['N_GTEx_tissues'].mean()

4.7151965065502175

In [421]:
mannwhitneyu(Vernot_ancient_SAV_sQTLs['N_GTEx_tissues'], Vernot_introgressed_SAV_sQTLs['N_GTEx_tissues'])

MannwhitneyuResult(statistic=33629.5, pvalue=0.03532959447847716)

Now Browning.

In [422]:
Browning_ancient_SAV_sQTLs = SAV_sQTLs[SAV_sQTLs['Browning_allele_origin'] == 'ancient']
Browning_introgressed_SAV_sQTLs = SAV_sQTLs[SAV_sQTLs['Browning_allele_origin'] == 'introgressed']

In [423]:
len(Browning_ancient_SAV_sQTLs)

1113

In [424]:
len(Browning_introgressed_SAV_sQTLs)

92

In [425]:
Browning_ancient_SAV_sQTLs['N_GTEx_tissues'].mean() - Browning_introgressed_SAV_sQTLs['N_GTEx_tissues'].mean()

4.820998867143249

In [426]:
mannwhitneyu(Browning_ancient_SAV_sQTLs['N_GTEx_tissues'], Browning_introgressed_SAV_sQTLs['N_GTEx_tissues'])

MannwhitneyuResult(statistic=58133.5, pvalue=0.029857195030446277)

Let's take a quick peek at European allele frequencies and the number of sQTL SAVs, given that many GTEx samples come from individuals with European ancestry.

In [427]:
EUR_SAVs = data_2.groupby('1KG_EUR_AF').size().to_frame('N_SAVs')
EUR_SAVs = EUR_SAVs.reset_index()
EUR_SAVs = EUR_SAVs.sort_values('1KG_EUR_AF', ascending=False)
EUR_SAVs

Unnamed: 0,1KG_EUR_AF,N_SAVs
100,1.0,171
99,0.99,25
98,0.98,19
97,0.97,11
96,0.96,15
95,0.95,16
94,0.94,21
93,0.93,11
92,0.92,14
91,0.91,21


In [428]:
EUR_SAV_sQTLs = SAV_sQTLs.groupby('1KG_EUR_AF').size().to_frame('N_SAV_sQTLs')
EUR_SAV_sQTLs = EUR_SAV_sQTLs.reset_index()
EUR_SAV_sQTLs

Unnamed: 0,1KG_EUR_AF,N_SAV_sQTLs
0,0.0,49
1,0.01,22
2,0.02,29
3,0.03,28
4,0.04,29
5,0.05,26
6,0.06,23
7,0.07,19
8,0.08,22
9,0.09,25


In [429]:
EUR_sQTL_proportion = pd.merge(EUR_SAVs, EUR_SAV_sQTLs, on='1KG_EUR_AF')
bins = [0,0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,0.7,0.75,0.8,0.85,0.9,0.95,1]
EUR_sQTL_proportion['binned_AF'] = pd.cut(EUR_sQTL_proportion['1KG_EUR_AF'], bins)
EUR_sQTL_proportion = EUR_sQTL_proportion.groupby('binned_AF').sum().reset_index()
EUR_sQTL_proportion['proportion_sQTLs'] = (EUR_sQTL_proportion['N_SAV_sQTLs'])/(EUR_sQTL_proportion['N_SAVs'])
EUR_sQTL_proportion['binned_AF'] = EUR_sQTL_proportion['binned_AF'].astype(str)
EUR_sQTL_proportion

Unnamed: 0,binned_AF,1KG_EUR_AF,N_SAVs,N_SAV_sQTLs,proportion_sQTLs
0,"(0.0, 0.05]",0.15,447,134,0.299776
1,"(0.05, 0.1]",0.4,208,108,0.519231
2,"(0.1, 0.15]",0.65,148,84,0.567568
3,"(0.15, 0.2]",0.9,137,85,0.620438
4,"(0.2, 0.25]",1.15,110,68,0.618182
5,"(0.25, 0.3]",1.4,114,75,0.657895
6,"(0.3, 0.35]",1.65,89,61,0.685393
7,"(0.35, 0.4]",1.9,108,63,0.583333
8,"(0.4, 0.45]",2.15,97,62,0.639175
9,"(0.45, 0.5]",2.4,99,66,0.666667
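
One thing to keep in mind when reading this table: `pd.cut` builds right-closed intervals by default, so an allele frequency of exactly 0 falls outside `(0.0, 0.05]` and those rows do not appear in any bin. Whether that exclusion is intended is not stated here. The sketch below uses a made-up `toy_af` series to show the default behavior next to `include_lowest=True`.

```python
import pandas as pd

# Toy allele frequencies, including one that is exactly 0.
toy_af = pd.Series([0.0, 0.03, 0.05, 0.07])
edges = [0, 0.05, 0.1]

# Default: the first interval is (0.0, 0.05], so 0.0 is not binned.
default_bins = pd.cut(toy_af, edges)

# include_lowest=True makes the first interval include its left edge.
inclusive_bins = pd.cut(toy_af, edges, include_lowest=True)

n_dropped_default = int(default_bins.isna().sum())
n_dropped_inclusive = int(inclusive_bins.isna().sum())
print(n_dropped_default, n_dropped_inclusive)
```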


In [430]:
EUR_sQTL_proportion.to_csv('../GTEx_sQTLs/EUR_sQTLs.txt', sep="\t", header=True, index=False)

# Purifying Selection <a class = 'anchor' id = 'purifyingselection'></a>

Now let's consider the role of purifying selection on SAVs. Start by writing a function that runs Fisher's exact test on each row of a dataframe of counts and adds columns with the enrichment test information: the OR, p-value, and 95% CI.

In [431]:
def purifying_enrichment(df,A,B,C,D):
    odds_ratio = []
    p_value = []
    lower_CI = []
    upper_CI = []

    for i in range(len(df)):
        table = np.array([[df.loc[i, A],
                           df.loc[i, B]],
                          [df.loc[i, C],
                           df.loc[i, D]]])
        OR, p = fisher_exact(table)
    
        lCI = np.exp((np.log(OR)) - (1.96 * (sqrt((1/table[0,0]) + (1/table[0,1]) + (1/table[1,0]) + (1/table[1,1])))))
        uCI = np.exp((np.log(OR)) + (1.96 * (sqrt((1/table[0,0]) + (1/table[0,1]) + (1/table[1,0]) + (1/table[1,1])))))
    
        odds_ratio.append(OR)
        p_value.append(p)
        lower_CI.append(lCI)
        upper_CI.append(uCI)
    
    df['odds_ratio'] = odds_ratio
    df['p_value'] = p_value
    df['lower_CI'] = lower_CI
    df['upper_CI'] = upper_CI

## Varied Deltas <a class = 'anchor' id = 'varieddeltas'></a>

Now assemble the data, starting with the pooled counts of variants unique to a single archaic individual (Altai, Chagyrskaya, Denisovan, or Vindija).

In [432]:
all_unique_total = data[(data['distribution']=='Altai') | (data['distribution']=='Chagyrskaya') | (data['distribution']=='Denisovan') | (data['distribution']=='Vindija')]
all_unique_total_SA_2 = all_unique_total[(all_unique_total['delta_max']>=0.2)]
all_unique_total_SA_2_index = all_unique_total_SA_2.index
all_unique_total_non_SA_2 = all_unique_total.drop(all_unique_total_SA_2_index)

In [433]:
all_unique_total = data[(data['distribution']=='Altai') | (data['distribution']=='Chagyrskaya') | (data['distribution']=='Denisovan') | (data['distribution']=='Vindija')]
all_unique_total_SA_3 = all_unique_total[(all_unique_total['delta_max']>=0.3)]
all_unique_total_SA_3_index = all_unique_total_SA_3.index
all_unique_total_non_SA_3 = all_unique_total.drop(all_unique_total_SA_3_index)

In [434]:
all_unique_total = data[(data['distribution']=='Altai') | (data['distribution']=='Chagyrskaya') | (data['distribution']=='Denisovan') | (data['distribution']=='Vindija')]
all_unique_total_SA_4 = all_unique_total[(all_unique_total['delta_max']>=0.4)]
all_unique_total_SA_4_index = all_unique_total_SA_4.index
all_unique_total_non_SA_4 = all_unique_total.drop(all_unique_total_SA_4_index)

In [435]:
all_unique_total = data[(data['distribution']=='Altai') | (data['distribution']=='Chagyrskaya') | (data['distribution']=='Denisovan') | (data['distribution']=='Vindija')]
all_unique_total_SA_5 = all_unique_total[(all_unique_total['delta_max']>=0.5)]
all_unique_total_SA_5_index = all_unique_total_SA_5.index
all_unique_total_non_SA_5 = all_unique_total.drop(all_unique_total_SA_5_index)

Now for the shared SAVs. We'll use some of these again below.

In [436]:
all_shared = data[(data['distribution']=='Shared')]
all_shared_SAVs_2 = all_shared[(all_shared['delta_max']>=0.2)]
all_shared_SAVs_2_index = all_shared_SAVs_2.index
all_shared_non_SAVs_2 = all_shared.drop(all_shared_SAVs_2_index)

In [437]:
all_shared = data[(data['distribution']=='Shared')]
all_shared_SAVs_3 = all_shared[(all_shared['delta_max']>=0.3)]
all_shared_SAVs_3_index = all_shared_SAVs_3.index
all_shared_non_SAVs_3 = all_shared.drop(all_shared_SAVs_3_index)

In [438]:
all_shared = data[(data['distribution']=='Shared')]
all_shared_SAVs_4 = all_shared[(all_shared['delta_max']>=0.4)]
all_shared_SAVs_4_index = all_shared_SAVs_4.index
all_shared_non_SAVs_4 = all_shared.drop(all_shared_SAVs_4_index)

In [439]:
all_shared = data[(data['distribution']=='Shared')]
all_shared_SAVs_5 = all_shared[(all_shared['delta_max']>=0.5)]
all_shared_SAVs_5_index = all_shared_SAVs_5.index
all_shared_non_SAVs_5 = all_shared.drop(all_shared_SAVs_5_index)
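
The cells above repeat the same split at each delta threshold. As a sketch of a more compact alternative, a small helper can return the SAV and non-SAV subsets with a boolean mask. The helper name `split_by_delta` and the toy dataframe are hypothetical, and this is not the code used to produce the counts below.

```python
import pandas as pd


def split_by_delta(df, threshold, column="delta_max"):
    """Return (SAVs, non-SAVs) for a given delta threshold."""
    is_sav = df[column] >= threshold
    return df[is_sav], df[~is_sav]


# Toy stand-in for a subset of `data` such as the shared variants.
toy = pd.DataFrame(
    {"distribution": ["Shared"] * 4, "delta_max": [0.10, 0.25, 0.40, 0.55]}
)

# Count SAVs and non-SAVs at each threshold used in the notebook.
counts = {}
for threshold in [0.2, 0.3, 0.4, 0.5]:
    savs, non_savs = split_by_delta(toy, threshold)
    counts[threshold] = (len(savs), len(non_savs))
print(counts)
```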

Build a dataframe.

In [440]:
delta_threshold = ['0.2','0.3','0.4','0.5']
total_unique_SAVs = [len(all_unique_total_SA_2), len(all_unique_total_SA_3), len(all_unique_total_SA_4), len(all_unique_total_SA_5)]
total_unique_non_SAVs = [len(all_unique_total_non_SA_2), len(all_unique_total_non_SA_3), len(all_unique_total_non_SA_4), len(all_unique_total_non_SA_5)]
shared_SAVs = [len(all_shared_SAVs_2), len(all_shared_SAVs_3), len(all_shared_SAVs_4), len(all_shared_SAVs_5)]
shared_non_SAVs = [len(all_shared_non_SAVs_2), len(all_shared_non_SAVs_3), len(all_shared_non_SAVs_4), len(all_shared_non_SAVs_5)]

total_unique_enrichment = pd.DataFrame(list(zip(delta_threshold, total_unique_SAVs, total_unique_non_SAVs, shared_SAVs, shared_non_SAVs)),
                                          columns =['delta_threshold', 'N_total_unique_SAVs', 'N_total_unique_non_SAVs', 'N_shared_SAVs', 'N_shared_non_SAVs'])
total_unique_enrichment

Unnamed: 0,delta_threshold,N_total_unique_SAVs,N_total_unique_non_SAVs,N_shared_SAVs,N_shared_non_SAVs
0,0.2,2416,615666,1933,571264
1,0.3,1175,616907,945,572252
2,0.4,685,617397,533,572664
3,0.5,437,617645,328,572869


Run the enrichment tests.

In [441]:
purifying_enrichment(total_unique_enrichment,'N_total_unique_SAVs','N_shared_SAVs','N_total_unique_non_SAVs','N_shared_non_SAVs')
total_unique_enrichment

Unnamed: 0,delta_threshold,N_total_unique_SAVs,N_total_unique_non_SAVs,N_shared_SAVs,N_shared_non_SAVs,odds_ratio,p_value,lower_CI,upper_CI
0,0.2,2416,615666,1933,571264,1.15973,1e-06,1.09228,1.23135
1,0.3,1175,616907,945,572252,1.153383,0.001099,1.058636,1.256614
2,0.4,685,617397,533,572664,1.192062,0.002351,1.06441,1.335025
3,0.5,437,617645,328,572869,1.235731,0.003772,1.070824,1.426036
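
As a sanity check on the first row (delta >= 0.2), the odds ratio and a Wald 95% CI can be recomputed by hand from the four counts. This sketch assumes the odds ratio reported by `fisher_exact` is the sample odds ratio, (a × d) / (b × c), and uses the usual standard error of the log odds ratio built from all four cell counts. The function name is hypothetical.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Sample odds ratio and Wald CI for the 2x2 table [[a, b], [c, d]]."""
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) uses the reciprocal of every cell.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Counts from the delta >= 0.2 row: unique SAVs, shared SAVs,
# unique non-SAVs, shared non-SAVs.
or_02, lower_02, upper_02 = odds_ratio_wald_ci(2416, 1933, 615666, 571264)
print(round(or_02, 5), round(lower_02, 5), round(upper_02, 5))
```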


Save this dataframe to make plotting easier.

In [442]:
total_unique_enrichment.to_csv('total_unique_enrichment_tests.txt', sep = '\t', header = True, index = False)

## Lineage-Specific <a class = 'anchor' id = 'lineagespecific'></a>

What about lineage-specific effects? 

In [443]:
altai = data[(data['distribution'] == 'Altai')]
altai_SAVs_2 = altai[(altai['delta_max'] >= 0.2)]
altai_SAVs_2_index = altai_SAVs_2.index
altai_non_SAVs_2 = altai.drop(altai_SAVs_2_index)

In [444]:
altai = data[(data['distribution'] == 'Altai')]
altai_SAVs_5 = altai[(altai['delta_max'] >= 0.5)]
altai_SAVs_5_index = altai_SAVs_5.index
altai_non_SAVs_5 = altai.drop(altai_SAVs_5_index)

In [445]:
chagyrskaya = data[(data['distribution'] == 'Chagyrskaya')]
chagyrskaya_SAVs_2 = chagyrskaya[(chagyrskaya['delta_max'] >= 0.2)]
chagyrskaya_SAVs_2_index = chagyrskaya_SAVs_2.index
chagyrskaya_non_SAVs_2 = chagyrskaya.drop(chagyrskaya_SAVs_2_index)

In [446]:
chagyrskaya = data[(data['distribution'] == 'Chagyrskaya')]
chagyrskaya_SAVs_5 = chagyrskaya[(chagyrskaya['delta_max'] >= 0.5)]
chagyrskaya_SAVs_5_index = chagyrskaya_SAVs_5.index
chagyrskaya_non_SAVs_5 = chagyrskaya.drop(chagyrskaya_SAVs_5_index)

In [447]:
denisovan = data[(data['distribution'] == 'Denisovan')]
denisovan_SAVs_2 = denisovan[(denisovan['delta_max'] >= 0.2)]
denisovan_SAVs_2_index = denisovan_SAVs_2.index
denisovan_non_SAVs_2 = denisovan.drop(denisovan_SAVs_2_index)

In [448]:
denisovan = data[(data['distribution'] == 'Denisovan')]
denisovan_SAVs_5 = denisovan[(denisovan['delta_max'] >= 0.5)]
denisovan_SAVs_5_index = denisovan_SAVs_5.index
denisovan_non_SAVs_5 = denisovan.drop(denisovan_SAVs_5_index)

In [449]:
vindija = data[(data['distribution'] == 'Vindija')]
vindija_SAVs_2 = vindija[(vindija['delta_max'] >= 0.2)]
vindija_SAVs_2_index = vindija_SAVs_2.index
vindija_non_SAVs_2 = vindija.drop(vindija_SAVs_2_index)

In [450]:
vindija = data[(data['distribution'] == 'Vindija')]
vindija_SAVs_5 = vindija[(vindija['delta_max'] >= 0.5)]
vindija_SAVs_5_index = vindija_SAVs_5.index
vindija_non_SAVs_5 = vindija.drop(vindija_SAVs_5_index)

Build the dataframe.

In [451]:
delta_threshold = ['0.2','0.2','0.2','0.2','0.5','0.5','0.5','0.5']
lineage = ['altai','chagyrskaya','denisovan','vindija','altai','chagyrskaya','denisovan','vindija']
unique_SAVs = [len(altai_SAVs_2), len(chagyrskaya_SAVs_2), len(denisovan_SAVs_2), len(vindija_SAVs_2), len(altai_SAVs_5), len(chagyrskaya_SAVs_5), len(denisovan_SAVs_5), len(vindija_SAVs_5)]
unique_non_SAVs = [len(altai_non_SAVs_2), len(chagyrskaya_non_SAVs_2), len(denisovan_non_SAVs_2), len(vindija_non_SAVs_2), len(altai_non_SAVs_5), len(chagyrskaya_non_SAVs_5), len(denisovan_non_SAVs_5), len(vindija_non_SAVs_5)]
shared_SAVs = [len(all_shared_SAVs_2), len(all_shared_SAVs_2), len(all_shared_SAVs_2), len(all_shared_SAVs_2), len(all_shared_SAVs_5), len(all_shared_SAVs_5), len(all_shared_SAVs_5), len(all_shared_SAVs_5)]
shared_non_SAVs = [len(all_shared_non_SAVs_2), len(all_shared_non_SAVs_2), len(all_shared_non_SAVs_2), len(all_shared_non_SAVs_2), len(all_shared_non_SAVs_5), len(all_shared_non_SAVs_5), len(all_shared_non_SAVs_5), len(all_shared_non_SAVs_5)]

lineage_specific_enrichment = pd.DataFrame(list(zip(delta_threshold, lineage, unique_SAVs, unique_non_SAVs, shared_SAVs, shared_non_SAVs)),
                                          columns =['delta_threshold', 'lineage', 'N_unique_SAVs', 'N_unique_non_SAVs', 'N_shared_SAVs', 'N_shared_non_SAVs'])
lineage_specific_enrichment

Unnamed: 0,delta_threshold,lineage,N_unique_SAVs,N_unique_non_SAVs,N_shared_SAVs,N_shared_non_SAVs
0,0.2,altai,399,81517,1933,571264
1,0.2,chagyrskaya,218,53457,1933,571264
2,0.2,denisovan,1492,410000,1933,571264
3,0.2,vindija,307,70692,1933,571264
4,0.5,altai,75,81841,328,572869
5,0.5,chagyrskaya,35,53640,328,572869
6,0.5,denisovan,254,411238,328,572869
7,0.5,vindija,73,70926,328,572869


In [452]:
purifying_enrichment(lineage_specific_enrichment,'N_unique_SAVs','N_shared_SAVs','N_unique_non_SAVs','N_shared_non_SAVs')
lineage_specific_enrichment

Unnamed: 0,delta_threshold,lineage,N_unique_SAVs,N_unique_non_SAVs,N_shared_SAVs,N_shared_non_SAVs,odds_ratio,p_value,lower_CI,upper_CI
0,0.2,altai,399,81517,1933,571264,1.446538,1.152673e-10,1.29842,1.61125
1,0.2,chagyrskaya,218,53457,1933,571264,1.205194,0.0107887,1.047416,1.386416
2,0.2,denisovan,1492,410000,1933,571264,1.075449,0.03575391,1.005089,1.150713
3,0.2,vindija,307,70692,1933,571264,1.283433,8.655485e-05,1.137538,1.447754
4,0.5,altai,75,81841,328,572869,1.600559,0.0004931121,1.245305,2.056994
5,0.5,chagyrskaya,35,53640,328,572869,1.139622,0.4524517,0.80417,1.614854
6,0.5,denisovan,254,411238,328,572869,1.078753,0.3774332,0.915704,1.270824
7,0.5,vindija,73,70926,328,572869,1.797624,2.009186e-05,1.394727,2.31669


In [453]:
lineage_specific_enrichment.to_csv('lineage_specific_enrichment_tests.txt', sep = '\t', header = True, index = False)

# SAVs in Moderns <a class = 'anchor' id = 'SAVsinmoderns'></a>

## Introgressed SAV Distribution <a class = 'anchor' id = 'introgressedSAVdistribution'></a>

We know how splice altering variants appear to be distributed among archaics. But how are the introgressed SAVs distributed among the archaic lineages? Let's use the archaic-specific variants as a background, so we will assess those first.

In [454]:
archaic_specific_2 = data_2[data_2['Vernot_allele_origin']=='archaic-specific']
archaic_dist_2 = archaic_specific_2.groupby(['distribution']).size().to_frame('archaic_distribution').reset_index()
archaic_dist_2

Unnamed: 0,distribution,archaic_distribution
0,Altai,310
1,Chagyrskaya,172
2,Denisovan,956
3,Late Neanderthal,97
4,Neanderthal,268
5,Other,142
6,Shared,160
7,Vindija,238


In [455]:
archaic_dist_2 = archaic_dist_2.drop([2, 5, 6])
archaic_dist_2['prop'] = archaic_dist_2['archaic_distribution']/archaic_dist_2['archaic_distribution'].sum()
archaic_dist_2

Unnamed: 0,distribution,archaic_distribution,prop
0,Altai,310,0.285714
1,Chagyrskaya,172,0.158525
3,Late Neanderthal,97,0.089401
4,Neanderthal,268,0.247005
7,Vindija,238,0.219355
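
Dropping rows by position (`drop([2, 5, 6])`) works as long as the groupby output keeps the same row order. A sketch of an alternative is to select the retained categories by name with `isin` and then normalize. The counts below are copied from the table above; the variable names `toy_dist` and `kept_groups` are hypothetical.

```python
import pandas as pd

# Archaic-specific SAV counts at delta >= 0.2, copied from the output above.
toy_dist = pd.DataFrame(
    {
        "distribution": ["Altai", "Chagyrskaya", "Denisovan", "Late Neanderthal",
                         "Neanderthal", "Other", "Shared", "Vindija"],
        "archaic_distribution": [310, 172, 956, 97, 268, 142, 160, 238],
    }
)

# Keep the same categories the notebook retains, but select them by name.
kept_groups = ["Altai", "Chagyrskaya", "Late Neanderthal", "Neanderthal", "Vindija"]
subset = toy_dist[toy_dist["distribution"].isin(kept_groups)].copy()
subset["prop"] = subset["archaic_distribution"] / subset["archaic_distribution"].sum()
print(subset)
```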


In [456]:
Vernot_introgressed_dist_2 = data_2[data_2['Vernot_allele_origin']=='introgressed'].groupby(['distribution']).size().to_frame('count').reset_index()
Vernot_introgressed_dist_2

Unnamed: 0,distribution,count
0,Altai,7
1,Denisovan,3
2,Neanderthal,141
3,Other,20
4,Shared,66


In [457]:
Vernot_introgressed_dist_2 = Vernot_introgressed_dist_2.drop([1, 3, 4])
Vernot_introgressed_dist_2['prop'] = Vernot_introgressed_dist_2['count'] / Vernot_introgressed_dist_2['count'].sum()
Vernot_introgressed_dist_2

Unnamed: 0,distribution,count,prop
0,Altai,7,0.047297
2,Neanderthal,141,0.952703


Repeat for delta >= 0.5.

In [458]:
archaic_specific_5 = data_5[data_5['Vernot_allele_origin']=='archaic-specific']
archaic_dist_5 = archaic_specific_5.groupby(['distribution']).size().to_frame('archaic_distribution').reset_index()
archaic_dist_5

Unnamed: 0,distribution,archaic_distribution
0,Altai,57
1,Chagyrskaya,27
2,Denisovan,173
3,Late Neanderthal,10
4,Neanderthal,42
5,Other,29
6,Shared,34
7,Vindija,57


In [459]:
archaic_dist_5 = archaic_dist_5.drop([2, 5, 6])
archaic_dist_5['prop'] = archaic_dist_5['archaic_distribution']/archaic_dist_5['archaic_distribution'].sum()
archaic_dist_5

Unnamed: 0,distribution,archaic_distribution,prop
0,Altai,57,0.295337
1,Chagyrskaya,27,0.139896
3,Late Neanderthal,10,0.051813
4,Neanderthal,42,0.217617
7,Vindija,57,0.295337


In [460]:
Vernot_introgressed_dist_5 = data_5[data_5['Vernot_allele_origin']=='introgressed'].groupby(['distribution']).size().to_frame('count').reset_index()
Vernot_introgressed_dist_5

Unnamed: 0,distribution,count
0,Altai,1
1,Denisovan,1
2,Neanderthal,36
3,Other,4
4,Shared,14


In [461]:
Vernot_introgressed_dist_5 = Vernot_introgressed_dist_5.drop([1, 3, 4])
Vernot_introgressed_dist_5['prop'] = Vernot_introgressed_dist_5['count'] / Vernot_introgressed_dist_5['count'].sum()
Vernot_introgressed_dist_5

Unnamed: 0,distribution,count,prop
0,Altai,1,0.027027
2,Neanderthal,36,0.972973


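Now for Browning. The archaic-specific baseline below is recomputed from the Vernot classification, so its counts match those above.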

In [462]:
archaic_specific_2 = data_2[data_2['Vernot_allele_origin']=='archaic-specific']
archaic_dist_2 = archaic_specific_2.groupby(['distribution']).size().to_frame('archaic_distribution').reset_index()
archaic_dist_2

Unnamed: 0,distribution,archaic_distribution
0,Altai,310
1,Chagyrskaya,172
2,Denisovan,956
3,Late Neanderthal,97
4,Neanderthal,268
5,Other,142
6,Shared,160
7,Vindija,238


In [463]:
archaic_dist_2 = archaic_dist_2.drop([2, 5, 6])
archaic_dist_2['prop'] = archaic_dist_2['archaic_distribution']/archaic_dist_2['archaic_distribution'].sum()
archaic_dist_2

Unnamed: 0,distribution,archaic_distribution,prop
0,Altai,310,0.285714
1,Chagyrskaya,172,0.158525
3,Late Neanderthal,97,0.089401
4,Neanderthal,268,0.247005
7,Vindija,238,0.219355


In [464]:
Browning_introgressed_dist_2 = data_2[data_2['Browning_allele_origin']=='introgressed'].groupby(['distribution']).size().to_frame('count').reset_index()
Browning_introgressed_dist_2

Unnamed: 0,distribution,count
0,Altai,15
1,Denisovan,2
2,Neanderthal,203
3,Other,37
4,Shared,120


In [465]:
Browning_introgressed_dist_2 = Browning_introgressed_dist_2.drop([1, 3, 4])
Browning_introgressed_dist_2['prop'] = Browning_introgressed_dist_2['count'] / Browning_introgressed_dist_2['count'].sum()
Browning_introgressed_dist_2

Unnamed: 0,distribution,count,prop
0,Altai,15,0.068807
2,Neanderthal,203,0.931193


In [466]:
archaic_specific_5 = data_5[data_5['Vernot_allele_origin']=='archaic-specific']
archaic_dist_5 = archaic_specific_5.groupby(['distribution']).size().to_frame('archaic_distribution').reset_index()
archaic_dist_5

Unnamed: 0,distribution,archaic_distribution
0,Altai,57
1,Chagyrskaya,27
2,Denisovan,173
3,Late Neanderthal,10
4,Neanderthal,42
5,Other,29
6,Shared,34
7,Vindija,57


In [467]:
archaic_dist_5 = archaic_dist_5.drop([2, 5, 6])
archaic_dist_5['prop'] = archaic_dist_5['archaic_distribution']/archaic_dist_5['archaic_distribution'].sum()
archaic_dist_5

Unnamed: 0,distribution,archaic_distribution,prop
0,Altai,57,0.295337
1,Chagyrskaya,27,0.139896
3,Late Neanderthal,10,0.051813
4,Neanderthal,42,0.217617
7,Vindija,57,0.295337


In [468]:
Browning_introgressed_dist_5 = data_5[data_5['Browning_allele_origin']=='introgressed'].groupby(['distribution']).size().to_frame('count').reset_index()
Browning_introgressed_dist_5

Unnamed: 0,distribution,count
0,Altai,4
1,Denisovan,2
2,Neanderthal,36
3,Other,4
4,Shared,23


In [469]:
Browning_introgressed_dist_5 = Browning_introgressed_dist_5.drop([1, 3, 4])
Browning_introgressed_dist_5['prop'] = Browning_introgressed_dist_5['count'] / Browning_introgressed_dist_5['count'].sum()
Browning_introgressed_dist_5

Unnamed: 0,distribution,count,prop
0,Altai,4,0.1
2,Neanderthal,36,0.9
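
The cells above repeat one pattern for each delta threshold and each introgression set: filter by allele origin, count variants per distribution, drop the non-Neanderthal categories, and compute proportions. Below is a minimal sketch of a helper for that pattern. The function name `distribution_props` and the toy dataframe are hypothetical and not part of the original analysis. It selects rows by distribution label rather than by index, so it does not rely on which categories happen to be present in a given subset.

```python
import pandas as pd

def distribution_props(df, origin_col, origin, keep, count_name="count"):
    """Count variants per distribution for one allele origin and add proportions.

    Hypothetical helper; `keep` lists the distribution labels to retain.
    """
    subset = df[df[origin_col] == origin]
    counts = (
        subset.groupby("distribution")
        .size()
        .to_frame(count_name)
        .reset_index()
    )
    # Filter by label instead of by index position
    counts = counts[counts["distribution"].isin(keep)].copy()
    counts["prop"] = counts[count_name] / counts[count_name].sum()
    return counts

# Toy stand-in for data_2; the rows are made up for illustration
toy = pd.DataFrame({
    "distribution": ["Altai", "Neanderthal", "Neanderthal",
                     "Denisovan", "Shared", "Neanderthal"],
    "Vernot_allele_origin": ["introgressed"] * 6,
})
neanderthal_sets = ["Altai", "Chagyrskaya", "Late Neanderthal",
                    "Neanderthal", "Vindija"]
result = distribution_props(toy, "Vernot_allele_origin", "introgressed",
                            neanderthal_sets)
print(result)
```

With the real data, a call such as `distribution_props(data_5, 'Browning_allele_origin', 'introgressed', neanderthal_sets)` should correspond to the last two cells above, assuming the same column names.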


## Allele Frequency and Max Delta <a class = 'anchor' id = 'allelefrequencymaxdelta'></a>

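If delta reflects the probability that a variant alters splicing, it may be related to allele frequency in modern human populations: variants with small deltas may reach high frequency because they do not actually alter splicing. We test this with Spearman correlations between allele frequency and max delta, first for all variants present in 1KG and then for SAVs at both delta thresholds.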

In [470]:
present_in_1KG = data[data['present_in_1KG'] == 'yes']
present_in_1KG_SA_2 = data_2[data_2['present_in_1KG'] == 'yes']
present_in_1KG_SA_5 = data_5[data_5['present_in_1KG'] == 'yes']

In [471]:
rho, p = spearmanr(present_in_1KG['1KG_allele_frequency'], present_in_1KG['delta_max'])
print(rho,p)

-0.007352829220282604 8.637985526942276e-14


In [472]:
rho, p = spearmanr(present_in_1KG_SA_2['1KG_allele_frequency'], present_in_1KG_SA_2['delta_max'])
print(rho,p)

-0.00015112162504674333 0.9928437359082407


In [473]:
rho, p = spearmanr(present_in_1KG_SA_5['1KG_allele_frequency'], present_in_1KG_SA_5['delta_max'])
print(rho,p)

-0.1122804631529113 0.005817787511787715
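
The three correlations above use the same two columns on different subsets. Here is a minimal sketch of a wrapper, assuming the column names used in this notebook. The function name `af_delta_spearman` and the toy values are hypothetical. It drops rows with a missing value in either column before calling `spearmanr`, as a precaution, because `spearmanr` propagates NaN by default. It also returns the number of variants used, which helps when comparing p-values across subsets of very different size.

```python
import pandas as pd
from scipy.stats import spearmanr

def af_delta_spearman(df, af_col="1KG_allele_frequency", delta_col="delta_max"):
    """Spearman correlation between allele frequency and max delta.

    Hypothetical helper; rows with a missing value in either column are skipped.
    """
    pair = df[[af_col, delta_col]].dropna()
    rho, p = spearmanr(pair[af_col], pair[delta_col])
    return rho, p, len(pair)

# Toy values, made up so that frequency falls as delta rises
toy = pd.DataFrame({
    "1KG_allele_frequency": [0.50, 0.30, 0.20, 0.10, None],
    "delta_max": [0.01, 0.05, 0.21, 0.60, 0.90],
})
rho, p, n = af_delta_spearman(toy)
print(rho, p, n)
```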


Let's also consider ancient vs introgressed variants. First Vernot.

In [474]:
Vernot_ancient_variants = data[data['Vernot_allele_origin'] == 'ancient']
Vernot_introgressed_variants = data[data['Vernot_allele_origin'] == 'introgressed']

Vernot_ancient_variants_SA_2 = data_2[data_2['Vernot_allele_origin'] == 'ancient']
Vernot_introgressed_variants_SA_2 = data_2[data_2['Vernot_allele_origin'] == 'introgressed']

Vernot_ancient_variants_SA_5 = data_5[data_5['Vernot_allele_origin'] == 'ancient']
Vernot_introgressed_variants_SA_5 = data_5[data_5['Vernot_allele_origin'] == 'introgressed']

In [475]:
rho, p = spearmanr(Vernot_ancient_variants['1KG_allele_frequency'], Vernot_ancient_variants['delta_max'])
print(rho,p)

-0.003314513124591188 0.006217887126235556


In [476]:
rho, p = spearmanr(Vernot_introgressed_variants['Vernot_introgressed_AF'], Vernot_introgressed_variants['delta_max'])
print(rho,p)

0.0029179513598118316 0.4527678020639103


In [477]:
rho, p = spearmanr(Vernot_ancient_variants_SA_2['1KG_allele_frequency'], Vernot_ancient_variants_SA_2['delta_max'])
print(rho,p)

0.023032083134465274 0.27459969993034433


In [478]:
rho, p = spearmanr(Vernot_introgressed_variants_SA_2['Vernot_introgressed_AF'], Vernot_introgressed_variants_SA_2['delta_max'])
print(rho,p)

-0.23615138147102763 0.0002441666089658224


In [479]:
rho, p = spearmanr(Vernot_ancient_variants_SA_5['1KG_allele_frequency'], Vernot_ancient_variants_SA_5['delta_max'])
print(rho,p)

-0.12045161956844723 0.018365211572499436


In [480]:
rho, p = spearmanr(Vernot_introgressed_variants_SA_5['Vernot_introgressed_AF'], Vernot_introgressed_variants_SA_5['delta_max'])
print(rho,p)

0.005456816757488103 0.968161416494062


Now Browning.

In [481]:
Browning_ancient_variants = data[data['Browning_allele_origin'] == 'ancient']
Browning_introgressed_variants = data[data['Browning_allele_origin'] == 'introgressed']

Browning_ancient_variants_SA_2 = data_2[data_2['Browning_allele_origin'] == 'ancient']
Browning_introgressed_variants_SA_2 = data_2[data_2['Browning_allele_origin'] == 'introgressed']

Browning_ancient_variants_SA_5 = data_5[data_5['Browning_allele_origin'] == 'ancient']
Browning_introgressed_variants_SA_5 = data_5[data_5['Browning_allele_origin'] == 'introgressed']

In [482]:
rho, p = spearmanr(Browning_ancient_variants['1KG_allele_frequency'], Browning_ancient_variants['delta_max'])
print(rho,p)

-0.0014042305212856882 0.2501708002524762


In [483]:
rho, p = spearmanr(Browning_introgressed_variants['Browning_introgressed_AF'], Browning_introgressed_variants['delta_max'])
print(rho,p)

0.0035669425354435845 0.2694796095858049


In [484]:
rho, p = spearmanr(Browning_ancient_variants_SA_2['1KG_allele_frequency'], Browning_ancient_variants_SA_2['delta_max'])
print(rho,p)

0.018438954258859618 0.3878837383270327


In [485]:
rho, p = spearmanr(Browning_introgressed_variants_SA_2['Browning_introgressed_AF'], Browning_introgressed_variants_SA_2['delta_max'])
print(rho,p)

-0.11340802088388707 0.02768077570516449


In [486]:
rho, p = spearmanr(Browning_ancient_variants_SA_5['1KG_allele_frequency'], Browning_ancient_variants_SA_5['delta_max'])
print(rho,p)

-0.11314820835053853 0.02825198996162118


In [487]:
rho, p = spearmanr(Browning_introgressed_variants_SA_5['Browning_introgressed_AF'], Browning_introgressed_variants_SA_5['delta_max'])
print(rho,p)

-0.14422039040801918 0.2370900386665261


Let's compare allele frequencies between African (AFR) and non-African (AMR, EAS, EUR, and SAS) superpopulations.

First Vernot.

In [488]:
Vernot_ancient_SA_2_AFR = Vernot_ancient_variants_SA_2['1KG_non_ASW_AFR_AF']
Vernot_ancient_SA_2_AFR.mean()

0.5226508643821909

In [489]:
Vernot_ancient_SA_2_non_AFR = pd.concat([Vernot_ancient_variants_SA_2['1KG_AMR_AF'], Vernot_ancient_variants_SA_2['1KG_EAS_AF'], Vernot_ancient_variants_SA_2['1KG_EUR_AF'], Vernot_ancient_variants_SA_2['1KG_SAS_AF']], axis=0)
Vernot_ancient_SA_2_non_AFR.mean()

0.4761589698046181

In [490]:
mannwhitneyu(Vernot_ancient_SA_2_AFR, Vernot_ancient_SA_2_non_AFR)

MannwhitneyuResult(statistic=10963956.0, pvalue=2.6636825400212013e-09)

For the introgressed variants, let's use the Vernot AFs, which are reported for the archaic allele. Note that the Vernot AFR frequencies may include ASW samples.

In [491]:
Vernot_introgressed_SA_2_AFR = Vernot_introgressed_variants_SA_2['Vernot_AFR_AF']
Vernot_introgressed_SA_2_AFR.mean()

0.0005103797468354429

In [492]:
Vernot_introgressed_SA_2_non_AFR = pd.concat([Vernot_introgressed_variants_SA_2['Vernot_AMR_AF'], Vernot_introgressed_variants_SA_2['Vernot_EAS_AF'], Vernot_introgressed_variants_SA_2['Vernot_EUR_AF'], Vernot_introgressed_variants_SA_2['Vernot_SAS_AF']], axis=0)
Vernot_introgressed_SA_2_non_AFR.mean()

0.028102088607594937

In [493]:
mannwhitneyu(Vernot_introgressed_SA_2_AFR, Vernot_introgressed_SA_2_non_AFR)

MannwhitneyuResult(statistic=41737.5, pvalue=6.418609721811313e-54)
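
The AFR versus non-AFR comparison above follows the same steps each time: take the AFR column, stack the other superpopulation columns into one series, and run a Mann-Whitney U test. A minimal sketch of those steps follows. The function name `afr_vs_non_afr` and the toy frequencies are hypothetical. The sketch passes `alternative="two-sided"` explicitly; the cells above rely on the SciPy default, which depends on the installed version.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

def afr_vs_non_afr(df, afr_col, non_afr_cols):
    """Compare AFR allele frequencies with pooled non-AFR frequencies.

    Hypothetical helper; returns both means plus the U statistic and p-value.
    """
    afr = df[afr_col].dropna()
    # Stack the non-AFR superpopulation columns into a single series
    non_afr = pd.concat([df[c] for c in non_afr_cols], ignore_index=True).dropna()
    stat, p = mannwhitneyu(afr, non_afr, alternative="two-sided")
    return afr.mean(), non_afr.mean(), stat, p

# Toy frequencies, made up for illustration
toy = pd.DataFrame({
    "Vernot_AFR_AF": [0.0, 0.0, 0.01],
    "Vernot_AMR_AF": [0.1, 0.2, 0.3],
    "Vernot_EUR_AF": [0.2, 0.3, 0.4],
})
afr_mean, non_afr_mean, stat, p = afr_vs_non_afr(
    toy, "Vernot_AFR_AF", ["Vernot_AMR_AF", "Vernot_EUR_AF"]
)
print(afr_mean, non_afr_mean, stat, p)
```

Pooling the columns means each variant contributes one AFR value and several non-AFR values, so the two samples are not independent draws. The sketch mirrors the notebook's approach and does not correct for this.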

Now Browning.

In [494]:
Browning_ancient_SA_2_AFR = Browning_ancient_variants_SA_2['1KG_non_ASW_AFR_AF']
Browning_ancient_SA_2_AFR.mean()

0.5356676302246378

In [495]:
Browning_ancient_SA_2_non_AFR = pd.concat([Browning_ancient_variants_SA_2['1KG_AMR_AF'], Browning_ancient_variants_SA_2['1KG_EAS_AF'], Browning_ancient_variants_SA_2['1KG_EUR_AF'], Browning_ancient_variants_SA_2['1KG_SAS_AF']], axis=0)
Browning_ancient_SA_2_non_AFR.mean()

0.4851947608200456

In [496]:
mannwhitneyu(Browning_ancient_SA_2_AFR, Browning_ancient_SA_2_non_AFR)

MannwhitneyuResult(statistic=10506748.0, pvalue=5.408902214406806e-11)

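We need to do a little extra work for Browning to handle variants whose introgressed allele is the reference allele (`Browning_ref_alt == 0`). For those variants we take 1 - AF from the 1KG columns; where the introgressed allele is the alternate allele (`Browning_ref_alt == 1`), we use the 1KG AF as is. Let's start with AFR.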

In [497]:
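# Variants where the introgressed allele is the reference allele
Browning_introgressed_SA_2_AFR_refs = Browning_introgressed_variants_SA_2.loc[(Browning_introgressed_variants_SA_2['Browning_ref_alt'] == 0), '1KG_non_ASW_AFR_AF']
# Convert to the frequency of the introgressed (reference) allele
Browning_introgressed_SA_2_AFR_refs = 1-Browning_introgressed_SA_2_AFR_refs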
Browning_introgressed_SA_2_AFR_refs.head(5)

237257    0.001949
744368    0.000975
Name: 1KG_non_ASW_AFR_AF, dtype: float64

In [498]:
Browning_introgressed_SA_2_AFR_alts = Browning_introgressed_variants_SA_2.loc[(Browning_introgressed_variants_SA_2['Browning_ref_alt'] == 1), '1KG_non_ASW_AFR_AF']
Browning_introgressed_SA_2_AFR_alts.head(5)

10998    0.003899
14574    0.007797
17289    0.000000
17321    0.000000
19257    0.000000
Name: 1KG_non_ASW_AFR_AF, dtype: float64

In [499]:
Browning_introgressed_SA_2_AFR = pd.concat([Browning_introgressed_SA_2_AFR_refs, Browning_introgressed_SA_2_AFR_alts], ignore_index=True)
Browning_introgressed_SA_2_AFR.head(5)

0    0.001949
1    0.000975
2    0.003899
3    0.007797
4    0.000000
Name: 1KG_non_ASW_AFR_AF, dtype: float64

In [500]:
Browning_introgressed_SA_2_AFR.mean()

0.0014736221632773263

Now do AMR, EAS, EUR, and SAS.

In [501]:
Browning_introgressed_SA_2_AMR_refs = Browning_introgressed_variants_SA_2.loc[(Browning_introgressed_variants_SA_2['Browning_ref_alt'] == 0), '1KG_AMR_AF']
Browning_introgressed_SA_2_EAS_refs = Browning_introgressed_variants_SA_2.loc[(Browning_introgressed_variants_SA_2['Browning_ref_alt'] == 0), '1KG_EAS_AF']
Browning_introgressed_SA_2_EUR_refs = Browning_introgressed_variants_SA_2.loc[(Browning_introgressed_variants_SA_2['Browning_ref_alt'] == 0), '1KG_EUR_AF']
Browning_introgressed_SA_2_SAS_refs = Browning_introgressed_variants_SA_2.loc[(Browning_introgressed_variants_SA_2['Browning_ref_alt'] == 0), '1KG_SAS_AF']
Browning_introgressed_SA_2_non_AFR_refs = pd.concat([Browning_introgressed_SA_2_AMR_refs, Browning_introgressed_SA_2_EAS_refs, Browning_introgressed_SA_2_EUR_refs, Browning_introgressed_SA_2_SAS_refs], ignore_index=True)
Browning_introgressed_SA_2_non_AFR_refs = 1 - Browning_introgressed_SA_2_non_AFR_refs
Browning_introgressed_SA_2_non_AFR_refs.head(10)

0    0.25
1    0.07
2    0.00
3    0.10
4    0.17
5    0.14
6    0.03
7    0.04
dtype: float64

In [502]:
Browning_introgressed_SA_2_AMR_alts = Browning_introgressed_variants_SA_2.loc[(Browning_introgressed_variants_SA_2['Browning_ref_alt'] == 1), '1KG_AMR_AF']
Browning_introgressed_SA_2_EAS_alts = Browning_introgressed_variants_SA_2.loc[(Browning_introgressed_variants_SA_2['Browning_ref_alt'] == 1), '1KG_EAS_AF']
Browning_introgressed_SA_2_EUR_alts = Browning_introgressed_variants_SA_2.loc[(Browning_introgressed_variants_SA_2['Browning_ref_alt'] == 1), '1KG_EUR_AF']
Browning_introgressed_SA_2_SAS_alts = Browning_introgressed_variants_SA_2.loc[(Browning_introgressed_variants_SA_2['Browning_ref_alt'] == 1), '1KG_SAS_AF']
Browning_introgressed_SA_2_non_AFR_alts = pd.concat([Browning_introgressed_SA_2_AMR_alts, Browning_introgressed_SA_2_EAS_alts, Browning_introgressed_SA_2_EUR_alts, Browning_introgressed_SA_2_SAS_alts], ignore_index=True)
Browning_introgressed_SA_2_non_AFR_alts.head(10)

0    0.00
1    0.06
2    0.01
3    0.02
4    0.02
5    0.00
6    0.00
7    0.00
8    0.01
9    0.00
dtype: float64

In [503]:
Browning_introgressed_SA_2_non_AFR = pd.concat([Browning_introgressed_SA_2_non_AFR_refs, Browning_introgressed_SA_2_non_AFR_alts], ignore_index=True)
Browning_introgressed_SA_2_non_AFR.mean()

0.03881299734748011

In [504]:
mannwhitneyu(Browning_introgressed_SA_2_AFR, Browning_introgressed_SA_2_non_AFR)

MannwhitneyuResult(statistic=153280.0, pvalue=2.2521777841236923e-48)
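
The reference-allele handling above can be condensed into a single step per frequency column. This is a minimal sketch, assuming `Browning_ref_alt` is 0 when the introgressed allele is the reference allele and 1 when it is the alternate allele, as in the cells above. The function name `introgressed_allele_af` and the toy values are hypothetical.

```python
import pandas as pd

def introgressed_allele_af(df, af_col, ref_alt_col="Browning_ref_alt"):
    """Return the frequency of the introgressed allele for each variant.

    Hypothetical helper. Rows with a missing ref/alt code are excluded,
    matching the two-subset approach used in the notebook.
    """
    known = df[df[ref_alt_col].notna()]
    af = known[af_col]
    # Keep AF where the alt allele is introgressed; otherwise use 1 - AF
    return af.where(known[ref_alt_col] == 1, 1 - af)

# Toy values, made up for illustration
toy = pd.DataFrame({
    "Browning_ref_alt": [0, 1, 1, 0, None],
    "1KG_AMR_AF": [0.75, 0.06, 0.00, 0.93, 0.50],
})
flipped = introgressed_allele_af(toy, "1KG_AMR_AF")
print(flipped)
```

The non-AFR series could then be built with `pd.concat([introgressed_allele_af(df, c) for c in cols], ignore_index=True)`, where `cols` lists the four non-AFR 1KG columns.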

## Introgressed Genes <a class = 'anchor' id = 'introgressedgenes'></a>

Let's take a quick peek at the genes carrying the highest-frequency introgressed SAVs, first using the Vernot set and then the Browning set.

In [505]:
SA_2_introgressed = data_2[data_2['Vernot_allele_origin']=='introgressed']
SA_2_introgressed_AF = SA_2_introgressed.sort_values(['Vernot_introgressed_AF'], ascending = [False])
SA_2_introgressed_AF.head(50)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
952426,chr20,62224595,G,A,G,derived,snv,0/1,1/1,0/0,0/1,True,True,False,True,Neanderthal,yes,1138.0,5096.0,0.22,0.51,0.21,0.01,0.11,0.31,0.000975,yes,G,A,1.0,0.03503,0.00099,0.10663,0.51488,0.21173,0.03704,0.30982,G,G,A,chr20_62195671_62229244,0.22881,yes,1.0,0.22,GMEB2,0.66048,2.202,0.049787,3.9462,-0.837,0.01,0.0,0.41,0.02,0.41,15,-44,2,-2,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,21.0,yes,introgressed,introgressed
248988,chr11,11507780,G,A,G,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,917.0,5096.0,0.18,0.14,0.31,0.02,0.21,0.28,0.003899,yes,G,A,1.0,0.05732,0.00397,0.21182,0.13988,0.31014,0.09259,0.28323,G,A,,chr11_11494384_11530287,0.189808,yes,1.0,0.18,GALNT18,0.87026,0.8799,0.38736,3.0252,-0.302,0.0,0.0,0.21,0.0,0.21,37,-14,2,-14,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,introgressed,introgressed
1071943,chr4,38805942,G,C,G,derived,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,911.0,5096.0,0.18,0.39,0.22,0.01,0.15,0.16,0.001949,yes,G,C,1.0,0.04459,0.00198,0.14986,0.39484,0.22068,0.05556,0.1636,G,G,C,chr4_38294200_38857896,0.186192,yes,1.0,0.18,TLR1,1.0992,-0.70157,1.4112,-1.7272,-0.219,0.0,0.0,0.0,0.38,0.38,27,-15,1,-15,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,7.0,yes,introgressed,introgressed
1296126,chr6,52138226,C,G,C,derived,snv,0/1,0/0,0/0,0/0,True,False,False,False,Altai,yes,782.0,5096.0,0.15,0.24,0.17,0.0,0.09,0.29,0.001949,yes,C,G,1.0,0.01274,0.00198,0.09366,0.24206,0.166,0.09259,0.29448,C,C,G,chr6_52138225_52183126,0.159636,no,,0.15,MCM3,0.89914,0.80451,0.53523,2.884,0.563,0.21,0.0,0.0,0.0,0.21,-7,-40,-7,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,introgressed,ancient
224953,chr10,121209372,G,A,G,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,690.0,5096.0,0.14,0.31,0.09,0.0,0.11,0.2,0.001949,yes,G,A,1.0,0.00318,0.00198,0.11095,0.31845,0.08946,0.11111,0.19632,G,A,,chr10_121002922_121217061,0.143432,yes,1.0,0.14,GRK5,0.68002,2.1583,0.20362,4.327,-0.355,0.23,0.0,0.0,0.0,0.23,-23,1,4,-14,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,yes,introgressed,introgressed
547177,chr15,85403496,G,A,G,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,682.0,5096.0,0.13,0.14,0.24,0.01,0.2,0.13,0.003899,yes,G,A,1.0,0.0414,0.00397,0.20029,0.13889,0.23559,0.0,0.12883,G,A,,chr15_84263508_85447079,0.141514,yes,1.0,0.13,ALPK3,0.98871,0.1314,0.7161,2.1538,0.462,0.0,0.0,0.33,0.0,0.33,15,50,15,-11,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,17.0,yes,introgressed,introgressed
1537600,chr8,99006787,C,T,C,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,569.0,5096.0,0.11,0.13,0.09,0.0,0.24,0.17,0.0,yes,C,T,1.0,0.00955,0.0,0.23919,0.12897,0.09344,0.05556,0.1728,C,T,,chr8_98910893_99084133,0.12688,yes,1.0,0.11,MATN2,0.79916,1.6689,0.6382,2.2207,-0.373,0.0,0.0,0.0,0.24,0.24,43,8,-2,43,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,introgressed,introgressed
249339,chr11,11944265,A,G,A,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,591.0,5096.0,0.12,0.19,0.19,0.01,0.09,0.11,0.009747,yes,A,G,1.0,0.02866,0.00893,0.09222,0.18948,0.19682,0.18519,0.11043,A,G,,chr11_11691675_12000532,0.119576,yes,1.0,0.12,USP47,0.73421,2.4375,0.056706,7.3415,0.383,0.0,0.35,0.0,0.0,0.35,-13,18,-1,18,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,introgressed,introgressed
600368,chr16,77412781,T,A,T,derived,snv,1/1,0/1,0/0,1/1,True,True,False,True,Neanderthal,yes,528.0,5096.0,0.1,0.24,0.12,0.0,0.04,0.12,0.0,yes,T,A,1.0,0.01274,0.0,0.04611,0.23909,0.12127,0.16667,0.12474,T,A,,chr16_77304807_77417155,0.106242,yes,1.0,0.1,ADAMTS18,1.383,-3.5082,0.7429,2.0685,0.455,0.0,0.0,0.41,0.0,0.41,23,-32,2,-32,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,introgressed,introgressed
1137481,chr4,163061051,T,C,T,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,510.0,5096.0,0.1,0.08,0.23,0.0,0.09,0.12,0.0,yes,T,C,1.0,0.01592,0.0,0.0879,0.07837,0.22962,0.14815,0.12372,T,C,,chr4_162928454_163168783,0.103922,yes,1.0,0.1,FSTL5,0.99361,0.047046,0.41999,3.4195,-0.415,0.0,0.0,0.2,0.0,0.2,1,-48,1,-48,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,yes,introgressed,introgressed


In [506]:
SA_2_introgressed = data_2[data_2['Browning_allele_origin']=='introgressed']
SA_2_introgressed_AF = SA_2_introgressed.sort_values(['Browning_introgressed_AF'], ascending = [False])
SA_2_introgressed_AF.head(10)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
906431,chr2,238970511,G,A,G,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,1274.0,5096.0,0.25,0.23,0.45,0.04,0.43,0.23,0.018519,no,,,,,,,,,,,,,,,,yes,1.0,0.25,UBE2F-SCLY,,,,,0.561,0.0,0.0,0.04,0.31,0.31,31,-5,-34,-5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,yes,ancient,introgressed
906433,chr2,238970511,G,A,G,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,1274.0,5096.0,0.25,0.23,0.45,0.04,0.43,0.23,0.018519,no,,,,,,,,,,,,,,,,yes,1.0,0.25,SCLY,1.0152,-0.091207,0.75947,1.023,0.561,0.0,0.0,0.09,0.46,0.46,31,-5,-34,-5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,yes,ancient,introgressed
906427,chr2,238970308,A,G,A,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,1274.0,5096.0,0.25,0.23,0.45,0.04,0.43,0.23,0.018519,no,,,,,,,,,,,,,,,,yes,1.0,0.25,SCLY,1.0152,-0.091207,0.75947,1.023,0.402,0.0,0.25,0.03,0.0,0.25,48,2,39,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,yes,ancient,introgressed
952426,chr20,62224595,G,A,G,derived,snv,0/1,1/1,0/0,0/1,True,True,False,True,Neanderthal,yes,1138.0,5096.0,0.22,0.51,0.21,0.01,0.11,0.31,0.000975,yes,G,A,1.0,0.03503,0.00099,0.10663,0.51488,0.21173,0.03704,0.30982,G,G,A,chr20_62195671_62229244,0.22881,yes,1.0,0.22,GMEB2,0.66048,2.202,0.049787,3.9462,-0.837,0.01,0.0,0.41,0.02,0.41,15,-44,2,-2,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,21.0,yes,introgressed,introgressed
1071945,chr4,38806019,A,G,G,ancestral,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,911.0,5096.0,0.18,0.39,0.22,0.01,0.15,0.16,0.001949,no,,,,,,,,,,,,,,,,yes,1.0,0.18,TLR1,1.0992,-0.70157,1.4112,-1.7272,-1.664,0.01,0.21,0.0,0.0,0.21,50,-16,-16,-28,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,7.0,yes,ancient,introgressed
248988,chr11,11507780,G,A,G,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,917.0,5096.0,0.18,0.14,0.31,0.02,0.21,0.28,0.003899,yes,G,A,1.0,0.05732,0.00397,0.21182,0.13988,0.31014,0.09259,0.28323,G,A,,chr11_11494384_11530287,0.189808,yes,1.0,0.18,GALNT18,0.87026,0.8799,0.38736,3.0252,-0.302,0.0,0.0,0.21,0.0,0.21,37,-14,2,-14,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,introgressed,introgressed
1071943,chr4,38805942,G,C,G,derived,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,911.0,5096.0,0.18,0.39,0.22,0.01,0.15,0.16,0.001949,yes,G,C,1.0,0.04459,0.00198,0.14986,0.39484,0.22068,0.05556,0.1636,G,G,C,chr4_38294200_38857896,0.186192,yes,1.0,0.18,TLR1,1.0992,-0.70157,1.4112,-1.7272,-0.219,0.0,0.0,0.0,0.38,0.38,27,-15,1,-15,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,7.0,yes,introgressed,introgressed
1071935,chr4,38802528,G,A,G,derived,snv,0/1,0/0,0/0,1/1,True,False,False,True,Other,yes,705.0,5096.0,0.14,0.27,0.17,0.01,0.12,0.15,0.001949,no,,,,,,,,,,,,,,,,yes,1.0,0.14,TLR1,1.0992,-0.70157,1.4112,-1.7272,-0.247,0.0,0.33,0.0,0.19,0.33,-50,41,0,-50,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,8.0,yes,ancient,introgressed
1449824,chr7,136608159,G,A,G,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,710.0,5096.0,0.14,0.04,0.36,0.02,0.21,0.12,0.008772,no,,,,,,,,,,,,,,,,yes,1.0,0.14,CHRM2,0.65445,1.92,0.27789,2.1986,-0.264,0.0,0.0,0.0,0.27,0.27,2,7,2,15,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,ancient,introgressed
344527,chr12,27922074,T,C,T,derived,snv,1/1,1/1,1/1,1/1,True,True,True,True,Shared,yes,715.0,5096.0,0.14,0.2,0.24,0.02,0.11,0.16,0.011696,no,,,,,,,,,,,,,,,,yes,1.0,0.14,MANSC4,0.65585,1.6194,1.2109,-0.43512,0.454,0.0,0.22,0.0,0.0,0.22,-1,-37,26,-37,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,yes,ancient,introgressed
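
The tables above are sorted per variant, so a gene with several SAVs can take up several rows. To rank genes instead, one option is to keep only the highest-frequency SAV per gene. Here is a minimal sketch. The function name `top_genes_by_af`, the gene names, and the frequencies are hypothetical.

```python
import pandas as pd

def top_genes_by_af(df, af_col, n=10, gene_col="annotation"):
    """Keep the highest-frequency row per gene and return the top n genes.

    Hypothetical helper; after sorting, drop_duplicates keeps the first
    (highest-frequency) row for each gene.
    """
    ranked = df.sort_values(af_col, ascending=False)
    return ranked.drop_duplicates(subset=gene_col).head(n)

# Toy rows, made up for illustration
toy = pd.DataFrame({
    "annotation": ["GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_C"],
    "Browning_introgressed_AF": [0.25, 0.24, 0.22, 0.18, 0.14],
})
top = top_genes_by_af(toy, "Browning_introgressed_AF", n=3)
print(top)
```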


## ASE <a class = 'anchor' id = 'ASE'></a>

Is there overlap between introgressed variants implicated in allele-specific expression from McCoy et al. 2017 and our data?

In [507]:
mccoy_header = ['chrom','pos','ref_allele','alt_allele']
mccoy = pd.read_csv('../ASE/McCoy_et_al_2017_ASE_variants.txt', sep = '\t', names = mccoy_header)
mccoy.head(10)

Unnamed: 0,chrom,pos,ref_allele,alt_allele
0,chr9,34569494,G,T
1,chr10,126696061,T,G
2,chr11,8454637,A,C
3,chr1,222987816,A,T
4,chr2,107872056,T,C
5,chr13,86949835,C,G
6,chr17,8026078,C,G
7,chr4,149556752,G,A
8,chr16,86754480,C,T
9,chr8,130718756,C,G


In [508]:
mccoy_merge = pd.merge(data[['chrom','pos','ref_allele','alt_allele','delta_max','Vernot_allele_origin','Browning_allele_origin','annotation']], mccoy[['chrom','pos','ref_allele','alt_allele']], on=['chrom','pos','ref_allele','alt_allele'], indicator=True)
mccoy_merge.head(10)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,delta_max,Vernot_allele_origin,Browning_allele_origin,annotation,_merge
0,chr1,3383286,G,A,0.0,introgressed,introgressed,ARHGEF16,both
1,chr1,3411396,T,C,0.01,introgressed,introgressed,MEGF6,both
2,chr1,3411561,G,A,0.01,introgressed,introgressed,MEGF6,both
3,chr1,3411652,G,A,0.04,introgressed,introgressed,MEGF6,both
4,chr1,3551792,G,A,0.01,introgressed,introgressed,WRAP73,both
5,chr1,10368282,G,C,0.0,introgressed,introgressed,KIF1B,both
6,chr1,11905995,C,A,0.0,introgressed,introgressed,NPPA,both
7,chr1,12247128,C,T,0.0,introgressed,introgressed,TNFRSF1B,both
8,chr1,12260399,C,T,0.01,introgressed,introgressed,TNFRSF1B,both
9,chr1,12261613,T,C,0.0,introgressed,introgressed,TNFRSF1B,both


In [509]:
len(mccoy_merge)

862

In [510]:
mccoy_SAV = mccoy_merge[mccoy_merge['delta_max'] >= 0.2]

In [511]:
len(mccoy_SAV)

16

In [512]:
mccoy_SAV.head(16)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,delta_max,Vernot_allele_origin,Browning_allele_origin,annotation,_merge
22,chr1,22174518,G,T,0.98,introgressed,introgressed,HSPG2,both
36,chr1,55537474,C,G,0.33,introgressed,low-confidence ancient,USP24,both
46,chr1,161681848,C,T,0.2,introgressed,introgressed,FCRLA,both
90,chr1,212985592,G,A,0.52,introgressed,introgressed,TATDN3,both
170,chr11,86159859,G,A,0.26,introgressed,ancient,ME3,both
269,chr12,133272470,G,T,0.26,introgressed,introgressed,PXMP2,both
270,chr12,133272470,G,T,0.26,introgressed,introgressed,RP13-672B3.2,both
365,chr15,85403496,G,A,0.33,introgressed,introgressed,ALPK3,both
403,chr16,88924425,C,G,0.41,introgressed,introgressed,TRAPPC2L,both
449,chr19,40913595,G,A,0.23,low-confidence ancient,introgressed,PRX,both
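
Note that the merge is per annotation, so one variant can appear on more than one row. In the output above, chr12:133272470 is listed under both PXMP2 and RP13-672B3.2. Here is a minimal sketch of counting rows, unique variants, and unique genes separately. The dataframes below are toy stand-ins with made-up values, not the McCoy et al. data.

```python
import pandas as pd

key = ["chrom", "pos", "ref_allele", "alt_allele"]

# Toy stand-in for `data`: one variant is annotated to two genes
toy_data = pd.DataFrame({
    "chrom": ["chr1", "chr12", "chr12", "chr2"],
    "pos": [100, 200, 200, 300],
    "ref_allele": ["G", "G", "G", "A"],
    "alt_allele": ["T", "T", "T", "C"],
    "delta_max": [0.98, 0.26, 0.26, 0.01],
    "annotation": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
})
# Toy stand-in for the ASE variant list
toy_ase = pd.DataFrame({
    "chrom": ["chr1", "chr12", "chr9"],
    "pos": [100, 200, 400],
    "ref_allele": ["G", "G", "C"],
    "alt_allele": ["T", "T", "A"],
})

# pd.merge performs an inner join by default, keeping shared variants only
merged = pd.merge(toy_data, toy_ase, on=key)
sav = merged[merged["delta_max"] >= 0.2]

n_rows = len(sav)                                  # variant-annotation pairs
n_variants = len(sav.drop_duplicates(subset=key))  # unique variants
n_genes = sav["annotation"].nunique()              # unique genes
print(n_rows, n_variants, n_genes)
```

Because the merge is an inner join, the `_merge` indicator column in the cell above is always `both`.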


# Genes of Evolutionary Significance <a class = 'anchor' id = 'evolutionarysignificantgenes'></a>

Let's look at SAVs among a few genes of evolutionary significance.

In [513]:
data_2[data_2['annotation']=='OAS1']

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
388691,chr12,113355275,G,T,G,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,1.0,5096.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,,no,,0.0,OAS1,1.0792,-0.42591,0.54623,1.7991,-0.137,0.26,0.0,0.0,0.0,0.26,10,1,10,-29,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient


In [514]:
data_2[data_2['annotation']=='EPAS1']

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
793935,chr2,46584859,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,12.0,5096.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,no,,,,,,,,,,,,,,,,no,,0.0,EPAS1,1.0198,-0.15408,0.21127,4.4975,-0.45,0.0,0.0,0.37,0.0,0.37,0,32,0,-24,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient
793982,chr2,46610904,A,G,A,derived,snv,0/1,0/0,0/0,0/0,True,False,False,False,Altai,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,EPAS1,1.0198,-0.15408,0.21127,4.4975,-0.313,0.0,0.23,0.0,0.0,0.23,5,28,-18,4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific


In [515]:
data_2[data_2['annotation']=='ERAP2']

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
1195764,chr5,96235896,A,G,A,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,2812.0,5096.0,0.55,0.53,0.52,0.6,0.58,0.53,0.590643,no,,,,,,,,,,,,,,,,no,,0.55,ERAP2,1.009,-0.07147,0.92859,0.46067,0.46,0.0,0.0,0.0,0.51,0.51,-37,-3,20,-3,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,32.0,yes,ancient,ancient
1195774,chr5,96238551,T,G,T,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,118.0,5096.0,0.02,0.0,0.06,0.0,0.05,0.02,0.000975,yes,T,G,1.0,0.00318,0.00099,0.05043,0.0,0.05368,0.0,0.02352,T,G,,chr5_95966719_96398083,0.025724,yes,1.0,0.02,ERAP2,1.009,-0.07147,0.92859,0.46067,0.455,0.0,0.0,0.53,0.0,0.53,-41,28,-5,41,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,yes,introgressed,introgressed
1195802,chr5,96248413,A,G,A,derived,snv,0/1,0/0,0/0,0/0,True,False,False,False,Altai,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,ERAP2,1.009,-0.07147,0.92859,0.46067,-0.376,0.24,0.01,0.0,0.0,0.24,-26,2,47,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific


ADAMTSL3, from McCoy et al. 2017:

In [516]:
data[data['annotation']=='ADAMTSL3']

Unnamed: 0,chrom,pos,ref_allele,alt_allele,ancestral_allele,anc_dev,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,present_in_1KG,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,1KG_non_ASW_AFR_AF,Vernot_introgressed,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,Vernot_haplotype_tag,Vernot_introgressed_AF,Browning_introgressed,Browning_ref_alt,Browning_introgressed_AF,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,Adipose_Subcutaneous,Adipose_Visceral_Omentum,Adrenal_Gland,Artery_Aorta,Artery_Coronary,Artery_Tibial,Brain_Amygdala,Brain_Anterior_cingulate_cortex_BA24,Brain_Caudate_basal_ganglia,Brain_Cerebellar_Hemisphere,Brain_Cerebellum,Brain_Cortex,Brain_Frontal_Cortex_BA9,Brain_Hippocampus,Brain_Hypothalamus,Brain_Nucleus_accumbens_basal_ganglia,Brain_Putamen_basal_ganglia,Brain_Spinal_cord_cervical_c-1,Brain_Substantia_nigra,Breast_Mammary_Tissue,Cells_Cultured_fibroblasts,Cells_EBV-transformed_lymphocytes,Colon_Sigmoid,Colon_Transverse,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,Heart_Atrial_Appendage,Heart_Left_Ventricle,Kidney_Cortex,Liver,Lung,Minor_Salivary_Gland,Muscle_Skeletal,Nerve_Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin_Not_Sun_Exposed_Suprapubic,Skin_Sun_Exposed_Lower_leg,Small_Intestine_Terminal_Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole_Blood,N_GTEx_tissues,sQTL,Vernot_allele_origin,Browning_allele_origin
546574,chr15,84322854,T,G,T,derived,snv,0/0,0/0,./.,0/1,False,False,False,True,Vindija,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,ADAMTSL3,0.95678,0.4755,0.5073,4.3946,0.334,0.00,0.00,0.0,0.00,0.00,-11,40,-1,-11,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
546575,chr15,84324441,G,A,G,derived,snv,0/0,0/1,0/0,0/1,False,True,False,True,Late Neanderthal,yes,743.0,5096.0,0.15,0.06,0.28,0.02,0.37,0.12,0.002924,no,,,,,,,,,,,,,,,,no,,0.15,ADAMTSL3,0.95678,0.4755,0.5073,4.3946,0.561,0.00,0.00,0.0,0.00,0.00,40,5,13,-32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,yes,ancient,ancient
546576,chr15,84324778,C,T,C,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,ADAMTSL3,0.95678,0.4755,0.5073,4.3946,-0.254,0.00,0.00,0.0,0.00,0.00,14,-31,-30,48,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
546577,chr15,84325970,C,T,C,derived,snv,1/1,1/1,0/0,0/1,True,True,False,True,Neanderthal,yes,738.0,5096.0,0.14,0.06,0.28,0.02,0.37,0.11,0.002924,no,,,,,,,,,,,,,,,,yes,1.0,0.14,ADAMTSL3,0.95678,0.4755,0.5073,4.3946,0.561,0.00,0.00,0.0,0.00,0.00,-2,38,3,-4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,ancient,introgressed
546578,chr15,84326162,G,A,G,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,no,,,,,,,,,,no,,,,,,,,,,,,,,,,no,,,ADAMTSL3,0.95678,0.4755,0.5073,4.3946,0.561,0.00,0.05,0.0,0.00,0.05,11,-45,2,42,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,archaic-specific,archaic-specific
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
547008,chr15,84703470,C,T,C,derived,snv,1/1,0/0,0/0,0/1,True,False,False,True,Other,yes,670.0,5096.0,0.13,0.09,0.25,0.01,0.17,0.18,0.003899,yes,C,T,1.0,0.03822,0.00397,0.17291,0.08730,0.24751,0.0,0.18200,C,T,,chr15_84263508_84707851,0.138738,yes,1.0,0.13,ADAMTSL3,0.95678,0.4755,0.5073,4.3946,0.491,0.01,0.00,0.0,0.00,0.01,14,-27,14,13,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,8.0,yes,introgressed,introgressed
547009,chr15,84703874,A,G,A,derived,snv,1/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,yes,646.0,5096.0,0.13,0.05,0.25,0.01,0.18,0.19,0.003899,no,,,,,,,,,,,,,,,,yes,1.0,0.13,ADAMTSL3,0.95678,0.4755,0.5073,4.3946,0.334,0.00,0.00,0.0,0.00,0.00,-46,36,-5,-48,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,8.0,yes,ancient,introgressed
547010,chr15,84705645,C,G,C,derived,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,yes,19.0,5096.0,0.00,0.00,0.00,0.01,0.00,0.00,0.014620,no,,,,,,,,,,,,,,,,no,,0.00,ADAMTSL3,0.95678,0.4755,0.5073,4.3946,-0.171,0.00,0.00,0.0,0.00,0.00,-5,9,-5,39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,no,low-confidence ancient,low-confidence ancient
547011,chr15,84706461,C,T,C,derived,snv,1/1,0/0,0/0,0/1,True,False,False,True,Other,yes,665.0,5096.0,0.13,0.05,0.28,0.01,0.18,0.18,0.003899,yes,C,T,1.0,0.03503,0.00397,0.17867,0.05357,0.27336,0.0,0.18200,C,T,,chr15_84263508_84707851,0.138314,yes,1.0,0.13,ADAMTSL3,0.95678,0.4755,0.5073,4.3946,0.591,0.00,0.19,0.0,0.01,0.19,22,-9,49,17,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,9.0,yes,introgressed,introgressed


# Average SAVs in 1KG <a class = 'anchor' id = 'avg1KGSAVs'></a>

Is the distribution of SAVs among the archaics unusual? To provide a modern human comparison, we load SpliceAI annotations for variants in 24 individuals from the 1000 Genomes Project (1KG).

In [517]:
avg_SAV_header = ['chrom','pos','ref_allele','HG00137','HG00338','HG00619','HG01198','HG01281','HG01524','HG01802','HG02142','HG02345','HG02629','HG03060','HG03190','HG03708','HG03711','HG03800','HG04014','NA11830','NA18552','NA18868','NA19011','NA19452','NA19741','NA20537','NA21141','alt_allele','annotation','ag_delta','al_delta','dg_delta','dl_delta','ag_pos','al_pos','dg_pos','dl_pos']
avg_SAV = pd.read_csv('../thousand_genomes/thousand_genomes_avg_SAVs/spliceai_1KG_snvs.txt', sep = '\t', names = avg_SAV_header)
avg_SAV.head(10)



Unnamed: 0,chrom,pos,ref_allele,HG00137,HG00338,HG00619,HG01198,HG01281,HG01524,HG01802,HG02142,HG02345,HG02629,HG03060,HG03190,HG03708,HG03711,HG03800,HG04014,NA11830,NA18552,NA18868,NA19011,NA19452,NA19741,NA20537,NA21141,alt_allele,annotation,ag_delta,al_delta,dg_delta,dl_delta,ag_pos,al_pos,dg_pos,dl_pos
0,10,47460,A,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,1|0,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,C,TUBB8,0.0,0.02,0.0,0.0,10,-22,-43,-22
1,10,47754,T,./.,./.,./.,./.,0|1,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,C,TUBB8,0.09,0.0,0.0,0.0,-1,-2,-1,18
2,10,47792,A,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,0|1,./.,./.,G,TUBB8,0.0,0.0,0.0,0.0,34,-8,-20,-7
3,10,47804,T,./.,./.,./.,./.,./.,1|0,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,C,TUBB8,0.0,0.0,0.0,0.0,22,-20,-32,0
4,10,47876,C,1|0,0|1,./.,0|1,0|1,1|0,./.,0|1,./.,1|0,1|0,1|0,1|0,0|1,1|0,1|0,0|1,0|1,./.,./.,0|1,0|1,./.,0|1,T,TUBB8,0.03,0.0,0.0,0.0,-7,12,-6,-8
5,10,48005,G,1|0,0|1,./.,0|1,1|0,0|1,./.,0|1,./.,1|0,1|0,0|1,1|0,1|0,1|1,1|0,0|1,0|1,0|1,./.,0|1,0|1,./.,0|1,A,TUBB8,0.0,0.0,0.0,0.0,-7,-31,2,-31
6,10,48078,T,0|1,./.,./.,0|1,0|1,./.,./.,1|0,1|0,./.,./.,./.,0|1,0|1,./.,./.,./.,1|0,./.,./.,./.,./.,1|0,1|0,C,TUBB8,0.0,0.0,0.0,0.0,36,37,-27,37
7,10,48221,T,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,1|0,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,C,TUBB8,0.0,0.14,0.0,0.0,-1,5,5,-2
8,10,48232,G,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,0|1,./.,./.,./.,./.,./.,./.,./.,./.,./.,A,TUBB8,0.23,0.14,0.0,0.0,-6,50,-5,50
9,10,48249,C,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,1|0,1|0,./.,./.,T,TUBB8,0.14,0.31,0.0,0.0,33,-23,-47,33


In [518]:
avg_SAV = avg_SAV[(avg_SAV['chrom'] != 'X')]
len(avg_SAV)

4582422
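
The `chrom` column here holds both numeric labels (such as `10`) and `X`, so pandas may infer mixed types when it reads the file. A minimal sketch of one way to keep the column consistent is to read it as a string; the toy rows and the name `toy_df` below are hypothetical and only stand in for the real file.

```python
import io
import pandas as pd

# Hypothetical tab-separated rows standing in for spliceai_1KG_snvs.txt.
toy = "10\t47460\tA\n10\t47754\tT\nX\t2700157\tG\n"

# Reading 'chrom' as str keeps '10' and 'X' in a single, consistent dtype.
toy_df = pd.read_csv(
    io.StringIO(toy),
    sep="\t",
    names=["chrom", "pos", "ref_allele"],
    dtype={"chrom": str},
)

# Drop X-linked sites, mirroring the filter in the cell above.
autosomal = toy_df[toy_df["chrom"] != "X"]
print(len(autosomal))
```

With a string column, the comparison against `'X'` behaves the same way for every row, regardless of how pandas chunked the file.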

In [519]:
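# X-linked sites were already dropped in the previous cell; repeating the filter lets this cell run on its own.
avg_SAV = avg_SAV[(avg_SAV['chrom'] != 'X')]
# Keep SAVs only: variants with at least one SpliceAI delta >= 0.2.
avg_SAV = avg_SAV[(avg_SAV['ag_delta'] >= 0.2) | (avg_SAV['al_delta'] >= 0.2) | (avg_SAV['dg_delta'] >= 0.2) | (avg_SAV['dl_delta'] >= 0.2)]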
avg_SAV.head(10)

Unnamed: 0,chrom,pos,ref_allele,HG00137,HG00338,HG00619,HG01198,HG01281,HG01524,HG01802,HG02142,HG02345,HG02629,HG03060,HG03190,HG03708,HG03711,HG03800,HG04014,NA11830,NA18552,NA18868,NA19011,NA19452,NA19741,NA20537,NA21141,alt_allele,annotation,ag_delta,al_delta,dg_delta,dl_delta,ag_pos,al_pos,dg_pos,dl_pos
8,10,48232,G,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,0|1,./.,./.,./.,./.,./.,./.,./.,./.,./.,A,TUBB8,0.23,0.14,0.0,0.0,-6,50,-5,50
9,10,48249,C,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,1|0,1|0,./.,./.,T,TUBB8,0.14,0.31,0.0,0.0,33,-23,-47,33
262,10,220176,A,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,1|0,./.,./.,./.,G,ZMYND11,0.0,0.0,0.27,0.0,0,39,-1,-32
417,10,288178,C,./.,./.,./.,./.,./.,./.,./.,./.,./.,1|0,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,G,DIP2C,0.01,0.28,0.0,0.0,-38,0,-18,1
675,10,349332,T,1|0,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,C,DIP2C,0.0,0.0,0.12,0.21,-14,-2,-14,-1
1229,10,487391,A,1|1,1|1,1|1,1|0,1|1,1|1,1|1,1|0,1|1,0|1,./.,1|1,1|1,1|1,0|1,0|1,1|1,1|1,0|1,1|1,./.,1|1,1|1,1|1,G,DIP2C,0.0,0.0,0.03,0.46,-16,2,-16,2
1996,10,662757,T,./.,./.,0|1,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,0|1,./.,./.,C,DIP2C,0.0,0.0,0.45,0.0,1,-3,1,-3
2564,10,996086,A,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,1|0,./.,./.,./.,./.,./.,./.,./.,./.,G,GTPBP4,0.22,0.0,0.11,0.0,1,20,-49,49
2638,10,1008958,G,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,./.,0|1,./.,./.,./.,./.,./.,./.,./.,A,GTPBP4,0.07,0.37,0.0,0.0,25,0,25,0
2648,10,1010007,A,./.,./.,1|1,./.,0|1,./.,1|0,./.,./.,./.,./.,./.,./.,./.,./.,./.,0|1,./.,./.,./.,./.,1|1,./.,./.,G,GTPBP4,0.0,0.0,0.57,0.0,-40,17,-5,-28


In [520]:
len(avg_SAV)

14006
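
The four-way "or" filter used for the SAV threshold is equivalent to thresholding the row-wise maximum delta. The sketch below checks this on three rows copied from the head of `avg_SAV` shown earlier; the names `toy`, `either`, and `by_max` are illustrative only.

```python
import pandas as pd

delta_cols = ["ag_delta", "al_delta", "dg_delta", "dl_delta"]

# Three rows copied from the avg_SAV head (positions 47460, 48232, 48249).
toy = pd.DataFrame({
    "pos": [47460, 48232, 48249],
    "ag_delta": [0.0, 0.23, 0.14],
    "al_delta": [0.02, 0.14, 0.31],
    "dg_delta": [0.0, 0.0, 0.0],
    "dl_delta": [0.0, 0.0, 0.0],
})

# Filter written as in the notebook: any single delta at or above 0.2.
either = toy[
    (toy["ag_delta"] >= 0.2)
    | (toy["al_delta"] >= 0.2)
    | (toy["dg_delta"] >= 0.2)
    | (toy["dl_delta"] >= 0.2)
]

# Same filter expressed through the maximum delta per row.
by_max = toy[toy[delta_cols].max(axis=1) >= 0.2]

print(either.equals(by_max))
```

The maximum-based form is the same quantity as the `delta_max` column used elsewhere in this notebook, which makes the threshold easier to apply consistently.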

In [521]:
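# Recode genotypes as presence (1) or absence (0) of the alternate allele; './.' is counted as absent.
avg_SAV = avg_SAV.replace({'./.': 0, '0|1': 1, '1|0': 1, '1|1': 1})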
avg_SAV.head(10)

Unnamed: 0,chrom,pos,ref_allele,HG00137,HG00338,HG00619,HG01198,HG01281,HG01524,HG01802,HG02142,HG02345,HG02629,HG03060,HG03190,HG03708,HG03711,HG03800,HG04014,NA11830,NA18552,NA18868,NA19011,NA19452,NA19741,NA20537,NA21141,alt_allele,annotation,ag_delta,al_delta,dg_delta,dl_delta,ag_pos,al_pos,dg_pos,dl_pos
8,10,48232,G,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,A,TUBB8,0.23,0.14,0.0,0.0,-6,50,-5,50
9,10,48249,C,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,T,TUBB8,0.14,0.31,0.0,0.0,33,-23,-47,33
262,10,220176,A,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,G,ZMYND11,0.0,0.0,0.27,0.0,0,39,-1,-32
417,10,288178,C,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,G,DIP2C,0.01,0.28,0.0,0.0,-38,0,-18,1
675,10,349332,T,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,C,DIP2C,0.0,0.0,0.12,0.21,-14,-2,-14,-1
1229,10,487391,A,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1,G,DIP2C,0.0,0.0,0.03,0.46,-16,2,-16,2
1996,10,662757,T,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,C,DIP2C,0.0,0.0,0.45,0.0,1,-3,1,-3
2564,10,996086,A,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,G,GTPBP4,0.22,0.0,0.11,0.0,1,20,-49,49
2638,10,1008958,G,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,A,GTPBP4,0.07,0.37,0.0,0.0,25,0,25,0
2648,10,1010007,A,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,G,GTPBP4,0.0,0.0,0.57,0.0,-40,17,-5,-28
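
The recoding above counts carriers rather than alleles (`1|1` becomes 1, not 2) and is applied to the whole dataframe. A minimal alternative sketch, assuming the genotypes are phased strings, restricts the recoding to the sample columns and also handles a `0|0` call should one occur. The helper `carries_alt` and the toy frame are hypothetical.

```python
import pandas as pd

# Hypothetical two-sample frame; the sample IDs come from avg_SAV_header.
toy = pd.DataFrame({
    "ref_allele": ["G", "C", "A"],
    "HG00137": ["./.", "1|0", "1|1"],
    "HG00338": ["0|1", "./.", "0|0"],
})
sample_cols = ["HG00137", "HG00338"]


def carries_alt(gt: str) -> int:
    """Return 1 if a phased genotype string has at least one alternate allele."""
    return int("1" in gt.split("|"))


# Recode only the genotype columns; annotation columns stay untouched.
carrier = toy.copy()
carrier[sample_cols] = toy[sample_cols].apply(lambda col: col.map(carries_alt))

# Per-individual counts of carried variants.
print(carrier[sample_cols].sum(axis=0))
```

Limiting the recoding to `sample_cols` avoids accidentally rewriting a matching string in any other column.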


In [522]:
# Count the SAVs carried by each 1KG individual.
avg_SAV[['HG00137','HG00338','HG00619','HG01198','HG01281','HG01524','HG01802','HG02142','HG02345','HG02629','HG03060','HG03190','HG03708','HG03711','HG03800','HG04014','NA11830','NA18552','NA18868','NA19011','NA19452','NA19741','NA20537','NA21141']].sum(axis = 0, skipna = True)

HG00137    3266
HG00338    3344
HG00619    3320
HG01198    3514
HG01281    3493
HG01524    3363
HG01802    3363
HG02142    3349
HG02345    3385
HG02629    4120
HG03060    4160
HG03190    4142
HG03708    3368
HG03711    3379
HG03800    3435
HG04014    3421
NA11830    3200
NA18552    3382
NA18868    4102
NA19011    3321
NA19452    4078
NA19741    3284
NA20537    3168
NA21141    3487
dtype: int64
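
To compare these per-individual counts with the archaics, it may help to summarize them first. The sketch below copies the counts from the output above into a Series and computes a few summary statistics; `sav_counts` and `summary` are illustrative names, and the archaic counts themselves are not included here.

```python
import pandas as pd

# Per-individual SAV counts copied from the output above.
sav_counts = pd.Series({
    "HG00137": 3266, "HG00338": 3344, "HG00619": 3320, "HG01198": 3514,
    "HG01281": 3493, "HG01524": 3363, "HG01802": 3363, "HG02142": 3349,
    "HG02345": 3385, "HG02629": 4120, "HG03060": 4160, "HG03190": 4142,
    "HG03708": 3368, "HG03711": 3379, "HG03800": 3435, "HG04014": 3421,
    "NA11830": 3200, "NA18552": 3382, "NA18868": 4102, "NA19011": 3321,
    "NA19452": 4078, "NA19741": 3284, "NA20537": 3168, "NA21141": 3487,
})

# Mean, standard deviation, and range across the 24 individuals.
summary = sav_counts.agg(["mean", "std", "min", "max"])
print(summary)
```

An archaic individual's SAV count could then be placed against this range, keeping in mind that coverage and genotype calling differ between the archaic and modern genomes.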

# Introgressed Reference Alleles <a class = 'anchor' id = 'introgressedrefs'></a>

Let's check whether swapping the reference and alternate alleles changes the SpliceAI predictions for our introgressed variants.

Load the original dataframe, in which the introgressed variants are the reference alleles.

In [523]:
introgressed_ref_tag_header = ['chrom','pos','ref_allele','alt_allele','variant_type','altai_gt','chagyrskaya_gt','denisovan_gt','vindija_gt','altai_gt_boolean','chagyrskaya_gt_boolean','denisovan_gt_boolean','vindija_gt_boolean','distribution','annotation','mis_oe','mis_z','lof_oe','lof_z','phyloP','ag_delta','al_delta','dg_delta','dl_delta','delta_max','ag_pos','al_pos','dg_pos','dl_pos','ancestral_allele','temp','anc_dev','1KG_allele_count','1KG_allele_number','1KG_allele_frequency','1KG_EAS_AF','1KG_EUR_AF','1KG_AFR_AF','1KG_AMR_AF','1KG_SAS_AF','present_in_1KG','1KG_non_ASW_AFR_AF','start','Vernot_ancestral_allele','Vernot_derived_allele','Vernot_ancestral_derived_code','Vernot_AFA_AF','Vernot_AFR_AF','Vernot_AMR_AF','Vernot_EAS_AF','Vernot_EUR_AF','Vernot_PNG_AF','Vernot_SAS_AF','Vernot_Denisovan_base','Vernot_haplotype_tag','Vernot_Neanderthal_base_1','Vernot_Neanderthal_base_2']
introgressed_ref_tag = pd.read_csv('introgressed_ref_tag.txt', sep = '\t', names = introgressed_ref_tag_header)
introgressed_ref_tag.head(10)

Unnamed: 0,chrom,pos,ref_allele,alt_allele,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,ancestral_allele,temp,anc_dev,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,present_in_1KG,1KG_non_ASW_AFR_AF,start,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_haplotype_tag,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2
0,chr1,3046711,T,G,snv,0/1,0/0,0/0,1/1,True,False,False,True,Other,PRDM16,0.87043,1.3471,0.081747,5.9522,0.379,0.0,0.0,0.0,0.0,0.0,-4,35,-24,19,T,T,derived,112.0,5096.0,0.02,0.0,0.06,0.0,0.05,0.01,yes,0.0,3046710.0,T,G,1.0,0.01274,0.0,0.04611,0.0,0.06362,0.0,0.01125,T,chr1_3015134_3046826,T,G
1,chr1,3046826,G,T,snv,0/1,0/0,0/0,1/1,True,False,False,True,Other,PRDM16,0.87043,1.3471,0.081747,5.9522,0.462,0.0,0.0,0.0,0.0,0.0,-49,-1,11,-1,G,G,derived,112.0,5096.0,0.02,0.0,0.06,0.0,0.05,0.01,yes,0.0,3046825.0,G,T,1.0,0.01274,0.0,0.04611,0.0,0.06362,0.0,0.01125,G,chr1_3015134_3046826,G,T
2,chr1,3188156,G,A,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,PRDM16,0.87043,1.3471,0.081747,5.9522,0.491,0.0,0.0,0.0,0.0,0.0,-27,9,4,-1,G,G,derived,301.0,5096.0,0.06,0.18,0.0,0.0,0.04,0.08,yes,0.0,3188155.0,G,A,1.0,0.00318,0.0,0.04323,0.18254,0.00298,0.07407,0.07975,G,chr1_3066509_3209504,G,A
3,chr1,3199291,C,T,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,PRDM16,0.87043,1.3471,0.081747,5.9522,0.297,0.0,0.0,0.0,0.0,0.0,28,30,28,-23,C,C,derived,447.0,5096.0,0.09,0.25,0.01,0.0,0.13,0.1,yes,0.0,3199290.0,C,T,1.0,0.00955,0.0,0.1268,0.24603,0.00596,0.11111,0.09509,C,chr1_3066509_3209504,C,T
4,chr1,3199727,C,T,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,PRDM16,0.87043,1.3471,0.081747,5.9522,0.436,0.0,0.0,0.0,0.0,0.0,8,44,-2,44,C,C,derived,447.0,5096.0,0.09,0.25,0.01,0.0,0.13,0.1,yes,0.0,3199726.0,C,T,1.0,0.00955,0.0,0.1268,0.24603,0.00596,0.11111,0.09509,C,chr1_3066509_3209504,C,T
5,chr1,3209504,G,A,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,PRDM16,0.87043,1.3471,0.081747,5.9522,-0.438,0.0,0.0,0.0,0.0,0.0,20,1,-12,-1,G,G,derived,444.0,5096.0,0.09,0.24,0.01,0.0,0.13,0.09,yes,0.0,3209503.0,G,A,1.0,0.00955,0.0,0.1268,0.24405,0.00596,0.11111,0.09305,G,chr1_3066509_3209504,G,A
6,chr1,3210305,T,A,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,PRDM16,0.87043,1.3471,0.081747,5.9522,-0.379,0.0,0.0,0.0,0.0,0.0,2,7,-4,-21,T,T,derived,443.0,5096.0,0.09,0.25,0.01,0.0,0.13,0.09,yes,0.0,3210304.0,T,A,1.0,0.00955,0.0,0.1268,0.24405,0.00596,0.11111,0.091,T,chr1_3166567_3212428,T,A
7,chr1,3212428,A,C,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,PRDM16,0.87043,1.3471,0.081747,5.9522,-0.445,0.0,0.0,0.0,0.01,0.01,3,11,23,11,A,A,derived,452.0,5096.0,0.09,0.26,0.01,0.0,0.13,0.09,yes,0.0,3212427.0,A,C,1.0,0.00955,0.0,0.13112,0.25595,0.00696,0.11111,0.08384,A,chr1_3166567_3212428,A,C
8,chr1,3240652,C,T,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,PRDM16,0.87043,1.3471,0.081747,5.9522,-0.479,0.0,0.0,0.0,0.0,0.0,10,50,29,9,C,C,derived,26.0,5096.0,0.01,0.0,0.0,0.0,0.0,0.02,yes,0.004873,3240651.0,C,T,1.0,0.00318,0.00397,0.00144,0.0,0.0,0.18519,0.02045,T,chr1_3195401_3250112,C,
9,chr1,3240725,T,C,snv,0/1,0/0,0/0,0/1,True,False,False,True,Other,PRDM16,0.87043,1.3471,0.081747,5.9522,-1.193,0.0,0.0,0.0,0.0,0.0,-39,29,40,-2,T,T,derived,394.0,5096.0,0.08,0.16,0.01,0.0,0.14,0.12,yes,0.0,3240724.0,T,C,1.0,0.00955,0.0,0.13833,0.16171,0.00596,0.11111,0.11656,T,chr1_3240724_3259500,T,C


In [524]:
len(introgressed_ref_tag)

7977

In [525]:
SAV_introgressed_ref_tag = introgressed_ref_tag[introgressed_ref_tag['delta_max']>=0.2]
len(SAV_introgressed_ref_tag)

26

In [526]:
SAV_introgressed_ref_tag

Unnamed: 0,chrom,pos,ref_allele,alt_allele,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,ancestral_allele,temp,anc_dev,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,present_in_1KG,1KG_non_ASW_AFR_AF,start,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_haplotype_tag,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2
295,chr1,52820680,C,T,snv,0/1,0/0,0/0,0/1,True,False,False,True,Other,CC2D1B,1.1262,-1.0089,0.69443,2.0388,-0.14,0.23,0.0,0.0,0.01,0.23,-22,-40,-22,34,C,C,derived,90.0,5096.0,0.02,0.03,0.0,0.0,0.0,0.05,yes,0.0,52820679.0,C,T,1.0,0.00637,0.0,0.0,0.02877,0.00497,0.0,0.05521,C,chr1_52325475_53062486,C,T
476,chr1,152944501,A,T,snv,0/1,0/0,0/0,0/1,True,False,False,True,Other,SPRR4,1.4955,-1.106,0.71438,0.44285,0.383,0.23,0.0,0.0,0.0,0.23,13,2,12,-1,A,A,derived,2.0,5096.0,0.0,0.0,0.0,0.0,0.0,0.0,yes,0.0,152944500.0,A,T,1.0,0.0,0.0,0.0,0.00198,0.0,0.0,0.0,A,chr1_152942848_153033467,A,T
557,chr1,183750131,C,A,snv,0/1,0/0,0/0,0/0,True,False,False,False,Altai,RGL1,0.70413,2.2238,0.17753,5.1161,0.655,0.0,0.0,0.26,0.01,0.26,-7,-11,-7,-11,C,C,derived,50.0,5096.0,0.01,0.05,0.0,0.0,0.0,0.0,yes,0.0,183750130.0,C,A,1.0,0.0,0.0,0.0,0.04861,0.0,0.01852,0.0,C,chr1_183411087_183827254,C,A
800,chr1,216945756,C,A,snv,0/1,0/1,0/0,1/1,True,True,False,True,Neanderthal,ESRRG,0.56557,2.5557,0.28379,3.0516,0.655,0.2,0.0,0.0,0.0,0.2,-27,47,-27,48,C,C,derived,169.0,5096.0,0.03,0.14,0.0,0.0,0.01,0.02,yes,0.0,216945755.0,C,A,1.0,0.0,0.0,0.01153,0.13988,0.0,0.0,0.0184,C,chr1_216944862_216964892,C,A
998,chr10,17373518,T,C,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,ST8SIA6,1.0898,-0.43129,1.0086,-0.029733,0.533,0.05,0.22,0.0,0.0,0.22,33,-19,30,-19,T,T,derived,56.0,5096.0,0.01,0.0,0.04,0.0,0.02,0.0,yes,0.0,17373517.0,T,C,1.0,0.01274,0.0,0.01729,0.0,0.03479,0.01852,0.00307,T,chr10_17338795_17417730,T,C
1617,chr11,84191111,T,C,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,DLG2,0.74745,2.0762,0.21228,5.7123,-0.959,0.0,0.0,0.0,0.4,0.4,1,29,1,-1,T,T,derived,12.0,5096.0,0.0,0.0,0.0,0.0,0.0,0.01,yes,0.0,84191110.0,T,C,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01227,T,chr11_84051845_84211811,T,C
1669,chr11,99345359,C,G,snv,0/1,0/1,0/0,1/1,True,True,False,True,Neanderthal,CNTN5,1.0393,-0.33557,0.3272,4.6242,-0.229,0.0,0.0,0.67,0.0,0.67,-4,48,-5,37,C,C,derived,177.0,5096.0,0.03,0.11,0.0,0.0,0.0,0.07,yes,0.0,99345358.0,C,G,1.0,0.00318,0.0,0.0,0.10615,0.0,0.11111,0.06748,C,chr11_99309226_99359652,C,G
3230,chr17,14109360,T,G,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,COX10,1.0316,-0.18011,0.29042,2.7283,0.338,0.2,0.04,0.0,0.0,0.2,1,38,50,-23,T,T,derived,155.0,5096.0,0.03,0.08,0.0,0.0,0.08,0.01,yes,0.0,14109359.0,T,G,1.0,0.00637,0.0,0.08501,0.08333,0.0,0.16667,0.00818,G,chr17_13902596_14111504,T,
3473,chr18,60003587,G,A,snv,0/1,1/1,1/1,1/1,True,True,True,True,Shared,TNFRSF11A,0.80508,1.2894,0.33078,3.0497,0.65,0.0,0.31,0.0,0.28,0.31,3,-40,3,33,G,G,derived,444.0,5096.0,0.09,0.0,0.25,0.01,0.12,0.1,yes,0.003899,60003586.0,G,A,1.0,0.01592,0.00397,0.11671,0.00198,0.2495,0.09259,0.09714,A,chr18_59569521_60089689,G,A
3541,chr19,33410289,T,C,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,CEP89,0.93498,0.47493,0.93973,0.3644,-0.642,0.47,0.0,0.71,0.0,0.71,41,-29,5,-29,C,C,ancestral,4744.0,5096.0,0.93,0.9,0.86,0.99,0.93,0.96,yes,0.999025,33410288.0,C,T,1.0,0.03503,0.00099,0.06916,0.09821,0.14612,0.07407,0.0409,C,chr19_33040186_33456652,T,


Now load the SpliceAI-annotated dataframe in which these variants are the alternate alleles.

In [527]:
swapped_header = ['chrom','pos','ref_allele','altai_gt','chagyrskaya_gt','denisovan_gt','vindija_gt','alt_allele','annotation','ag_delta','al_delta','dg_delta','dl_delta','ag_pos','al_pos','dg_pos','dl_pos']
swapped = pd.read_csv('../reference_vs_alternate_allele_delta_test/introgressed_spliceai_annotated_refs.txt', sep = '\t', names = swapped_header)
swapped.head(10)

Unnamed: 0,chrom,pos,ref_allele,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,alt_allele,annotation,ag_delta,al_delta,dg_delta,dl_delta,ag_pos,al_pos,dg_pos,dl_pos
0,chr10,1098863,G,1/1,1/1,1/1,1/1,A,WDR37,0.0,0.0,0.0,0.0,-16,-15,-16,45
1,chr10,1113464,T,1/1,1/1,1/1,1/1,A,WDR37,0.0,0.0,0.0,0.0,2,48,18,-6
2,chr10,1175546,G,1/1,1/1,1/1,1/1,A,WDR37,0.0,0.0,0.0,0.0,13,-30,-5,7
3,chr10,1230214,C,1/1,1/1,1/1,1/1,T,ADARB2,0.0,0.0,0.0,0.0,23,-48,45,-22
4,chr10,1460803,A,1/1,1/1,1/1,1/1,C,ADARB2,0.0,0.0,0.0,0.0,45,-8,45,2
5,chr10,1464483,A,1/1,1/1,1/1,1/1,G,ADARB2,0.0,0.0,0.0,0.0,44,-16,-47,37
6,chr10,1682361,A,1/1,1/1,1/1,1/1,T,ADARB2,0.0,0.0,0.0,0.0,31,1,-47,2
7,chr10,1684310,C,1/1,1/1,1/1,1/1,T,ADARB2,0.0,0.0,0.0,0.0,9,1,26,1
8,chr10,1686465,C,1/1,1/1,1/1,1/1,T,ADARB2,0.0,0.0,0.0,0.0,-18,6,3,6
9,chr10,1710328,G,1/1,1/1,1/1,1/1,A,ADARB2,0.0,0.0,0.0,0.0,-22,-41,35,-41


This dataframe has no delta_max column, so we compute it as the row-wise maximum of the four deltas for comparison.

In [528]:
swapped['delta_max'] = swapped[['ag_delta','al_delta','dg_delta','dl_delta']].max(axis = 1)

In [529]:
len(swapped)

8709

In [530]:
SAV_swapped = swapped[swapped['delta_max']>=0.2]
len(SAV_swapped)

24

Now merge the two dataframes on chromosome, position, and annotation.

In [531]:
introgressed_ref_tag_merge = pd.merge(introgressed_ref_tag, swapped, on = ['chrom','pos','annotation'])
introgressed_ref_tag_merge.head(10)

Unnamed: 0,chrom,pos,ref_allele_x,alt_allele_x,variant_type,altai_gt_x,chagyrskaya_gt_x,denisovan_gt_x,vindija_gt_x,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta_x,al_delta_x,dg_delta_x,dl_delta_x,delta_max_x,ag_pos_x,al_pos_x,dg_pos_x,dl_pos_x,ancestral_allele,temp,anc_dev,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,present_in_1KG,1KG_non_ASW_AFR_AF,start,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_haplotype_tag,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,ref_allele_y,altai_gt_y,chagyrskaya_gt_y,denisovan_gt_y,vindija_gt_y,alt_allele_y,ag_delta_y,al_delta_y,dg_delta_y,dl_delta_y,ag_pos_y,al_pos_y,dg_pos_y,dl_pos_y,delta_max_y
0,chr1,3046711,T,G,snv,0/1,0/0,0/0,1/1,True,False,False,True,Other,PRDM16,0.87043,1.3471,0.081747,5.9522,0.379,0.0,0.0,0.0,0.0,0.0,-4,35,-24,19,T,T,derived,112.0,5096.0,0.02,0.0,0.06,0.0,0.05,0.01,yes,0.0,3046710.0,T,G,1.0,0.01274,0.0,0.04611,0.0,0.06362,0.0,0.01125,T,chr1_3015134_3046826,T,G,G,1/1,1/1,1/1,1/1,T,0.0,0.0,0.0,0.0,35,-4,19,-24,0.0
1,chr1,3046826,G,T,snv,0/1,0/0,0/0,1/1,True,False,False,True,Other,PRDM16,0.87043,1.3471,0.081747,5.9522,0.462,0.0,0.0,0.0,0.0,0.0,-49,-1,11,-1,G,G,derived,112.0,5096.0,0.02,0.0,0.06,0.0,0.05,0.01,yes,0.0,3046825.0,G,T,1.0,0.01274,0.0,0.04611,0.0,0.06362,0.0,0.01125,G,chr1_3015134_3046826,G,T,T,1/1,1/1,1/1,1/1,G,0.0,0.0,0.0,0.0,-1,-49,-1,11,0.0
2,chr1,3188156,G,A,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,PRDM16,0.87043,1.3471,0.081747,5.9522,0.491,0.0,0.0,0.0,0.0,0.0,-27,9,4,-1,G,G,derived,301.0,5096.0,0.06,0.18,0.0,0.0,0.04,0.08,yes,0.0,3188155.0,G,A,1.0,0.00318,0.0,0.04323,0.18254,0.00298,0.07407,0.07975,G,chr1_3066509_3209504,G,A,A,1/1,1/1,1/1,1/1,G,0.0,0.0,0.0,0.0,9,-27,-1,4,0.0
3,chr1,3199291,C,T,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,PRDM16,0.87043,1.3471,0.081747,5.9522,0.297,0.0,0.0,0.0,0.0,0.0,28,30,28,-23,C,C,derived,447.0,5096.0,0.09,0.25,0.01,0.0,0.13,0.1,yes,0.0,3199290.0,C,T,1.0,0.00955,0.0,0.1268,0.24603,0.00596,0.11111,0.09509,C,chr1_3066509_3209504,C,T,T,1/1,1/1,1/1,1/1,C,0.0,0.0,0.0,0.0,30,28,-23,28,0.0
4,chr1,3199727,C,T,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,PRDM16,0.87043,1.3471,0.081747,5.9522,0.436,0.0,0.0,0.0,0.0,0.0,8,44,-2,44,C,C,derived,447.0,5096.0,0.09,0.25,0.01,0.0,0.13,0.1,yes,0.0,3199726.0,C,T,1.0,0.00955,0.0,0.1268,0.24603,0.00596,0.11111,0.09509,C,chr1_3066509_3209504,C,T,T,1/1,1/1,1/1,1/1,C,0.0,0.0,0.0,0.0,44,8,44,-2,0.0
5,chr1,3209504,G,A,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,PRDM16,0.87043,1.3471,0.081747,5.9522,-0.438,0.0,0.0,0.0,0.0,0.0,20,1,-12,-1,G,G,derived,444.0,5096.0,0.09,0.24,0.01,0.0,0.13,0.09,yes,0.0,3209503.0,G,A,1.0,0.00955,0.0,0.1268,0.24405,0.00596,0.11111,0.09305,G,chr1_3066509_3209504,G,A,A,1/1,1/1,1/1,1/1,G,0.0,0.0,0.0,0.0,1,20,-1,-12,0.0
6,chr1,3210305,T,A,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,PRDM16,0.87043,1.3471,0.081747,5.9522,-0.379,0.0,0.0,0.0,0.0,0.0,2,7,-4,-21,T,T,derived,443.0,5096.0,0.09,0.25,0.01,0.0,0.13,0.09,yes,0.0,3210304.0,T,A,1.0,0.00955,0.0,0.1268,0.24405,0.00596,0.11111,0.091,T,chr1_3166567_3212428,T,A,A,1/1,1/1,1/1,1/1,T,0.0,0.0,0.0,0.0,7,2,-21,-4,0.0
7,chr1,3212428,A,C,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,PRDM16,0.87043,1.3471,0.081747,5.9522,-0.445,0.0,0.0,0.0,0.01,0.01,3,11,23,11,A,A,derived,452.0,5096.0,0.09,0.26,0.01,0.0,0.13,0.09,yes,0.0,3212427.0,A,C,1.0,0.00955,0.0,0.13112,0.25595,0.00696,0.11111,0.08384,A,chr1_3166567_3212428,A,C,C,1/1,1/1,1/1,1/1,A,0.0,0.0,0.01,0.0,11,3,11,23,0.01
8,chr1,3240652,C,T,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,PRDM16,0.87043,1.3471,0.081747,5.9522,-0.479,0.0,0.0,0.0,0.0,0.0,10,50,29,9,C,C,derived,26.0,5096.0,0.01,0.0,0.0,0.0,0.0,0.02,yes,0.004873,3240651.0,C,T,1.0,0.00318,0.00397,0.00144,0.0,0.0,0.18519,0.02045,T,chr1_3195401_3250112,C,,T,1/1,1/1,1/1,1/1,C,0.0,0.0,0.0,0.0,50,10,9,29,0.0
9,chr1,3240725,T,C,snv,0/1,0/0,0/0,0/1,True,False,False,True,Other,PRDM16,0.87043,1.3471,0.081747,5.9522,-1.193,0.0,0.0,0.0,0.0,0.0,-39,29,40,-2,T,T,derived,394.0,5096.0,0.08,0.16,0.01,0.0,0.14,0.12,yes,0.0,3240724.0,T,C,1.0,0.00955,0.0,0.13833,0.16171,0.00596,0.11111,0.11656,T,chr1_3240724_3259500,T,C,C,1/1,1/1,1/1,1/1,T,0.0,0.0,0.0,0.0,29,-39,-2,40,0.0


In [532]:
len(introgressed_ref_tag_merge[(introgressed_ref_tag_merge['delta_max_x']>=0.2) & (introgressed_ref_tag_merge['delta_max_y']>=0.2)])

24

In [533]:
introgressed_ref_tag_merge[(introgressed_ref_tag_merge['delta_max_x']>=0.2) & (introgressed_ref_tag_merge['delta_max_y']>=0.2)]

Unnamed: 0,chrom,pos,ref_allele_x,alt_allele_x,variant_type,altai_gt_x,chagyrskaya_gt_x,denisovan_gt_x,vindija_gt_x,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta_x,al_delta_x,dg_delta_x,dl_delta_x,delta_max_x,ag_pos_x,al_pos_x,dg_pos_x,dl_pos_x,ancestral_allele,temp,anc_dev,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,present_in_1KG,1KG_non_ASW_AFR_AF,start,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_haplotype_tag,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2,ref_allele_y,altai_gt_y,chagyrskaya_gt_y,denisovan_gt_y,vindija_gt_y,alt_allele_y,ag_delta_y,al_delta_y,dg_delta_y,dl_delta_y,ag_pos_y,al_pos_y,dg_pos_y,dl_pos_y,delta_max_y
305,chr1,52820680,C,T,snv,0/1,0/0,0/0,0/1,True,False,False,True,Other,CC2D1B,1.1262,-1.0089,0.69443,2.0388,-0.14,0.23,0.0,0.0,0.01,0.23,-22,-40,-22,34,C,C,derived,90.0,5096.0,0.02,0.03,0.0,0.0,0.0,0.05,yes,0.0,52820679.0,C,T,1.0,0.00637,0.0,0.0,0.02877,0.00497,0.0,0.05521,C,chr1_52325475_53062486,C,T,T,1/1,1/1,1/1,1/1,C,0.0,0.23,0.01,0.0,-40,-22,34,-22,0.23
492,chr1,152944501,A,T,snv,0/1,0/0,0/0,0/1,True,False,False,True,Other,SPRR4,1.4955,-1.106,0.71438,0.44285,0.383,0.23,0.0,0.0,0.0,0.23,13,2,12,-1,A,A,derived,2.0,5096.0,0.0,0.0,0.0,0.0,0.0,0.0,yes,0.0,152944500.0,A,T,1.0,0.0,0.0,0.0,0.00198,0.0,0.0,0.0,A,chr1_152942848_153033467,A,T,T,1/1,1/1,1/1,1/1,A,0.0,0.23,0.0,0.0,2,13,-1,12,0.23
575,chr1,183750131,C,A,snv,0/1,0/0,0/0,0/0,True,False,False,False,Altai,RGL1,0.70413,2.2238,0.17753,5.1161,0.655,0.0,0.0,0.26,0.01,0.26,-7,-11,-7,-11,C,C,derived,50.0,5096.0,0.01,0.05,0.0,0.0,0.0,0.0,yes,0.0,183750130.0,C,A,1.0,0.0,0.0,0.0,0.04861,0.0,0.01852,0.0,C,chr1_183411087_183827254,C,A,A,1/1,1/1,1/1,1/1,C,0.0,0.0,0.01,0.26,-11,-7,-11,-7,0.26
1018,chr10,17373518,T,C,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,ST8SIA6,1.0898,-0.43129,1.0086,-0.029733,0.533,0.05,0.22,0.0,0.0,0.22,33,-19,30,-19,T,T,derived,56.0,5096.0,0.01,0.0,0.04,0.0,0.02,0.0,yes,0.0,17373517.0,T,C,1.0,0.01274,0.0,0.01729,0.0,0.03479,0.01852,0.00307,T,chr10_17338795_17417730,T,C,C,1/1,1/1,1/1,1/1,T,0.22,0.05,0.0,0.0,-19,33,-19,30,0.22
1655,chr11,84191111,T,C,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,DLG2,0.74745,2.0762,0.21228,5.7123,-0.959,0.0,0.0,0.0,0.4,0.4,1,29,1,-1,T,T,derived,12.0,5096.0,0.0,0.0,0.0,0.0,0.0,0.01,yes,0.0,84191110.0,T,C,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01227,T,chr11_84051845_84211811,T,C,C,1/1,1/1,1/1,1/1,T,0.0,0.0,0.4,0.0,29,1,-1,1,0.4
1707,chr11,99345359,C,G,snv,0/1,0/1,0/0,1/1,True,True,False,True,Neanderthal,CNTN5,1.0393,-0.33557,0.3272,4.6242,-0.229,0.0,0.0,0.67,0.0,0.67,-4,48,-5,37,C,C,derived,177.0,5096.0,0.03,0.11,0.0,0.0,0.0,0.07,yes,0.0,99345358.0,C,G,1.0,0.00318,0.0,0.0,0.10615,0.0,0.11111,0.06748,C,chr11_99309226_99359652,C,G,G,1/1,1/1,1/1,1/1,C,0.0,0.0,0.0,0.67,48,-4,37,-5,0.67
3326,chr17,14109360,T,G,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,COX10,1.0316,-0.18011,0.29042,2.7283,0.338,0.2,0.04,0.0,0.0,0.2,1,38,50,-23,T,T,derived,155.0,5096.0,0.03,0.08,0.0,0.0,0.08,0.01,yes,0.0,14109359.0,T,G,1.0,0.00637,0.0,0.08501,0.08333,0.0,0.16667,0.00818,G,chr17_13902596_14111504,T,,G,1/1,1/1,1/1,1/1,T,0.04,0.21,0.0,0.0,38,1,-23,50,0.21
3577,chr18,60003587,G,A,snv,0/1,1/1,1/1,1/1,True,True,True,True,Shared,TNFRSF11A,0.80508,1.2894,0.33078,3.0497,0.65,0.0,0.31,0.0,0.28,0.31,3,-40,3,33,G,G,derived,444.0,5096.0,0.09,0.0,0.25,0.01,0.12,0.1,yes,0.003899,60003586.0,G,A,1.0,0.01592,0.00397,0.11671,0.00198,0.2495,0.09259,0.09714,A,chr18_59569521_60089689,G,A,A,1/1,1/1,1/1,1/1,G,0.3,0.0,0.29,0.0,-40,3,33,3,0.3
3645,chr19,33410289,T,C,snv,0/0,0/0,1/1,0/0,False,False,True,False,Denisovan,CEP89,0.93498,0.47493,0.93973,0.3644,-0.642,0.47,0.0,0.71,0.0,0.71,41,-29,5,-29,C,C,ancestral,4744.0,5096.0,0.93,0.9,0.86,0.99,0.93,0.96,yes,0.999025,33410288.0,C,T,1.0,0.03503,0.00099,0.06916,0.09821,0.14612,0.07407,0.0409,C,chr19_33040186_33456652,T,,C,1/1,1/1,1/1,1/1,T,0.0,0.47,0.0,0.71,-29,41,-29,5,0.71
4708,chr20,62224595,G,A,snv,0/1,1/1,0/0,0/1,True,True,False,True,Neanderthal,GMEB2,0.66048,2.202,0.049787,3.9462,-0.837,0.01,0.0,0.41,0.02,0.41,15,-44,2,-2,G,G,derived,1138.0,5096.0,0.22,0.51,0.21,0.01,0.11,0.31,yes,0.000975,62224594.0,G,A,1.0,0.03503,0.00099,0.10663,0.51488,0.21173,0.03704,0.30982,G,chr20_62195671_62229244,G,A,A,1/1,1/1,1/1,1/1,G,0.0,0.01,0.02,0.41,-44,15,-2,2,0.41
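In the rows above, the gain and loss scores appear to trade places between the original (`_x`) and swapped (`_y`) columns. For example, the `ag_delta_x` of 0.23 at chr1:52820680 matches an `al_delta_y` of 0.23. The sketch below is one way to measure how closely that mirroring holds. It uses a toy frame copied from three of the rows above rather than the full merge, and `mirror_gap` and `MIRROR_PAIRS` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy frame copied from three rows of the output above (not the full merge).
toy = pd.DataFrame({
    'chrom': ['chr1', 'chr17', 'chr18'],
    'pos': [52820680, 14109360, 60003587],
    'ag_delta_x': [0.23, 0.20, 0.00],
    'al_delta_x': [0.00, 0.04, 0.31],
    'dg_delta_x': [0.00, 0.00, 0.00],
    'dl_delta_x': [0.01, 0.00, 0.28],
    'ag_delta_y': [0.00, 0.04, 0.30],
    'al_delta_y': [0.23, 0.21, 0.00],
    'dg_delta_y': [0.01, 0.00, 0.29],
    'dl_delta_y': [0.00, 0.00, 0.00],
})

# Assumption: swapping ref and alt turns a gain into a loss and vice versa,
# so each original column is compared with its opposite in the swapped data.
MIRROR_PAIRS = [
    ('ag_delta_x', 'al_delta_y'),
    ('al_delta_x', 'ag_delta_y'),
    ('dg_delta_x', 'dl_delta_y'),
    ('dl_delta_x', 'dg_delta_y'),
]

def mirror_gap(df):
    """Largest absolute difference across the four mirrored pairs, per row."""
    gaps = pd.concat([(df[a] - df[b]).abs() for a, b in MIRROR_PAIRS], axis=1)
    return gaps.max(axis=1).round(2)

toy['mirror_gap'] = mirror_gap(toy)
print(toy[['chrom', 'pos', 'mirror_gap']])
```

A gap of zero means the swapped scores mirror the originals exactly. Small non-zero gaps matter only when a score sits near the 0.2 threshold.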


Let's identify the two positions that pass the delta threshold (≥ 0.2) in the original data but not after the reference and alternate alleles were swapped.

In [534]:
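# Left merge on position; '_merge' is 'left_only' where a position is absent from SAV_swapped
diffs = pd.merge(SAV_introgressed_ref_tag[['chrom','pos']], SAV_swapped[['chrom','pos']], on=['chrom','pos'], how='left', indicator=True)
diffs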

Unnamed: 0,chrom,pos,_merge
0,chr1,52820680,both
1,chr1,152944501,both
2,chr1,183750131,both
3,chr1,216945756,left_only
4,chr10,17373518,both
5,chr11,84191111,both
6,chr11,99345359,both
7,chr17,14109360,both
8,chr18,60003587,both
9,chr19,33410289,both
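The merge output above lists every position, so the ones missing from the swapped set still have to be picked out by eye. A minimal sketch of filtering on the `_merge` indicator follows. It uses toy frames built from three of the positions above; `original_calls`, `swapped_calls`, and `left_only` are hypothetical names standing in for the notebook's dataframes.

```python
import pandas as pd

# Toy stand-ins for SAV_introgressed_ref_tag and SAV_swapped (positions only).
original_calls = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr10'],
    'pos': [183750131, 216945756, 17373518],
})
swapped_calls = pd.DataFrame({
    'chrom': ['chr1', 'chr10'],
    'pos': [183750131, 17373518],
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
merged = pd.merge(original_calls, swapped_calls,
                  on=['chrom', 'pos'], how='left', indicator=True)

# Keep only positions present in the original set but absent after swapping.
left_only = (merged.loc[merged['_merge'] == 'left_only', ['chrom', 'pos']]
                   .reset_index(drop=True))
print(left_only)
```

Applied to the full `diffs` dataframe, the same filter should return just the positions examined in the next cells.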


Let's look at the actual deltas. First up is chr1: 216,945,756.

In [535]:
swapped[(swapped['chrom'] == 'chr1') & (swapped['pos'] == 216945756)]

Unnamed: 0,chrom,pos,ref_allele,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,alt_allele,annotation,ag_delta,al_delta,dg_delta,dl_delta,ag_pos,al_pos,dg_pos,dl_pos,delta_max
3553,chr1,216945756,A,1/1,1/1,1/1,1/1,C,ESRRG,0.0,0.19,0.0,0.0,47,-27,48,-27,0.19


In [536]:
SAV_introgressed_ref_tag[(SAV_introgressed_ref_tag['chrom'] == 'chr1') & (SAV_introgressed_ref_tag['pos'] == 216945756)]

Unnamed: 0,chrom,pos,ref_allele,alt_allele,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,ancestral_allele,temp,anc_dev,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,present_in_1KG,1KG_non_ASW_AFR_AF,start,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_haplotype_tag,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2
800,chr1,216945756,C,A,snv,0/1,0/1,0/0,1/1,True,True,False,True,Neanderthal,ESRRG,0.56557,2.5557,0.28379,3.0516,0.655,0.2,0.0,0.0,0.0,0.2,-27,47,-27,48,C,C,derived,169.0,5096.0,0.03,0.14,0.0,0.0,0.01,0.02,yes,0.0,216945755.0,C,A,1.0,0.0,0.0,0.01153,0.13988,0.0,0.0,0.0184,C,chr1_216944862_216964892,C,A


Now the other position: chr7: 157,177,273.

In [537]:
swapped[(swapped['chrom'] == 'chr7') & (swapped['pos'] == 157177273)]

Unnamed: 0,chrom,pos,ref_allele,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,alt_allele,annotation,ag_delta,al_delta,dg_delta,dl_delta,ag_pos,al_pos,dg_pos,dl_pos,delta_max
8039,chr7,157177273,T,1/1,1/1,1/1,1/1,C,DNAJB6,0.0,0.0,0.03,0.16,12,-2,12,-2,0.16


In [538]:
SAV_introgressed_ref_tag[(SAV_introgressed_ref_tag['chrom'] == 'chr7') & (SAV_introgressed_ref_tag['pos'] == 157177273)]

Unnamed: 0,chrom,pos,ref_allele,alt_allele,variant_type,altai_gt,chagyrskaya_gt,denisovan_gt,vindija_gt,altai_gt_boolean,chagyrskaya_gt_boolean,denisovan_gt_boolean,vindija_gt_boolean,distribution,annotation,mis_oe,mis_z,lof_oe,lof_z,phyloP,ag_delta,al_delta,dg_delta,dl_delta,delta_max,ag_pos,al_pos,dg_pos,dl_pos,ancestral_allele,temp,anc_dev,1KG_allele_count,1KG_allele_number,1KG_allele_frequency,1KG_EAS_AF,1KG_EUR_AF,1KG_AFR_AF,1KG_AMR_AF,1KG_SAS_AF,present_in_1KG,1KG_non_ASW_AFR_AF,start,Vernot_ancestral_allele,Vernot_derived_allele,Vernot_ancestral_derived_code,Vernot_AFA_AF,Vernot_AFR_AF,Vernot_AMR_AF,Vernot_EAS_AF,Vernot_EUR_AF,Vernot_PNG_AF,Vernot_SAS_AF,Vernot_Denisovan_base,Vernot_haplotype_tag,Vernot_Neanderthal_base_1,Vernot_Neanderthal_base_2
7341,chr7,157177273,C,T,snv,0/1,1/1,0/0,1/1,True,True,False,True,Neanderthal,DNAJB6,0.76273,1.1745,0.15377,3.4636,0.455,0.0,0.0,0.31,0.0,0.31,-2,46,-2,2,C,C,derived,57.0,5096.0,0.01,0.05,0.0,0.0,0.0,0.0,yes,0.0,157177272.0,C,T,1.0,0.0,0.0,0.0,0.05357,0.0,0.0,0.00204,C,chr7_157116267_157177286,C,T
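To put the two comparisons side by side, the sketch below tabulates the `delta_max` values shown above against the 0.2 threshold used earlier. The values are copied by hand from the four outputs. The frame name `pair` and the column names `delta_max_original`, `delta_max_swapped`, `passes_original`, `passes_swapped`, and `delta_change` are hypothetical and used only for illustration.

```python
import pandas as pd

THRESHOLD = 0.2  # same cutoff as the delta_max filters earlier in this section

# delta_max values copied from the outputs above for the two positions.
pair = pd.DataFrame({
    'chrom': ['chr1', 'chr7'],
    'pos': [216945756, 157177273],
    'annotation': ['ESRRG', 'DNAJB6'],
    'delta_max_original': [0.20, 0.31],
    'delta_max_swapped': [0.19, 0.16],
})

# Flag which call of each variant meets the threshold.
pair['passes_original'] = pair['delta_max_original'] >= THRESHOLD
pair['passes_swapped'] = pair['delta_max_swapped'] >= THRESHOLD

# Size of the drop in delta_max after swapping the alleles.
pair['delta_change'] = (pair['delta_max_original']
                        - pair['delta_max_swapped']).round(2)
print(pair)
```

Both positions meet the threshold with the original alleles and fall below it once the alleles are swapped. The ESRRG variant misses by 0.01, whereas the DNAJB6 variant drops by 0.15.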
