In [1]:
import pandas as pd
from pysam import FastaFile

In [2]:
hg19 = FastaFile('/mnt/solexa/Genomes/hg19_evan/whole_genome.fa')

Read the table of archaic admixture array data generated from the two files from David

In [3]:
old_snps = pd.read_table('../tmp/i.tsv')
# fetch the alleles from the hg19 reference genome
old_snps['r'] = [hg19.fetch(region='{}:{}-{}'.format(chrom, pos, pos)) for chrom, pos in zip(old_snps.chrom, old_snps.pos)]

Read the table of archaic admixture array data generated from the EIGENSTRAT data published online

In [4]:
new_snps = pd.read_table('../tmp/t.tsv')
# fetch the alleles from the hg19 reference genome
new_snps['r'] = [hg19.fetch(region='{}:{}-{}'.format(chrom, pos, pos)) for chrom, pos in zip(new_snps.chrom, new_snps.pos)]
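
Both tables use the same list comprehension, so it could be factored into a small helper. The sketch below is hypothetical: `fetch_ref_alleles` and the stub `FakeFasta` are not part of this notebook. It assumes only an object with a pysam-style `fetch(region=...)` method, and it upper-cases the result in case the FASTA is soft-masked (lowercase bases).

```python
import pandas as pd


def fetch_ref_alleles(snps, fasta):
    """Return the reference base at each (chrom, pos) of `snps` as a list.

    `fasta` is any object with a pysam-style fetch(region='chrom:start-end')
    method; region strings use 1-based, inclusive coordinates.
    """
    return [
        fasta.fetch(region='{}:{}-{}'.format(chrom, pos, pos)).upper()
        for chrom, pos in zip(snps.chrom, snps.pos)
    ]


class FakeFasta:
    """Hypothetical stand-in for pysam.FastaFile, used only for this demo."""

    def __init__(self, seqs):
        self.seqs = seqs

    def fetch(self, region):
        chrom, span = region.split(':')
        start, end = (int(x) for x in span.split('-'))
        # convert the 1-based inclusive interval to a Python slice
        return self.seqs[chrom][start - 1:end]


# Toy coordinates and sequences (made up for illustration)
toy = pd.DataFrame({'chrom': [1, 1, 2], 'pos': [1, 4, 2]})
toy['r'] = fetch_ref_alleles(toy, FakeFasta({'1': 'ACGt', '2': 'GgC'}))
print(toy)
```

With the real data the call would presumably look like `old_snps['r'] = fetch_ref_alleles(old_snps, hg19)`.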

## Why don't the "ref"/"alt" columns sometimes match the true hg19 allele?

What is the proportion of 0/1/9 values in the Href "sample"?

In [5]:
sum(old_snps.Href == 0) / len(old_snps), sum(old_snps.Href == 1) / len(old_snps), sum(old_snps.Href == 9) / len(old_snps)

(0.99046682089064342, 0.0094996521578784143, 3.3526951478119475e-05)
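
The three proportions can also be obtained in a single call. A minimal sketch on a toy `Href` column (the values are made up, not the real data), using `value_counts(normalize=True)`:

```python
import pandas as pd

# Toy stand-in for old_snps.Href (values are made up for illustration)
href = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 9, 0], name='Href')

# normalize=True returns fractions of the total instead of raw counts
proportions = href.value_counts(normalize=True)
print(proportions)
```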

What are the Href values at sites where the fetched reference allele **doesn't** match either of the allele columns in David's and Qiaomei's data?

In [6]:
old_snps.query('r != ref & r != alt').Href.value_counts()

9    32
Name: Href, dtype: int64

What is the total count of missing (9) Href values?

In [7]:
old_snps.Href.value_counts()

0    945357
1      9067
9        32
Name: Href, dtype: int64

OK, so all 32 missing Href values correspond to the 32 sites where the true reference doesn't match either of the two alleles in the snp file. Perhaps these are triallelic positions that somehow escaped the filtering due to some bug?
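
Comparing the two counts shows that they agree, but the stronger claim (that the two sets of sites are identical) can be checked directly by comparing boolean masks. A minimal sketch on a toy table with the same column names (the rows are made up):

```python
import pandas as pd

# Toy stand-in for old_snps (rows are made up for illustration)
toy = pd.DataFrame({
    'ref':  ['A', 'G', 'T', 'C'],
    'alt':  ['G', 'C', 'G', 'T'],
    'r':    ['G', 'G', 'C', 'T'],   # third row matches neither allele
    'Href': [0, 1, 9, 0],
})

# Sites where the fetched hg19 allele matches neither listed allele
mismatch = (toy.r != toy.ref) & (toy.r != toy.alt)
# Sites where Href is coded as missing
missing = toy.Href == 9

# True only if both masks flag exactly the same rows
same_sites = bool((mismatch == missing).all())
print(same_sites)
```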

In any case, these sites should definitely be removed. The query below lists them:

In [22]:
old_snps.query('Href == 9')[['chrom', 'pos', 'ref', 'alt', 'r']]

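| | chrom | pos | ref | alt | r |
|---|---|---|---|---|---|
| 56415 | 1 | 194238557 | A | G | T |
| 63749 | 1 | 214486345 | G | C | T |
| 64487 | 1 | 216154223 | T | G | C |
| 65880 | 1 | 220214623 | G | A | C |
| 167218 | 3 | 8213832 | A | T | C |
| 216271 | 3 | 144867812 | A | G | C |
| 232706 | 3 | 188486067 | A | C | T |
| 233068 | 3 | 189480568 | A | C | T |
| 247954 | 4 | 28021832 | G | T | C |
| 315555 | 5 | 16290023 | T | G | A |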


#### Conclusion

There are 32 "missing" Href sites in total, and these are exactly the ones where the ref/alt columns in the genosnp file differ from the "true" allele in the hg19 genome.

It's hard to say why these sites are listed as missing in the reference genome when they are not really missing. Presumably there is a bug in a script that generated the snp files.
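
Since these sites should be removed, one way to drop them is to keep only rows whose fetched hg19 allele equals one of the two listed alleles. A minimal sketch on a toy table (the rows are made up; `clean` is a hypothetical name):

```python
import pandas as pd

# Toy stand-in for old_snps (rows are made up for illustration)
toy = pd.DataFrame({
    'chrom': [1, 1, 3],
    'pos':   [100, 200, 300],
    'ref':   ['A', 'G', 'A'],
    'alt':   ['G', 'C', 'T'],
    'r':     ['G', 'G', 'C'],   # last row matches neither allele
    'Href':  [0, 1, 9],
})

# Keep only sites whose hg19 allele is one of the two listed alleles
clean = toy.query('(r == ref) | (r == alt)').reset_index(drop=True)
print(clean)
```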

## "ref" column defined by EIGENSTRAT is not always the same as the true hg19 allele

The true reference allele often corresponds to the alt column in the file rather than the ref column. In fact, this is the case at the vast majority of sites:

In [60]:
old_snps.query('ref == r').Href.value_counts()

1    9067
Name: Href, dtype: int64

In [61]:
old_snps.query('alt == r').Href.value_counts()

0    945357
Name: Href, dtype: int64

These counts are from David's old two-part file. The numbers are the same in the latest file published on their website:

In [66]:
len(new_snps.query('ref == r'))

9067

In [67]:
len(new_snps.query('alt == r'))

945357

Also, these two cases (true reference matches the ref column or the alt column) correspond exactly to the 1s and 0s in the Href column, respectively.

Perhaps this happened when two datasets in different formats were merged? In any case, it should be easy to fix by using the fetched hg19 alleles to correct the "flipped" ref/alt columns.
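
A minimal sketch of such a correction, on a toy table with made-up rows: wherever the alt column carries the hg19 allele, the two labels are swapped. This only relabels the allele columns. The genotype columns are deliberately left untouched here, because whether and how they would need recoding depends on what the 0/1/2 values actually count, which is an open question at this point.

```python
import pandas as pd

# Toy stand-in for the snp table (rows are made up for illustration)
toy = pd.DataFrame({
    'ref': ['T', 'A', 'C'],
    'alt': ['C', 'T', 'G'],
    'r':   ['C', 'A', 'G'],   # hg19 allele fetched from the FASTA
})

# Sites where the hg19 allele sits in the alt column
flipped = toy.alt == toy.r

fixed = toy.copy()
# Series.where keeps the value where the condition is True, else takes `other`
fixed['ref'] = toy.alt.where(flipped, toy.ref)
fixed['alt'] = toy.ref.where(flipped, toy.alt)
print(fixed)
```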

**Moreover**, it's clear from this that the "ref" column doesn't actually contain hg19 reference alleles. The columns must somehow have been flipped (if they are really supposed to be in the EIGENSTRAT format).

### Distribution of 0/1/9 at sites where "true" ref allele equals the "alt" in the snp file

In [75]:
new_snps.query('alt == r').Ust_Ishim.value_counts()

0    886905
1     46818
9      9186
2      2448
Name: Ust_Ishim, dtype: int64

In [47]:
new_snps.query('alt == r & Ust_Ishim == 2')[['chrom', 'pos', 'ref', 'alt', 'r', 'Ust_Ishim']].head(3)

| | chrom | pos | ref | alt | r | Ust_Ishim |
|---|---|---|---|---|---|---|
| 1690 | 1 | 4965738 | A | C | C | 2 |
| 1691 | 1 | 4965740 | G | C | C | 2 |
| 1696 | 1 | 4965748 | G | C | C | 2 |


In [48]:
%%bash 
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:4965738
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:4965740
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:4965748

1	4965738	.	C	A	38.39	.	AC=2;AF=1;AN=2;DP=2;Dels=0;FS=0;HRun=0;HaplotypeScore=0;MQ=31.58;MQ0=0;QD=19.19;UR;TS=HPGOMC;TSseq=C,C,C,C,C,C;CAnc=C;GAnc=C;OAnc=C;bSC=978;mSC=0;pSC=0.072;GRP=-3.14;Map20=1	GT:DP:GQ:PL:A:C:G:T:IR	1/1:2:6.02:70,6,0:1,1:0,0:0,0:0,0:0
1	4965740	.	C	G	38.39	.	AC=2;AF=1;AN=2;DP=2;Dels=0;FS=0;HRun=0;HaplotypeScore=0;MQ=31.58;MQ0=0;QD=19.19;UR;TS=HPGOMC;TSseq=C,C,C,C,C,C;CAnc=C;GAnc=C;OAnc=C;bSC=978;mSC=0;pSC=0.06;GRP=-5.14;Map20=1	GT:DP:GQ:PL:A:C:G:T:IR	1/1:2:6.02:70,6,0:0,0:0,0:1,1:0,0:0
1	4965748	.	C	G	86.55	.	AC=2;AF=1;AN=2;DP=3;Dels=0;FS=0;HRun=0;HaplotypeScore=0;MQ=37;MQ0=0;QD=28.85;UR;TS=HPGOMC;TSseq=C,C,C,C,C,C;CAnc=C;GAnc=C;OAnc=C;bSC=978;mSC=0.003;pSC=0.038;GRP=0.602;Map20=1	GT:DP:GQ:PL:A:C:G:T:IR	1/1:3:9.03:119,9,0:0,0:0,0:1,2:0,0:0
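
To compare such records with the snp table without reading them by eye, the REF, ALT and GT fields can be pulled out of each VCF line. The sketch below is a hypothetical helper, not part of the original analysis. It assumes tab-separated, single-sample VCF lines and uses an abbreviated copy of the first record above, with the INFO and FORMAT fields shortened.

```python
def parse_vcf_line(line):
    """Extract CHROM, POS, REF, ALT and the GT of the first sample."""
    fields = line.rstrip('\n').split('\t')
    chrom, pos, _id, ref, alt = fields[:5]
    # FORMAT keys (column 9) label the colon-separated sample values (column 10)
    sample = dict(zip(fields[8].split(':'), fields[9].split(':')))
    return {'chrom': chrom, 'pos': int(pos), 'ref': ref, 'alt': alt,
            'gt': sample['GT']}


def copies_of(allele, record):
    """Count copies of `allele` in the genotype (None if the call is missing)."""
    alleles = [record['ref']] + record['alt'].split(',')
    calls = record['gt'].replace('|', '/').split('/')
    if '.' in calls:
        return None
    return sum(alleles[int(i)] == allele for i in calls)


# Abbreviated version of the 1:4965738 record (INFO shortened to 'AC=2')
line = '1\t4965738\t.\tC\tA\t38.39\t.\tAC=2\tGT:DP:GQ\t1/1:2:6.02'
rec = parse_vcf_line(line)
print(copies_of('A', rec))  # copies of the snp-file "ref" allele
print(copies_of('C', rec))  # copies of the hg19 allele
```

At this site the genotype carries two copies of A, the allele in the snp-file "ref" column, which coincides with the value 2 in the Ust_Ishim column.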


In [49]:
new_snps.query('alt == r & Ust_Ishim == 0')[['chrom', 'pos', 'ref', 'alt', 'r', 'Ust_Ishim']].head(3)

| | chrom | pos | ref | alt | r | Ust_Ishim |
|---|---|---|---|---|---|---|
| 0 | 1 | 847983 | T | C | C | 0 |
| 1 | 1 | 853089 | C | G | G | 0 |
| 2 | 1 | 853596 | G | A | A | 0 |


In [50]:
%%bash 
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:847983
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:853089
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:853596

1	847983	rs79932038	C	.	101.35	.	AC=0;AF=0;AN=2;DP=37;MQ=36.87;MQ0=0;1000gALT=T;AF1000g=0;ASN_AF=0.01;CpG;RM;TS=HP;TSseq=C,C;CAnc=C;bSC=958;mSC=0.058;pSC=0.166;GRP=1.42;Map20=1	GT:DP:GQ:PL:A:C:G:T:IR	0/0:37:71.36:0,71,1385:0,0:16,20:0,0:1,0:0
1	853089	rs78738176	G	.	72.13	.	AC=0;AF=0;AN=2;DP=15;MQ=37;MQ0=0;1000gALT=C;AF1000g=0;ASN_AF=0.01;TS=HP;TSseq=G,G;CAnc=G;bSC=958;pSC=0.067;GRP=0.427;Map20=1	GT:DP:GQ:PL:A:C:G:T:IR	0/0:15:42.14:0,42,552:0,0:0,0:6,9:0,0:0
1	853596	rs191666748	A	.	81.15	.	AC=0;AF=0;AN=2;DP=21;MQ=37;MQ0=0;1000gALT=G;AF1000g=0;ASN_AF=0.01;TS=HP;TSseq=A,G;CAnc=A;bSC=958;pSC=0.048;GRP=1.5;Map20=1	GT:DP:GQ:PL:A:C:G:T:IR	0/0:21:51.16:0,51,654:8,13:0,0:0,0:0,0:0


In [51]:
new_snps.query('alt == r & Ust_Ishim == 1')[['chrom', 'pos', 'ref', 'alt', 'r', 'Ust_Ishim']].head(3)

| | chrom | pos | ref | alt | r | Ust_Ishim |
|---|---|---|---|---|---|---|
| 30 | 1 | 1125110 | C | T | T | 1 |
| 177 | 1 | 1891394 | T | C | C | 1 |
| 224 | 1 | 1969741 | A | G | G | 1 |


In [52]:
%%bash 
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:1125110
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:1891394
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:1969741

1	1125110	rs12124436	T	C	609.72	.	AC=1;AF=0.5;AN=2;BaseQRankSum=0.894;DP=40;Dels=0;FS=0;HRun=0;HaplotypeScore=1.9971;MQ=37.75;MQ0=0;MQRankSum=-1.815;QD=15.24;ReadPosRankSum=-1.002;1000gALT=C;AF1000g=0.24;AFR_AF=0.02;AMR_AF=0.3;ASN_AF=0.56;EUR_AF=0.12;RM;TS=HPGOM;TSseq=T,T,T,T,N;CAnc=T;GAnc=T;OAnc=T;bSC=878;mSC=0.048;pSC=0.271;GRP=0;Map20=0.25	GT:DP:GQ:PL:A:C:G:T:IR	0/1:40:99:640,0,720:0,0:11,8:0,0:12,9:0
1	1891394	rs142334243	C	T	201.87	.	AC=1;AF=0.5;AN=2;BaseQRankSum=-0.058;DP=15;Dels=0;FS=2.081;HRun=1;HaplotypeScore=0;MQ=37;MQ0=0;MQRankSum=-0.058;QD=13.46;ReadPosRankSum=-0.289;1000gALT=T;AF1000g=0;AFR_AF=0.02;TS=HPGOMC;TSseq=C,-,C,T,T,T;CAnc=C;GAnc=C;OAnc=T;bSC=873	GT:DP:GQ:PL:A:C:G:T:IR	0/1:15:99:232,0,274:0,0:5,3:0,0:3,4:0
1	1969741	rs76295110	G	A	77.54	.	AC=1;AF=0.5;AN=2;BaseQRankSum=0.832;DP=32;Dels=0;FS=11.826;HRun=0;HaplotypeScore=0.9997;MQ=37;MQ0=0;MQRankSum=-2.283;QD=2.42;ReadPosRankSum=0.886;1000gALT=A;AF1000g=0.01;ASN_AF=0.03;CpG;RM;TS=HPGOMC;TSseq=G,G,G,A,A,A;CAnc=G;GAnc=G

### Distribution of 0/1/9 at sites where "true" ref allele equals the "ref" in the snp file

In [68]:
new_snps.query('ref == r').Ust_Ishim.value_counts()

0    7562
1    1237
2     171
9      97
Name: Ust_Ishim, dtype: int64
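
The two distributions (hg19 allele in the "alt" column versus in the "ref" column) can also be tabulated side by side. A minimal sketch on a toy table with made-up rows, using `pd.crosstab`; the `hg19_is` column is a hypothetical helper label:

```python
import pandas as pd

# Toy stand-in for new_snps (rows are made up for illustration)
toy = pd.DataFrame({
    'ref': ['A', 'T', 'C', 'G', 'A'],
    'alt': ['C', 'C', 'G', 'A', 'G'],
    'r':   ['C', 'T', 'G', 'G', 'G'],
    'Ust_Ishim': [0, 2, 9, 0, 1],
})

# Label each site by which snp-file column carries the hg19 allele
toy['hg19_is'] = 'neither'
toy.loc[toy.ref == toy.r, 'hg19_is'] = 'ref'
toy.loc[toy.alt == toy.r, 'hg19_is'] = 'alt'

# Rows: which column matches hg19; columns: genotype codes
table = pd.crosstab(toy.hg19_is, toy.Ust_Ishim)
print(table)
```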

In [69]:
new_snps.query('ref == r & Ust_Ishim == 2')[['chrom', 'pos', 'ref', 'alt', 'r', 'Ust_Ishim']].head(3)

| | chrom | pos | ref | alt | r | Ust_Ishim |
|---|---|---|---|---|---|---|
| 2726 | 1 | 7383183 | T | A | T | 2 |
| 2727 | 1 | 7383184 | T | A | T | 2 |
| 18632 | 1 | 58222892 | T | G | T | 2 |


Let's look at the first three sites:

In [40]:
%%bash 
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:7383183
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:7383184
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:58222892

1	7383183	rs6686160	T	.	183.49	.	AC=0;AF=0;AN=2;DP=52;MQ=37;MQ0=0;1000gALT=A;AF1000g=0.75;AFR_AF=0.88;AMR_AF=0.75;ASN_AF=0.82;EUR_AF=0.61;RM;UR;TS=HPGOMC;TSseq=T,T,T,T,TTTTTTA,T;CAnc=T;GAnc=T;OAnc=T;bSC=957;mSC=0.001;pSC=0.033;GRP=1.26;Map20=1	GT:DP:GQ:PL:A:C:G:T:IR	0/0:52:99:0,153,2014:0,0:0,0:0,0:26,26:0
1	7383184	rs76405761	T	.	186.5	.	AC=0;AF=0;AN=2;DP=52;MQ=37;MQ0=0;1000gALT=A;AF1000g=0.75;AFR_AF=0.91;AMR_AF=0.75;ASN_AF=0.82;EUR_AF=0.61;RM;UR;TS=HPGOMC;TSseq=T,A,T,A,A,A;CAnc=T;GAnc=T;OAnc=A;bSC=957;mSC=0.001;pSC=0.03;GRP=-2.03;Map20=1	GT:DP:GQ:PL:A:C:G:T:IR	0/0:52:99:0,157,2055:0,0:0,0:0,0:26,26:0
1	58222892	rs3118052	T	.	63.1	.	AC=0;AF=0;AN=2;DP=14;MQ=37.53;MQ0=0;1000gALT=G,TG;AF1000g=1;AFR_AF=1;AMR_AF=1;ASN_AF=1;EUR_AF=1;RM;UR;TS=HPGOMC;TSseq=T,G,G,G,G,A;CAnc=G;GAnc=G;OAnc=G;bSC=939;mSC=0;pSC=0.007;GRP=-1.46;Map20=1	GT:DP:GQ:PL:A:C:G:T:IR	0/0:14:33.11:0,33,436:0,0:0,0:0,0:7,7:0
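
The hand-written `bcftools` calls could instead be generated from the site coordinates. The sketch below only builds the command and does not run it; `bcftools_view_command` is a hypothetical helper, and it relies on `bcftools view` accepting several regions as positional arguments after the (indexed) VCF file.

```python
import shlex


def bcftools_view_command(vcf_path, sites):
    """Build a `bcftools view -H` command for a list of (chrom, pos) sites."""
    regions = ['{}:{}'.format(chrom, pos) for chrom, pos in sites]
    return ['bcftools', 'view', '-H', vcf_path] + regions


cmd = bcftools_view_command(
    '/mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz',
    [(1, 7383183), (1, 7383184), (1, 58222892)],
)
# Print a shell-safe version of the command for copy-pasting
print(' '.join(shlex.quote(part) for part in cmd))
```

The list form of `cmd` could be passed to `subprocess.run` on a machine where `bcftools` and the VCF are available.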


In [41]:
new_snps.query('ref == r & Ust_Ishim == 0')[['chrom', 'pos', 'ref', 'alt', 'r', 'Ust_Ishim']].head(3)

| | chrom | pos | ref | alt | r | Ust_Ishim |
|---|---|---|---|---|---|---|
| 393 | 1 | 2256245 | A | T | A | 0 |
| 719 | 1 | 2998934 | T | C | T | 0 |
| 720 | 1 | 2999202 | T | C | T | 0 |


Let's look at the first three sites:

In [42]:
%%bash 
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:2256245
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:2998934
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:2999202

1	2256245	rs2843148	A	T	1392.14	.	AC=2;AF=1;AN=2;DP=36;Dels=0;FS=0;HRun=0;HaplotypeScore=0;MQ=37;MQ0=0;QD=38.67;1000gALT=T;AF1000g=0.78;AFR_AF=0.96;AMR_AF=0.6;ASN_AF=0.9;EUR_AF=0.64;TS=HPGOMC;TSseq=A,T,T,T,-,T;CAnc=T;GAnc=T;OAnc=T;bSC=923;mSC=0.002;pSC=0.011;GRP=-1.64;Map20=1	GT:DP:GQ:PL:A:C:G:T:IR	1/1:36:99:1425,108,0:0,0:0,0:0,0:19,17:0
1	2998934	rs2742686	T	C	907.37	.	AC=2;AF=1;AN=2;DP=24;Dels=0;FS=0;HRun=1;HaplotypeScore=0;MQ=38.24;MQ0=0;QD=37.81;1000gALT=C;AF1000g=0.92;AFR_AF=0.99;AMR_AF=0.9;ASN_AF=0.97;EUR_AF=0.84;UR;TS=HPGOMC;TSseq=T,C,C,C,C,C;CAnc=C;GAnc=C;OAnc=C;bSC=980;mSC=0;pSC=0.012;GRP=-2.09;Map20=1	GT:DP:GQ:PL:A:C:G:T:IR	1/1:24:72.22:940,72,0:0,0:11,13:0,0:0,0:0
1	2999202	rs2742685	T	C	793.27	.	AC=2;AF=1;AN=2;DP=21;Dels=0;FS=0;HRun=0;HaplotypeScore=0;MQ=37;MQ0=0;QD=37.77;1000gALT=C;AF1000g=0.93;AFR_AF=0.99;AMR_AF=0.92;ASN_AF=1;EUR_AF=0.84;CpG;UR;TS=HPGOMC;TSseq=T,C,C,C,C,T;CAnc=C;GAnc=C;OAnc=C;bSC=981;mSC=0;pSC=0.005;GRP=-4.11;Map20=1	GT:DP:GQ:PL:A:C:G:T:IR	1/1:21:63.2:82

In [43]:
new_snps.query('ref == r & Ust_Ishim == 1')[['chrom', 'pos', 'ref', 'alt', 'r', 'Ust_Ishim']].head(3)

| | chrom | pos | ref | alt | r | Ust_Ishim |
|---|---|---|---|---|---|---|
| 2795 | 1 | 7549220 | A | G | A | 1 |
| 4997 | 1 | 14126453 | C | T | C | 1 |
| 5668 | 1 | 15433638 | A | C | A | 1 |


Let's look at the first three sites:

In [44]:
%%bash 
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:7549220
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:14126453
bcftools view -H /mnt/454/Ust_Ishim/1_Extended_VCF/Ust_Ishim.hg19_1000g.1.mod.vcf.gz 1:15433638

1	7549220	rs1725272	A	G	407.57	.	AC=1;AF=0.5;AN=2;BaseQRankSum=1.474;DP=28;Dels=0;FS=0;HRun=2;HaplotypeScore=0;MQ=39.09;MQ0=0;MQRankSum=0.691;QD=14.56;ReadPosRankSum=-0.553;1000gALT=G;AF1000g=0.74;AFR_AF=0.94;AMR_AF=0.66;ASN_AF=0.8;EUR_AF=0.61;CpG;TS=HPGOMC;TSseq=A,G,G,G,G,G;CAnc=G;GAnc=G;OAnc=G;bSC=955;mSC=0;pSC=0.015;GRP=-1.84;Map20=1	GT:DP:GQ:PL:A:C:G:T:IR	0/1:28:99:438,0,495:4,11:0,0:4,9:0,0:0
1	14126453	rs2014788	C	T	963.19	.	AC=1;AF=0.5;AN=2;BaseQRankSum=0.156;DP=56;Dels=0;FS=2.277;HRun=2;HaplotypeScore=2.7881;MQ=37.36;MQ0=0;MQRankSum=0.312;QD=17.2;ReadPosRankSum=-0.729;1000gALT=T;AF1000g=0.64;AFR_AF=0.95;AMR_AF=0.62;ASN_AF=0.54;EUR_AF=0.52;RM;UR;TS=HPGOMC;TSseq=C,T,T,T,T,T;CAnc=T;GAnc=T;OAnc=T;bSC=824;mSC=0.001;pSC=0.032;GRP=0.926;Map20=1	GT:DP:GQ:PL:A:C:G:T:IR	0/1:56:99:993,0,829:0,1:15,10:0,0:16,14:0
1	15433638	rs545416	A	C	479.2	.	AC=1;AF=0.5;AN=2;BaseQRankSum=1.002;DP=41;Dels=0;FS=0;HRun=0;HaplotypeScore=3.9061;MQ=36.64;MQ0=0;MQRankSum=1.216;QD=11.69;ReadPosRankSum=1.697;100