# Compare GATK and bcftools

There are two tools that can do this filtering on read depth and allele balance:
- gatk
  - the one that Xiaomeng used earlier
- bcftools
  - the promised child, the one vcf reader to rule them all

Each one comes with tradeoffs:
- gatk needs only about 10 lines of code
- gatk takes 3x longer to run
- gatk is a pain to install
- gatk outputs a full VCF file with lots of extra data instead of just the variant list
- bcftools is easy to install
- bcftools has little to offer in manuals or help
- bcftools requires additional code to be written to run it and process its output
- bcftools does not parallelize by default
  - opportunities for speedup via parallelization
- bcftools, even without parallelization, is much faster than gatk
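
A minimal sketch of what the "additional code" plus parallelization could look like, assuming an indexed, bgzipped VCF and one `bcftools query` call per region. The helper names (`split_region`, `build_query_cmd`, `run_parallel`) and the chunking scheme are hypothetical, not the code used in this project.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_region(chrom, start, end, n_chunks):
    """Split [start, end] (1-based, inclusive) into contiguous region strings."""
    step = -(-(end - start + 1) // n_chunks)  # ceiling division
    regions = []
    for lo in range(start, end + 1, step):
        hi = min(lo + step - 1, end)
        regions.append(f"{chrom}:{lo}-{hi}")
    return regions


def build_query_cmd(vcf_path, region):
    """Build a bcftools query command: variant ID, then one DP value per sample."""
    return ["bcftools", "query", "-r", region, "-f", r"%ID[\t%DP]\n", vcf_path]


def run_parallel(vcf_path, regions, workers=4):
    """Run one bcftools process per region and concatenate the text output."""
    def run(region):
        result = subprocess.run(
            build_query_cmd(vcf_path, region),
            check=True, capture_output=True, text=True,
        )
        return result.stdout

    # Threads are enough here: the heavy lifting happens in the bcftools processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return "".join(pool.map(run, regions))
```

Records that overlap a chunk boundary may be reported by two regions, so the combined output should probably be deduplicated by ID before any counting.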

I believe they might give different results because the problem statement can be interpreted in different ways:
- at least 1 sample has allele balance greater than XX
  - is it the ref allele balance or the alt allele balance?
  - how are poly-allelic snps considered?
- minimum read depth of XX
  - is this the minimum read depth across all samples?
  - is this the average read depth?
  - different papers do different things
    - original paper makes it seem like it is average
    - other people recommend to take the minimum
    - the Goncalo paper replaces low read depths with missing and imputes using TOPMed
    - Antonio proposes the Speliotes metric: "take the 1st quartile"
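
To make the ambiguity concrete, here is a small sketch on made-up numbers showing how the same variant can pass or fail depending on which interpretation is used. The function names, the toy depths, and the thresholds (7 for depth, 0.3 for allele balance) are assumptions for illustration only.

```python
import numpy as np


def depth_summaries(dp):
    """Summarize per-sample read depths three ways: mean, minimum, 1st quartile."""
    dp = np.asarray(dp, dtype=float)
    return {
        "mean": float(dp.mean()),
        "min": float(dp.min()),
        "q1": float(np.percentile(dp, 25)),
    }


def allele_balance(ref_reads, alt_reads):
    """Return (ref fraction, alt fraction) for one sample, or None with no reads."""
    total = ref_reads + alt_reads
    if total == 0:
        return None
    return ref_reads / total, alt_reads / total


# Toy per-sample depths for a single variant (not real data).
summaries = depth_summaries([0, 4, 8, 12, 16])
passes_depth = {name: value >= 7 for name, value in summaries.items()}

# Toy allele depths for one heterozygous sample: 6 ref reads, 2 alt reads.
ref_ab, alt_ab = allele_balance(6, 2)
passes_ab = {"ref": ref_ab >= 0.3, "alt": alt_ab >= 0.3}
```

Here the variant passes on average depth but fails on minimum depth and on the first quartile, and it passes on ref allele balance but fails on alt allele balance. Multi-allelic sites would need one alt fraction per alt allele, which this sketch does not cover.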
    
    
This notebook looks at the output for 1 batch from chromosome 21 to determine whether there is any overlap between the gatk approach and my bcftools approach.

In [1]:
import pandas as pd

In [7]:
gatk = pd.read_table("../code/gatk_variants.out", header=0, names=["CHR", "POS", "ID"])
gatk

Unnamed: 0,CHR,POS,ID
0,chr21,46356886,chr21_46356886_A_T
1,chr21,46356887,chr21_46356887_T_C
2,chr21,46356895,chr21_46356895_C_T
3,chr21,46356896,chr21_46356896_G_A
4,chr21,46356899,chr21_46356899_C_G
...,...,...,...
12402,chr21,46664462,chr21_46664462_C_T
12403,chr21,46664463,chr21_46664463_C_G
12404,chr21,46664464,chr21_46664464_T_G
12405,chr21,46664470,chr21_46664470_G_A


In [8]:
bcf_py = pd.read_table("../code/bcf_processed.txt",header = 0,names = ["ID","AVG_DP","MIN_DP","best_AB"])
bcf_py

Unnamed: 0,ID,AVG_DP,MIN_DP,best_AB
0,chr21_46356886_A_T,8.309819,0,0.285714
1,chr21_46356887_T_C,8.285483,0,0.400000
2,chr21_46356895_C_T,9.735128,0,0.375000
3,chr21_46356896_G_A,10.274537,0,0.200000
4,chr21_46356899_C_G,10.448055,0,0.500000
...,...,...,...,...
14749,chr21_46664462_C_T,9.873032,0,0.250000
14750,chr21_46664463_C_G,9.828587,0,0.500000
14751,chr21_46664464_T_G,9.652552,0,0.285714
14752,chr21_46664470_G_A,9.211125,0,0.500000


In [53]:
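# Multi-allelic records carry several IDs joined by ";".
# Split them so every allele gets its own entry.
gatk_ids = set()
multi_allelic = set()
for record_id in gatk.ID:
    ids = record_id.split(";")
    gatk_ids.update(ids)
    if len(ids) > 1:
        multi_allelic.update(ids)
print(len(multi_allelic))
len(gatk_ids)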

4247


14754
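
The same ID splitting can also be sketched with pandas string methods instead of an explicit loop. The toy IDs below are made up; only the `;`-joined format is taken from the gatk output.

```python
import pandas as pd

# Toy stand-in for the gatk table: the second record is multi-allelic.
toy = pd.DataFrame(
    {"ID": ["chr21_100_A_T", "chr21_200_C_G;chr21_200_C_T", "chr21_300_G_A"]}
)

# One list of IDs per record.
parts = toy.ID.str.split(";")

# Every individual ID, and the subset that came from multi-allelic records.
all_ids = set(parts.explode())
multi_ids = set(parts[parts.str.len() > 1].explode())
```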

Well, this looks wrong: gatk doesn't get rid of any variants. We can tell from the bcftools results that some variants should be removed due to low allele balance and insufficient read depth.

It could be that the vlist file lists every variant, both those that pass and those that fail the criteria. Yes: after checking, the gatk file is not the final output of this filter. It then gets unzipped and the header information is read into R for further processing (which is where the filter gets evaluated).

In [57]:
# Parse the final gatk VCF by hand, one dict per variant record.
gatk_variants = []
header = None
with open("../code/gatk_final_output.txt", "r") as f:
    for line in f:
        if line.startswith("##"):
            continue  # skip meta-information lines
        if line.startswith("#"):
            # column header line (#CHROM, POS, ID, ...)
            header = line.strip().split("\t")
        else:
            gatk_variants.append(dict(zip(header, line.strip().split("\t"))))
gatk = pd.DataFrame(gatk_variants)
gatk

Unnamed: 0,#CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT
0,chr21,46356881,chr21_46356881_C_T,C,T,43,PASS,AF=1e-06;AQ=43,GT:AD:DP:FT:GQ:PL:RNC
1,chr21,46356886,chr21_46356886_A_T,A,T,46,PASS,AF=5e-06;AQ=46,GT:AD:DP:FT:GQ:PL:RNC
2,chr21,46356887,chr21_46356887_T_C,T,C,38,AllGtsFiltered,AF=1e-06;AQ=38,GT:AD:DP:FT:GQ:PL:RNC
3,chr21,46356895,chr21_46356895_C_T,C,T,50,PASS,AF=2.9e-05;AQ=50,GT:AD:DP:FT:GQ:PL:RNC
4,chr21,46356896,chr21_46356896_G_A,G,A,43,PASS,AF=6e-06;AQ=43,GT:AD:DP:FT:GQ:PL:RNC
...,...,...,...,...,...,...,...,...,...
12403,chr21,46664462,chr21_46664462_C_T,C,T,47,PASS,AF=1e-06;AQ=47,GT:AD:DP:FT:GQ:PL:RNC
12404,chr21,46664463,chr21_46664463_C_G,C,G,38,PASS,AF=1e-06;AQ=38,GT:AD:DP:FT:GQ:PL:RNC
12405,chr21,46664464,chr21_46664464_T_G,T,G,47,PASS,AF=1e-06;AQ=47,GT:AD:DP:FT:GQ:PL:RNC
12406,chr21,46664470,chr21_46664470_G_A,G,A,45,PASS,AF=2e-06;AQ=45,GT:AD:DP:FT:GQ:PL:RNC


In [58]:
set(gatk.FILTER)

{'AllGtsFiltered', 'AllGtsFiltered;AlleleBalance', 'AlleleBalance', 'PASS'}

In [60]:
# Keep only records whose FILTER is exactly "PASS", splitting multi-allelic IDs.
gatk_variants_passed = []
for record_id in gatk[gatk.FILTER == "PASS"].ID:
    gatk_variants_passed.extend(record_id.split(";"))
print(len(gatk_variants_passed))
gatk_ids = set(gatk_variants_passed)

14652


In [47]:
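# IDs look like chrom_pos_ref_alt; a SNP has a single-base ref and a single-base alt.
snp = []
for variant_id in bcf_py.ID:
    chrom, pos, ref, alt = variant_id.split("_")
    snp.append(len(ref) == 1 and len(alt) == 1)
bcf_py["SNP"] = snp
sum(bcf_py.SNP)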

13655

In [48]:
# useful tinkering
# sum(bcf_py[bcf_py.SNP].best_AB >= 0.15)
# sum(bcf_py[bcf_py.SNP].AVG_DP >= 7)
# sum(map(all,zip(bcf_py[~bcf_py.SNP].AVG_DP >= 10,bcf_py[~bcf_py.SNP].best_AB >= 0.2)))

((bcf_py.AVG_DP >= 7) & (bcf_py.best_AB >= 0.15)).sum()

14589

In [88]:
# Apply all three bcftools-side criteria: average depth, minimum depth, allele balance.
passes = (bcf_py.AVG_DP >= 7) & (bcf_py.MIN_DP >= 1) & (bcf_py.best_AB >= 0.15)
bcf_ids = set(bcf_py[passes].ID)
print("gatk ids: {}\nbcf ids: {}".format(len(gatk_ids), len(bcf_ids)))

gatk ids: 14652
bcf ids: 10936


In [86]:
gatk_elim = set(bcf_py.ID) - gatk_ids
bcf_elim = set(bcf_py.ID) - bcf_ids
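
The set arithmetic above splits the variants into four groups. A small helper can make that bookkeeping explicit; this is a sketch on toy IDs, and `compare_filters` is a hypothetical name, not something used elsewhere in this notebook.

```python
def compare_filters(all_ids, kept_a, kept_b):
    """Partition all_ids by which of two filters (a, b) eliminated each variant."""
    all_ids = set(all_ids)
    elim_a = all_ids - set(kept_a)
    elim_b = all_ids - set(kept_b)
    return {
        "eliminated_by_both": elim_a & elim_b,
        "eliminated_by_a_only": elim_a - elim_b,
        "eliminated_by_b_only": elim_b - elim_a,
        "kept_by_both": all_ids - (elim_a | elim_b),
    }


# Toy example: a = gatk, b = bcftools.
groups = compare_filters(
    all_ids=["v1", "v2", "v3", "v4"],
    kept_a=["v1", "v2"],
    kept_b=["v1", "v3"],
)
counts = {name: len(ids) for name, ids in groups.items()}
```

The four groups never overlap and together cover every ID, so the counts should always add up to the number of variants in the bcftools table.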

In [87]:
bcf_py[bcf_py.ID.isin(bcf_elim & gatk_elim)]

Unnamed: 0,ID,AVG_DP,MIN_DP,best_AB,SNP,ELIM,INTEREST,KEEP,GATK_ELIM
1,chr21_46356887_T_C,8.285483,0,0.400000,True,True,False,True,True
152,chr21_46357235_T_C,11.353105,0,0.600000,True,True,False,True,True
170,chr21_46357274_G_A,7.061645,0,0.666667,True,True,False,True,True
177,chr21_46363419_C_G,12.121438,0,0.333333,True,True,False,True,True
433,chr21_46366498_T_G,8.618328,0,0.400000,True,True,False,True,True
...,...,...,...,...,...,...,...,...,...
14473,chr21_46662002_A_ATGGGGCGCGCAGGAGGGGG,6.689206,0,0.666667,False,False,False,True,True
14477,chr21_46662019_G_C,5.135337,0,0.250000,True,False,False,True,True
14482,chr21_46662033_G_A,3.103009,0,1.000000,True,False,False,True,True
14636,chr21_46664200_C_G,9.176449,0,0.800000,True,True,False,True,True


In [79]:
gatk[gatk.FILTER.str.contains("AlleleBalance")]

Unnamed: 0,#CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT
1843,chr21,46398353,chr21_46398353_CTT_C;chr21_46398354_T_C;chr21_...,CTT,"C,CCT,TTT",48,AlleleBalance,"AF=1e-06,1e-06,1e-06;AQ=48,45,41",GT:AD:DP:FT:GQ:PL:RNC
1845,chr21,46399500,chr21_46399500_G_T,G,T,37,AllGtsFiltered;AlleleBalance,AF=1e-06;AQ=37,GT:AD:DP:FT:GQ:PL:RNC
1961,chr21,46399895,chr21_46399895_C_T,C,T,45,AlleleBalance,AF=1e-06;AQ=45,GT:AD:DP:FT:GQ:PL:RNC
3179,chr21,46418404,chr21_46418404_C_G,C,G,41,AlleleBalance,AF=1e-06;AQ=41,GT:AD:DP:FT:GQ:PL:RNC
3439,chr21,46426067,chr21_46426069_C_CT;chr21_46426069_CTT_C;chr21...,TTCTTTTTTTTTTTTTTTTTT,"TTCTTTTTTTTTTTTTTTTTTT,TTCTTTTTTTTTTTTTTTT,TTC...",50,AlleleBalance,"AF=0.000365,0.000284,0.000168,7.4e-05,6.4e-05,...",GT:AD:DP:FT:GQ:PL:RNC
3771,chr21,46429962,chr21_46429962_G_A,G,A,26,AlleleBalance,AF=2e-06;AQ=26,GT:AD:DP:FT:GQ:PL:RNC
4725,chr21,46436916,chr21_46436916_T_C,T,C,40,AllGtsFiltered;AlleleBalance,AF=1e-06;AQ=40,GT:AD:DP:FT:GQ:PL:RNC
5191,chr21,46441175,chr21_46441175_G_C,G,C,38,AllGtsFiltered;AlleleBalance,AF=2e-06;AQ=38,GT:AD:DP:FT:GQ:PL:RNC
6477,chr21,46529251,chr21_46529251_A_C,A,C,35,AllGtsFiltered;AlleleBalance,AF=1e-06;AQ=35,GT:AD:DP:FT:GQ:PL:RNC
6847,chr21,46537134,chr21_46537134_T_C;chr21_46537134_T_A,T,"C,A",44,AlleleBalance,"AF=1e-06,1e-06;AQ=44,35",GT:AD:DP:FT:GQ:PL:RNC


In [77]:
bcf_py[bcf_py.ID.isin(bcf_elim - gatk_elim)]

Unnamed: 0,ID,AVG_DP,MIN_DP,best_AB,SNP,ELIM,INTEREST,KEEP,GATK_ELIM
171,chr21_46357279_G_A,6.462973,0,0.800000,True,False,False,True,False
172,chr21_46357281_A_G,6.044018,0,0.250000,True,False,False,True,False
173,chr21_46357289_G_A,5.759522,0,0.545455,True,False,False,True,False
359,chr21_46363883_C_T,63.403350,13,0.000000,True,False,False,False,False
360,chr21_46363882_G_A,63.403350,13,0.000000,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...
14476,chr21_46662011_G_A,5.984520,0,0.285714,True,False,False,True,False
14478,chr21_46662022_T_G,4.603525,0,0.285714,True,False,False,True,False
14479,chr21_46662025_G_A,3.905752,0,0.666667,True,False,False,True,False
14480,chr21_46662026_A_G,3.476127,0,0.750000,True,False,False,True,False


In [62]:
# AVG_DP >= 0 effectively disables the depth filter (depths are non-negative),
# so KEEP reflects the allele-balance threshold only.
bcf_py["KEEP"] = (bcf_py.AVG_DP >= 0) & (bcf_py.best_AB >= 0.15)
eliminated_from_gatk_but_not_bcftools = set(bcf_py[bcf_py.KEEP].ID) - gatk_ids
bcf_py["GATK_ELIM"] = bcf_py.ID.isin(eliminated_from_gatk_but_not_bcftools)
bcf_py[bcf_py.GATK_ELIM]

Unnamed: 0,ID,AVG_DP,MIN_DP,best_AB,SNP,ELIM,INTEREST,KEEP,GATK_ELIM
1,chr21_46356887_T_C,8.285483,0,0.400000,True,True,False,True,True
152,chr21_46357235_T_C,11.353105,0,0.600000,True,True,False,True,True
170,chr21_46357274_G_A,7.061645,0,0.666667,True,True,False,True,True
177,chr21_46363419_C_G,12.121438,0,0.333333,True,True,False,True,True
433,chr21_46366498_T_G,8.618328,0,0.400000,True,True,False,True,True
...,...,...,...,...,...,...,...,...,...
14473,chr21_46662002_A_ATGGGGCGCGCAGGAGGGGG,6.689206,0,0.666667,False,False,False,True,True
14477,chr21_46662019_G_C,5.135337,0,0.250000,True,False,False,True,True
14482,chr21_46662033_G_A,3.103009,0,1.000000,True,False,False,True,True
14636,chr21_46664200_C_G,9.176449,0,0.800000,True,True,False,True,True


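Yeah, gatk is not removing the same rows: every variant listed above passes the bcftools allele-balance check (KEEP is True) but is still eliminated by gatk.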
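
One way to dig into this would be to cross-tabulate the bcftools decision against the gatk FILTER value for each variant. The sketch below uses made-up tables with the same column names as `gatk` and `bcf_py`; the labels `kept`, `dropped`, and `not_in_gatk` are my own.

```python
import pandas as pd

# Toy stand-ins for the two tables (not real data).
gatk_toy = pd.DataFrame({
    "ID": ["v1", "v2;v3", "v4"],
    "FILTER": ["PASS", "AlleleBalance", "AllGtsFiltered"],
})
bcf_toy = pd.DataFrame({
    "ID": ["v1", "v2", "v3", "v4"],
    "KEEP": [True, False, True, True],
})

# One row per individual ID, each carrying its record's FILTER value.
per_id = gatk_toy.assign(ID=gatk_toy.ID.str.split(";")).explode("ID")

# Attach the gatk FILTER to every bcftools variant.
merged = bcf_toy.merge(per_id, on="ID", how="left")
merged["FILTER"] = merged.FILTER.fillna("not_in_gatk")

# Rows: bcftools decision. Columns: gatk FILTER value.
bcftools_call = merged.KEEP.map({True: "kept", False: "dropped"})
table = pd.crosstab(bcftools_call, merged.FILTER)
```

In the toy table, the `kept` row shows which gatk filters are responsible for variants that bcftools would have retained.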

In [56]:
print(len(set(bcf_py[~bcf_py.KEEP].ID)))
len(set(bcf_py[~bcf_py.KEEP].ID) - multi_allelic)

165


113