# Compare MendelImpute against Minimac4 and Beagle5 on simulated data

Here the genotype matrix is simulated from real reference haplotypes on chromosome 22 of the 1000 Genomes Phase 1 data. The entire dataset can be [downloaded here](ftp://share.sph.umich.edu/minimac3/G1K_P1_VCF_Files.tar.gz) (4.15GB).


In [1]:
using Revise
using VCFTools
using MendelImpute
using GeneticVariation
using Random
using SparseArrays
using Plots

┌ Info: Precompiling MendelImpute [e47305d1-6a61-5370-bc5d-77554d143183]
└ @ Base loading.jl:1273


In [7]:
vcffile = "compare4/VCF_Files/ALL.chr22.phase1_v3.snps_indels_svs.genotypes.all.noSingleton.vcf.gz"
@show nsamples(vcffile)
@show nrecords(vcffile)

nsamples(vcffile) = 1092
nrecords(vcffile) = 365644


365644

# Simulate data

Reference haplotypes are obtained from the 1000 Genomes Project as above. We use this haplotype matrix to construct a genotype matrix with 100 samples. Entries of the genotype matrix are then masked at random (each with probability 0.1) to generate missing values. The data are stored in 3 files:

+ `ALL.chr22.phase1_v3.snps_indels_svs.genotypes.all.noSingleton.vcf.gz`: haplotype reference file
+ `target.vcf.gz`: complete genotype information
+ `target_masked.vcf.gz`: the same as `target.vcf.gz` except some entries are masked
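
The simulation and masking steps described above can be sketched on a small toy example. This is only an illustration under stated assumptions: `simulate_from_haplotypes` is a hypothetical helper (not the `simulate_genotypes` function used in this notebook, whose internals are not shown here), the toy haplotype matrix is random rather than real data, and the SNPs × haplotypes orientation is chosen for convenience.

```julia
using Random

# Hypothetical stand-in for a haplotype matrix: rows are SNPs, columns are haplotypes.
Random.seed!(2020)
snps, nhaps = 1_000, 50
H = rand(Bool, snps, nhaps)

# Assumed simulation scheme: each sample's genotype is the sum of two randomly
# chosen haplotype columns, giving allele counts in {0, 1, 2}.
function simulate_from_haplotypes(H::AbstractMatrix{Bool}, people::Int)
    X = Matrix{Int}(undef, size(H, 1), people)
    for j in 1:people
        h1, h2 = rand(1:size(H, 2), 2)
        X[:, j] .= H[:, h1] .+ H[:, h2]
    end
    return X
end

X = simulate_from_haplotypes(H, 100)

# Mask each entry independently with probability `missingprop`.
missingprop = 0.1
Xm = ifelse.(rand(size(X)...) .< missingprop, missing, X)

# Realized fraction of masked entries; should be close to `missingprop`.
observed_missing = count(ismissing, Xm) / length(Xm)
```

Because every entry is masked independently, the realized missing fraction fluctuates slightly around `missingprop` rather than matching it exactly.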

In [2]:
# import haplotype matrix and simulate genotype matrix
@time H = convert_ht(Bool, "compare4/VCF_Files/ALL.chr22.phase1_v3.snps_indels_svs.genotypes.all.noSingleton.vcf.gz")
@time X = simulate_genotypes(H', 100)

# randomly mask entries
Random.seed!(2020)
missingprop = 0.1
snps, people = size(X)
@time Xm = ifelse.(rand(snps, people) .< missingprop, missing, X);

 52.578988 seconds (813.83 M allocations: 73.839 GiB, 17.03% gc time)
  1.787696 seconds (1.44 M allocations: 477.925 MiB, 1.58% gc time)
  0.850494 seconds (807.98 k allocations: 914.419 MiB, 2.87% gc time)


In [42]:
# extract CHROM, POS, ID, REF, ALT info from ref VCF file
vcffile = "compare4/VCF_Files/ALL.chr22.phase1_v3.snps_indels_svs.genotypes.all.noSingleton.vcf.gz"
records = nrecords(vcffile)
chrom   = ["22" for i in 1:records]
pos     = zeros(Int, records)
ids     = Vector{String}(undef, records)
refs    = Vector{String}(undef, records)
alts    = Vector{String}(undef, records)
reader  = VCF.Reader(openvcf(vcffile, "r"))
for (i, record) in enumerate(reader)
    pos[i]   = VCF.pos(record)
    ids[i]   = VCF.id(record)[1]
    refs[i]  = VCF.ref(record)
    alts[i]  = VCF.alt(record)[1]
end

# write simulated genotype matrices to disk
@time make_tgtvcf_file(X, vcffilename="./compare4/target.vcf.gz", marker_chrom=chrom, 
      marker_pos=pos, marker_ID=ids, marker_REF=refs, marker_ALT=alts)
@time make_tgtvcf_file(Xm, vcffilename="./compare4/target_masked.vcf.gz", marker_chrom=chrom, 
      marker_pos=pos, marker_ID=ids, marker_REF=refs, marker_ALT=alts)

  7.101095 seconds (805.33 k allocations: 37.253 MiB)
 10.362518 seconds (733.25 k allocations: 33.556 MiB, 0.19% gc time)


# MendelImpute error

In [4]:
# read the complete target data (all entries observed) and the masked target data
X_complete = convert_gt(Float32, "./compare4/target.vcf.gz"; as_minorallele=false)
Xm = convert_gt(Float32, "./compare4/target_masked.vcf.gz"; as_minorallele=false)
Xm_original = copy(Xm)

100×365644 Array{Union{Missing, Float32},2}:
 0.0       0.0       0.0       0.0       …  0.0       0.0       0.0     
  missing  0.0       0.0       0.0          0.0       0.0       0.0     
 0.0        missing  0.0       0.0          0.0        missing  0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0        missing     0.0       0.0        missing
 0.0       1.0       0.0       0.0       …  0.0       0.0        missing
 0.0       1.0       0.0       0.0          0.0       0.0        missing
 0.0       0.0       0.0       0.0          0.0        missing  0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0       0.0       0.0     
 0.0        missing   missing  0.0       …  0.0       0.0       0.0     
 0.0       0.0        missing  0.0          0.0       0.0       0.0     
 0.0       0.0       0.0       0.0          0.0        missing  0.0     
 ⋮    

### Fastest code with flanking windows

In [8]:
# impute
tgtfile = "./compare4/target_masked.vcf.gz"
reffile = "./compare4/VCF_Files/ALL.chr22.phase1_v3.snps_indels_svs.genotypes.all.noSingleton.vcf.gz"
outfile = "./compare4/imputed_target.vcf.gz"
width   = 400
@time hs, ph = phase(tgtfile, reffile, impute=true, outfile = outfile, width = width);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile, as_minorallele=false)
n, p = size(X_mendel)
error_rate = sum(X_complete .!= X_mendel) / p / n

366.124572 seconds (1.08 G allocations: 104.805 GiB, 3.45% gc time)


7.559265296299132e-5

### Most accurate code with flanking windows

In [15]:
# impute
tgtfile = "./compare4/target_masked.vcf.gz"
reffile = "./compare4/VCF_Files/ALL.chr22.phase1_v3.snps_indels_svs.genotypes.all.noSingleton.vcf.gz"
outfile = "./compare4/imputed_target.vcf.gz"
width   = 400
@time hs, ph = phase(tgtfile, reffile, impute=true, outfile = outfile, width = width);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile, as_minorallele=false)
n, p = size(X_mendel)
error_rate = sum(X_complete .!= X_mendel) / p / n

356.042449 seconds (1.11 G allocations: 105.721 GiB, 3.22% gc time)


7.181849011606919e-5
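
The error rates above are taken over all entries of the genotype matrix, including the roughly 90% that were never masked. A rate restricted to the masked entries can be more informative. The sketch below shows both on toy data; `overall_error` and `masked_error` are hypothetical helper names, and the toy matrices are made up for illustration rather than taken from this notebook.

```julia
# Fraction of all entries where the imputed genotype differs from the truth.
overall_error(X_true, X_imputed) = sum(X_true .!= X_imputed) / length(X_true)

# Fraction of originally masked entries that were imputed incorrectly.
function masked_error(X_true, X_masked, X_imputed)
    mask = ismissing.(X_masked)
    return sum(X_true[mask] .!= X_imputed[mask]) / count(mask)
end

# Toy data (hypothetical values)
X_true    = Float32[0 1 2; 1 0 0]
X_masked  = Union{Missing, Float32}[0 missing 2; missing 0 missing]
X_imputed = Float32[0 1 2; 1 0 1]   # one of the three masked entries is wrong

overall_error(X_true, X_imputed)            # 1 error out of 6 entries
masked_error(X_true, X_masked, X_imputed)   # 1 error out of 3 masked entries
```

If the imputation output leaves observed genotypes unchanged (an assumption, not something verified here), the masked-entry error rate is approximately the overall rate divided by the masking proportion.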

# Beagle 5 error

Beagle requires >14GB of RAM to run, so it was run on the Hoffman2 cluster.

In [None]:
# following took 24 minutes 22 seconds on n7080 of Hoffman2 using 30GB
run(`java -Xmx14g -jar beagle.28Sep18.793.jar gt=./compare4/target_masked.vcf.gz ref=./compare4/VCF_Files/ALL.chr22.phase1_v3.snps_indels_svs.genotypes.all.noSingleton.vcf.gz out=./compare4/beagle.result`)

In [11]:
# beagle 5 error rate
X_complete = convert_gt(Float32, "./compare4/target.vcf.gz"; as_minorallele=false)
X_beagle = convert_gt(Float32, "./compare4/beagle.result.vcf.gz", as_minorallele=false)
n, p = size(X_complete)
error_rate = sum(X_complete .!= X_beagle) / p / n

6.837251534279245e-7

# Minimac4 error

The reference VCF file must first be converted to the m3vcf format using Minimac3 (run on Hoffman2):

```Julia
minimac3 = "/u/home/b/biona001/haplotype_comparisons/minimac3/Minimac3/bin/Minimac3"
@time run(`$minimac3 --refHaps ALL.chr22.phase1_v3.snps_indels_svs.genotypes.all.noSingleton.vcf.gz --processReference --prefix haplo_ref`)

#on n4034: 2675.055412 seconds (15.31 k allocations: 861.116 KiB)
```

In [44]:
# run minimac 4
minimac4 = "/Users/biona001/Benjamin_Folder/UCLA/research/softwares/Minimac4/build/minimac4"
run(`$minimac4 --refHaps ./compare4/ALL.chr22.phase1_v3.snps_indels_svs.genotypes.all.noSingleton.m3vcf.gz --haps ./compare4/target_masked.vcf.gz --prefix ./compare4/minimac4.result`)



 -------------------------------------------------------------------------------- 
          Minimac4 - Fast Imputation Based on State Space Reduction HMM
 --------------------------------------------------------------------------------
           (c) 2014 - Sayantan Das, Christian Fuchsberger, David Hinds
                             Mary Kate Wing, Goncalo Abecasis 

 Version: 1.0.2;
 Built: Mon Sep 30 11:52:22 PDT 2019 by biona001

 Command Line Options: 
       Reference Haplotypes : --refHaps [./compare4/ALL.chr22.phase1_v3.snps_indels_svs.genotypes.all.noSingleton.m3vcf.gz],
                              --passOnly, --rsid, --referenceEstimates [ON],
                              --mapFile [docs/geneticMapFile.b38.map.txt.gz]
          Target Haplotypes : --haps [./compare4/target_masked.vcf.gz]
          Output Parameters : --prefix [./compare4/minimac4.result],
                              --estimate, --nobgzip, --vcfBuffer [200],
                              --format [GT,D

Process(`/Users/biona001/Benjamin_Folder/UCLA/research/softwares/Minimac4/build/minimac4 --refHaps ./compare4/ALL.chr22.phase1_v3.snps_indels_svs.genotypes.all.noSingleton.m3vcf.gz --haps ./compare4/target_masked.vcf.gz --prefix ./compare4/minimac4.result`, ProcessExited(0))

In [2]:
X_complete = convert_gt(Float32, "./compare4/target.vcf.gz"; as_minorallele=false)
X_minimac = convert_gt(Float32, "./compare4/minimac4.result.dose.vcf.gz", as_minorallele=false)
n, p = size(X_complete)
error_rate = sum(X_complete .!= X_minimac) / n / p

0.011691317237531588
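
For a side-by-side view, the error rates reported in the cells above can be collected and sorted. The numbers below are copied from this notebook's outputs; the `results` variable and the labels are just illustrative names.

```julia
# Error rates copied from the cell outputs in this notebook
results = [
    ("MendelImpute (fastest)",       7.559265296299132e-5),
    ("MendelImpute (most accurate)", 7.181849011606919e-5),
    ("Beagle 5",                     6.837251534279245e-7),
    ("Minimac4",                     0.011691317237531588),
]

# Sort by error rate, lowest first
ranked = sort(results, by = last)

for (name, err) in ranked
    println(rpad(name, 30), err)
end
```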