# Compare MendelImpute against Minimac4 and Beagle5 on simulated data

The data used here are simulated. The simulation code can be found [here](https://github.com/biona001/MendelImpute/blob/flanking_window/src/simulate_utilities.jl#L179).

In [1]:
using Revise
using VCFTools
using MendelImpute
using GeneticVariation
using Random
using SparseArrays
using Plots

└ @ Revise /Users/biona001/.julia/packages/Revise/439di/src/Revise.jl:1108
┌ Info: Precompiling MendelImpute [e47305d1-6a61-5370-bc5d-77554d143183]
└ @ Base loading.jl:1273


# Simulate data

We simulate 1,000 reference haplotypes, each with 3,000 SNPs. We then use this haplotype matrix to construct a genotype matrix with 100 samples and 3,000 SNPs. Entries of the genotype matrix are randomly masked to generate missing values. The results are stored in 3 files:

+ `haplo_ref.vcf.gz`: haplotype reference file
+ `target.vcf.gz`: complete genotype information
+ `target_masked.vcf.gz`: the same as `target.vcf.gz` except some entries are masked
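
The construction described above can be sketched on a toy scale without MendelImpute. This is only an illustration under simplifying assumptions: `toy_simulate` is a hypothetical helper, haplotypes are drawn as independent coin flips rather than from the Markov model used in `simulate_utilities.jl`, and each sample is formed by summing two randomly chosen haplotypes.

```julia
using Random

# Toy illustration only: NOT the implementation in simulate_utilities.jl.
# Haplotypes are i.i.d. 0/1 draws here; the notebook uses a Markov model instead.
function toy_simulate(snps::Int, haps::Int, people::Int, missingprop::Float64;
                      rng::AbstractRNG = MersenneTwister(2020))
    H = rand(rng, Bool, snps, haps)            # snps × haps haplotype matrix
    X = Matrix{Float32}(undef, snps, people)   # snps × people genotype matrix
    for j in 1:people
        # a genotype is the sum of two haplotypes, so entries are 0, 1, or 2
        h1, h2 = rand(rng, 1:haps), rand(rng, 1:haps)
        X[:, j] .= H[:, h1] .+ H[:, h2]
    end
    # mask each entry independently with probability `missingprop`
    mask = rand(rng, snps, people) .< missingprop
    Xm = ifelse.(mask, missing, X)
    return H, X, Xm, mask
end

H, X, Xm, mask = toy_simulate(50, 20, 10, 0.1)
```

Returning `mask` alongside `Xm` keeps a record of which entries were hidden, which is convenient when accuracy is later evaluated only on the masked positions.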

In [13]:
# For snps = 3000, haps = 1000, people = 100, the error rates (and run times) are:
# MendelImpute (slow) = 0.00421 (2.3 sec), MendelImpute (fast) = 0.00471 (1.4 sec), beagle = 0.04188 (8 sec), minimac4 = 0.0301 (33 + 21 sec)
snps   = 3000
haps   = 1000
people = 100

# full haplotype and genotype matrix
@time H = simulate_markov_haplotypes(snps, haps)
@time X = simulate_genotypes(H, people)

# randomly mask entries
Random.seed!(2020)
missingprop = 0.1
@time Xm = ifelse.(rand(snps, people) .< missingprop, missing, X)

# write 3 files to disk
@time make_refvcf_file(H, vcffilename="./compare3/haplo_ref.vcf.gz")
@time make_tgtvcf_file(X, vcffilename="./compare3/target.vcf.gz")
@time make_tgtvcf_file(Xm, vcffilename="./compare3/target_masked.vcf.gz")

  0.040264 seconds (7 allocations: 366.531 KiB)
  0.002332 seconds (4.83 k allocations: 2.887 MiB)
  0.014382 seconds (23 allocations: 7.153 MiB, 71.58% gc time)
  1.245475 seconds (10.55 M allocations: 528.691 MiB, 2.64% gc time)
  0.137681 seconds (24.05 k allocations: 1.265 MiB)
  0.140974 seconds (24.05 k allocations: 1.265 MiB)


# MendelImpute error

In [15]:
# read original target data (all entries observed)
X_complete = convert_gt(Float32, "./compare3/target.vcf.gz"; as_minorallele=false)
Xm = convert_gt(Float32, "./compare3/target_masked.vcf.gz"; as_minorallele=false)
Xm_original = copy(Xm)

100×3000 Array{Union{Missing, Float32},2}:
 1.0       1.0       0.0       1.0       …  1.0       2.0       2.0     
 0.0        missing  0.0       1.0           missing   missing  1.0     
 1.0       2.0       2.0       2.0          2.0       1.0       1.0     
 1.0       1.0       1.0       0.0          1.0       1.0       1.0     
  missing  1.0       2.0       1.0          2.0       2.0       1.0     
  missing  1.0       1.0       2.0       …  0.0       0.0       0.0     
 2.0       1.0        missing  2.0          2.0       1.0       2.0     
 1.0       1.0       0.0        missing     1.0       1.0       1.0     
 1.0       0.0       0.0       0.0          2.0       2.0        missing
 1.0       1.0       1.0       1.0          1.0       1.0       1.0     
 2.0       1.0       2.0       0.0       …  0.0       1.0       1.0     
 1.0       1.0       0.0       0.0          2.0       2.0       1.0     
 1.0       0.0       1.0       0.0          0.0       1.0       1.0     
 ⋮      

### Fastest code with flanking windows

In [21]:
# impute
tgtfile = "./compare3/target_masked.vcf.gz"
reffile = "./compare3/haplo_ref.vcf.gz"
outfile = "./compare3/imputed_target.vcf.gz"
width   = 400
@time hs, ph = phase(tgtfile, reffile, impute=true, outfile = outfile, width = width);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile, as_minorallele=false)
p, n = size(X_mendel)
error_rate = sum(X_complete .!= X_mendel) / p / n

  1.400644 seconds (5.30 M allocations: 499.802 MiB)


0.004706666666666666

### Most accurate code with flanking windows

In [19]:
# impute
tgtfile = "./compare3/target_masked.vcf.gz"
reffile = "./compare3/haplo_ref.vcf.gz"
outfile = "./compare3/imputed_target.vcf.gz"
width   = 400
@time hs, ph = phase(tgtfile, reffile, impute=true, outfile = outfile, width = width);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile, as_minorallele=false)
p, n = size(X_mendel)
error_rate = sum(X_complete .!= X_mendel) / p / n

  1.855045 seconds (5.33 M allocations: 527.062 MiB, 2.43% gc time)


0.00411
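
The error rates above are computed over all entries of the genotype matrix, including the roughly 90% that were never masked (`missingprop = 0.1`). As a complementary view, the sketch below restricts the comparison to the entries that were actually missing in the input. `masked_error_rate` is a hypothetical helper written for this illustration, not a MendelImpute function.

```julia
# Hypothetical helper (not part of MendelImpute): error rate restricted to the
# entries that were `missing` in the masked input matrix.
function masked_error_rate(X_true::AbstractMatrix, X_imputed::AbstractMatrix,
                           X_masked::AbstractMatrix)
    size(X_true) == size(X_imputed) == size(X_masked) ||
        throw(DimensionMismatch("all three matrices must have the same size"))
    mask = ismissing.(X_masked)      # true where the input entry was hidden
    nmasked = count(mask)
    nmasked == 0 && return 0.0       # nothing was masked, so nothing to score
    # isequal treats a still-missing imputed entry as an error instead of propagating missing
    nwrong = count(i -> !isequal(X_true[i], X_imputed[i]), findall(mask))
    return nwrong / nmasked
end
```

In this notebook it could be called as `masked_error_rate(X_complete, X_mendel, Xm_original)`. Assuming observed entries are written back unchanged, the masked-only rate should be approximately the overall rate divided by the missing proportion.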

# Beagle 5 error

In [18]:
# run beagle 5 and import imputed data 
run(`java -jar beagle.28Sep18.793.jar gt=./compare3/target_masked.vcf.gz ref=./compare3/haplo_ref.vcf.gz out=./compare3/beagle.result`)

# beagle 5 error rate
X_beagle = convert_gt(Float32, "./compare3/beagle.result.vcf.gz", as_minorallele=false)
error_rate = sum(X_complete .!= X_beagle) / p / n

beagle.28Sep18.793.jar (version 5.0)
Copyright (C) 2014-2018 Brian L. Browning
Enter "java -jar beagle.28Sep18.793.jar" to list command line argument
Start time: 10:38 AM PST on 13 Jan 2020

Command line: java -Xmx3641m -jar beagle.28Sep18.793.jar
  gt=./compare3/target_masked.vcf.gz
  ref=./compare3/haplo_ref.vcf.gz
  out=./compare3/beagle.result

No genetic map is specified: using 1 cM = 1 Mb

Reference samples:         500
Study samples:             100

Window 1 (1:1-3000)
Reference markers:       3,000
Study markers:           3,000

Burnin  iteration 1:           1 second
Burnin  iteration 2:           0 seconds
Burnin  iteration 3:           0 seconds
Burnin  iteration 4:           0 seconds
Burnin  iteration 5:           1 second
Burnin  iteration 6:           0 seconds

Phasing iteration 1:           0 seconds
Phasing iteration 2:           0 seconds
Phasing iteration 3:           0 seconds
Phasing iteration 4:           0 seconds
Phasing iteration 5:           0 seconds
Phasi

0.04188

# Minimac4 error

Minimac4 reads the reference panel in m3vcf format, so we first need to convert the reference VCF file to m3vcf using Minimac3 (run on Hoffman):

```Julia
minimac3 = "/u/home/b/biona001/haplotype_comparisons/minimac3/Minimac3/bin/Minimac3"
@time run(`$minimac3 --refHaps haplo_ref.vcf.gz --processReference --prefix haplo_ref`)
```

In [22]:
# run minimac 4
minimac4 = "/Users/biona001/Benjamin_Folder/UCLA/research/softwares/Minimac4/build/minimac4"
run(`$minimac4 --refHaps ./compare3/haplo_ref.m3vcf.gz --haps ./compare3/target_masked.vcf.gz --prefix ./compare3/minimac4.result`)



 -------------------------------------------------------------------------------- 
          Minimac4 - Fast Imputation Based on State Space Reduction HMM
 --------------------------------------------------------------------------------
           (c) 2014 - Sayantan Das, Christian Fuchsberger, David Hinds
                             Mary Kate Wing, Goncalo Abecasis 

 Version: 1.0.2;
 Built: Mon Sep 30 11:52:22 PDT 2019 by biona001

 Command Line Options: 
       Reference Haplotypes : --refHaps [./compare3/haplo_ref.m3vcf.gz],
                              --passOnly, --rsid, --referenceEstimates [ON],
                              --mapFile [docs/geneticMapFile.b38.map.txt.gz]
          Target Haplotypes : --haps [./compare3/target_masked.vcf.gz]
          Output Parameters : --prefix [./compare3/minimac4.result],
                              --estimate, --nobgzip, --vcfBuffer [200],
                              --format [GT,DS], --allTypedSites, --meta,
                       

Process(`/Users/biona001/Benjamin_Folder/UCLA/research/softwares/Minimac4/build/minimac4 --refHaps ./compare3/haplo_ref.m3vcf.gz --haps ./compare3/target_masked.vcf.gz --prefix ./compare3/minimac4.result`, ProcessExited(0))

In [23]:
X_minimac = convert_gt(Float32, "./compare3/minimac4.result.dose.vcf.gz", as_minorallele=false)
error_rate = sum(X_complete .!= X_minimac) / n / p

0.030166666666666665