# Compare MendelImpute against Minimac4 and Beagle5 on simulated data

In comparison 2 (the `compare2` directory), we distinguish typed and untyped SNPs: only about 50% of all SNPs are typed, and 10% of the typed genotypes are masked as missing.

In [1]:
using Revise
using VCFTools
using MendelImpute
using GeneticVariation
using Random
using SparseArrays
using JLD2, FileIO, JLSO
using ProgressMeter
using GroupSlices
using ThreadPools
# using Plots
# using ProfileView

┌ Info: Precompiling MendelImpute [e47305d1-6a61-5370-bc5d-77554d143183]
└ @ Base loading.jl:1273


# Simulate data

### Step 0. Install `msprime`

[msprime installation instructions](https://msprime.readthedocs.io/en/stable/installation.html).

Some people might need to activate the conda environment via `conda config --set auto_activate_base True`. You can turn it off once the simulation is done by running `conda config --set auto_activate_base False`.


### Step 1. Simulate data in terminal

```
python3 msprime_script.py 40000 10000 10000000 2e-8 2e-8 2020 > full.vcf
```

Arguments: 
+ Number of haplotypes = 40000
+ Effective population size = 10000 ([source](https://www.the-scientist.com/the-nutshell/ancient-humans-more-diverse-43556))
+ Sequence length = 10 million (same as Beagle 5's choice)
+ Recombination rate = 2e-8 (default)
+ Mutation rate = 2e-8 (default)
+ Seed = 2020 (the value passed in the command above)
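
As a rough sketch, the positional arguments listed above can be assembled into the same command from Julia. The helper name `msprime_command` and its keyword names are hypothetical; only the argument order and values are taken from the Step 1 command, and `msprime_script.py` is assumed to sit in the working directory.

```julia
# Hypothetical helper: builds the Step 1 command from named arguments.
# The argument order follows the terminal command; keyword names are made up.
function msprime_command(; n_haplotypes = 40000, Ne = 10000, seqlen = 10_000_000,
                         recombination = 2e-8, mutation = 2e-8, seed = 2020,
                         script = "msprime_script.py")
    return `python3 $script $n_haplotypes $Ne $seqlen $recombination $mutation $seed`
end

cmd = msprime_command()
println(cmd)

# To actually run the simulation and write stdout to full.vcf
# (requires python3 and msprime to be installed):
# run(pipeline(cmd, stdout = "full.vcf"))
```

Note that Julia prints `2e-8` as `2.0e-8` when interpolating it into the command; this should parse to the same floating-point value on the Python side.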

### Step 2: Convert simulated haplotypes to reference haplotypes and target genotype files

In [13]:
cd("./compare2/")
function filter_and_mask(maf)
    # filter chromosome data for unique snps
    println("filtering for unique snps")
    data = "full.vcf"
    full_record_index = .!find_duplicate_marker(data)
    @time VCFTools.filter(data, full_record_index, 1:nsamples(data), 
        des = "full.uniqueSNPs.vcf.gz")

    # summarize data
    println("summarizing data")
    total_snps, samples, _, _, _, maf_by_record, _ = gtstats("full.uniqueSNPs.vcf.gz")

    # generate target panel with all snps
    println("generating complete target panel")
    n = 1000
    sample_idx = falses(samples)
    sample_idx[1:n] .= true
    shuffle!(sample_idx)
    @time VCFTools.filter("full.uniqueSNPs.vcf.gz", 1:total_snps, 
        sample_idx, des = "target.full.vcf.gz", allow_multiallelic=false)

    # also generate reference panel without target samples
    println("generating reference panel without target samples")
    @time VCFTools.filter("full.uniqueSNPs.vcf.gz", 1:total_snps, 
        .!sample_idx, des = "ref.excludeTarget.vcf.gz", allow_multiallelic=false)

    # generate target file with 1000 samples and typed snps with certain maf
    println("generating target file with typed snps only")
    my_maf = findall(x -> x > maf, maf_by_record)  
    p = length(my_maf)
    record_idx = falses(total_snps)
    record_idx[my_maf] .= true
    @time VCFTools.filter("full.uniqueSNPs.vcf.gz", record_idx, sample_idx, 
        des = "target.typedOnly.maf$maf.vcf.gz", allow_multiallelic=false)

    # unphase and mask 10% of entries in target file (missingprop = 0.1)
    println("unphasing and masking entries in target file with typed snps only")
    masks = falses(p, n)
    missingprop = 0.1
    for j in 1:n, i in 1:p
        rand() < missingprop && (masks[i, j] = true)
    end
    @time mask_gt("target.typedOnly.maf$maf.vcf.gz", masks, 
        des="target.typedOnly.maf$maf.masked.vcf.gz", unphase=true)

    # finally compress reference file to jlso format
    widths  = [32, 64, 128, 256, 512]
    reffile = "ref.excludeTarget.vcf.gz"
    tgtfile = "target.typedOnly.maf$maf.masked.vcf.gz"
    H, H_sampleID, H_chr, H_pos, H_ids, H_ref, H_alt = convert_ht(Bool, reffile, trans=true, save_snp_info=true, msg="importing reference data...")
    X, X_sampleID, X_chr, X_pos, X_ids, X_ref, X_alt = VCFTools.convert_gt(UInt8, tgtfile, trans=true, save_snp_info=true, msg = "Importing genotype file...")
    for width in widths
        outfile = "ref.excludeTarget.w$width.jlso"
        @time compress_haplotypes(H, X, outfile, X_pos, H_sampleID, H_chr, H_pos, H_ids, H_ref, H_alt, width)
    end
end
Random.seed!(2020)
maf = 0.01
@time filter_and_mask(maf)

summarizing data


Progress: 100%|█████████████████████████████████████████| Time: 0:08:31


generating complete target panel
generating target file with typed snps only
193.850375 seconds (3.78 G allocations: 288.466 GiB, 15.40% gc time)
unphasing and masking entries in target file with typed snps only
  8.499865 seconds (73.79 M allocations: 5.496 GiB, 7.70% gc time)


importing reference data...100%|████████████████████████| Time: 0:03:23
Importing genotype file...100%|█████████████████████████| Time: 0:00:06


137.175658 seconds (10.71 M allocations: 16.615 GiB, 1.54% gc time)
 95.677300 seconds (2.16 M allocations: 9.058 GiB, 1.69% gc time)
 76.646906 seconds (1.95 M allocations: 5.037 GiB, 1.10% gc time)
 65.061100 seconds (1.72 M allocations: 2.962 GiB, 0.32% gc time)
 58.832663 seconds (1.49 M allocations: 1.891 GiB, 0.42% gc time)
1377.550815 seconds (12.76 G allocations: 1.129 TiB, 8.42% gc time)
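
The masking step inside `filter_and_mask` can be tried in isolation. The following is a minimal sketch on toy dimensions (not the real panel) that reproduces the element-wise Bernoulli masking and checks that roughly 10% of entries end up masked. The vectorized variant is an alternative formulation added here for illustration, not something the notebook uses.

```julia
using Random

Random.seed!(2020)          # seed chosen to mirror the notebook; any seed works
p, n = 2_000, 500           # toy sizes, much smaller than the real target panel
missingprop = 0.1

# Same idea as the double loop in filter_and_mask:
# each entry is masked independently with probability missingprop
masks = falses(p, n)
for j in 1:n, i in 1:p
    rand() < missingprop && (masks[i, j] = true)
end

# Vectorized form (draws different random numbers, same distribution)
masks_vec = rand(p, n) .< missingprop

println("masked fraction (loop)       = ", count(masks) / length(masks))
println("masked fraction (vectorized) = ", count(masks_vec) / length(masks_vec))
```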


In [3]:
# compress reference file to jlso format
widths  = [32, 64, 128, 256, 512]
reffile = "./compare2/ref.excludeTarget.vcf.gz"
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
H, H_sampleID, H_chr, H_pos, H_ids, H_ref, H_alt = convert_ht(Bool, reffile, trans=true, save_snp_info=true, msg="importing reference data...")
X, X_sampleID, X_chr, X_pos, X_ids, X_ref, X_alt = VCFTools.convert_gt(UInt8, tgtfile, trans=true, save_snp_info=true, msg = "Importing genotype file...")
for width in widths
    outfile = "./compare2/ref.excludeTarget.w$width.jlso"
    @time compress_haplotypes(H, X, outfile, X_pos, H_sampleID, H_chr, H_pos, H_ids, H_ref, H_alt, width)
end

importing reference data...100%|████████████████████████| Time: 0:03:39
Importing genotype file...100%|█████████████████████████| Time: 0:00:06


150.522498 seconds (26.58 M allocations: 17.362 GiB, 1.67% gc time)
113.844457 seconds (2.16 M allocations: 9.063 GiB, 1.52% gc time)
105.889895 seconds (1.95 M allocations: 5.043 GiB, 0.86% gc time)
 73.975767 seconds (1.72 M allocations: 2.967 GiB, 0.36% gc time)
 65.916813 seconds (1.49 M allocations: 1.896 GiB, 0.31% gc time)


In [3]:
# load jlso
@time loaded = JLSO.load("./compare2/ref.excludeTarget.w512.jlso")
compressed_Hunique = loaded[:compressed_Hunique];

  0.784125 seconds (1.77 M allocations: 206.863 MiB)


In [32]:
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.vcf.gz"
@show nrecords(tgtfile), nsamples(tgtfile)
@show nrecords(reffile), nsamples(reffile);

(nrecords(tgtfile), nsamples(tgtfile)) = (36874, 1000)
(nrecords(reffile), nsamples(reffile)) = (89913, 19000)
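
A quick sanity check on these dimensions, written as a sketch. The counts are copied from the output above. Two things are assumptions rather than documented behavior: that the simulation script pairs the 40000 haplotypes into 20000 diploid samples, and that the number of windows is the typed SNP count divided by the window width, rounded down. Both are consistent with the numbers reported in this notebook.

```julia
typed_snps, target_samples = 36874, 1000     # nrecords/nsamples of the target file
ref_snps, ref_samples      = 89913, 19000    # nrecords/nsamples of the reference file
width = 512                                  # largest width in the compressed panels

# Fraction of reference SNPs that are typed in the target
typed_fraction = typed_snps / ref_snps
println("typed fraction = ", round(typed_fraction, digits = 3))

# Assumption: 40000 simulated haplotypes form 20000 diploid samples
total_samples = 40000 ÷ 2
println("reference samples as expected: ", total_samples - target_samples == ref_samples)

# Assumption: windows are counted as floor(typed SNPs / width)
n_windows = fld(typed_snps, width)
println("windows at width $width = ", n_windows)
```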


# MendelImpute with dynamic programming

In [22]:
# keep best pair only (1 thread)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:06
Computing optimal haplotype pairs...100%|███████████████| Time: 0:00:11
Merging breakpoints...100%|█████████████████████████████| Time: 0:01:12
Writing to file...100%|█████████████████████████████████| Time: 0:00:06


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 8.07908 seconds
    Computing haplotype pair        = 11.4215 seconds
        BLAS3 mul! to get M and N      = 0.209483 seconds per thread
        haplopair search               = 9.27352 seconds per thread
        supplying constant terms       = 0.0368894 seconds per thread
        finding redundant happairs     = 0.800724 seconds per thread
    Phasing by dynamic programming  = 72.6589 seconds
    Imputation                      = 7.44617 seconds

 99.605137 seconds (77.18 M allocations: 7.509 GiB, 1.09% gc time)
error_rate = 0.0003560219323123464
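
The error rate printed above is the fraction of genotype entries that differ between the imputed and the true matrices. A minimal sketch of that calculation on toy data follows; the helper name `genotype_error_rate` is hypothetical, and it assumes neither matrix contains `missing` entries (which should hold after imputation).

```julia
# Hypothetical helper mirroring sum(X .!= Y) / n / p from the cell above.
# Assumes both matrices are fully observed (no `missing` entries).
function genotype_error_rate(X::AbstractMatrix, Y::AbstractMatrix)
    size(X) == size(Y) || throw(DimensionMismatch("matrices must have the same size"))
    return count(X .!= Y) / length(X)
end

# Toy genotype dosages (0, 1, 2): 3 samples x 4 SNPs, made up for illustration
X_true    = Float32[0 1 2 0; 1 1 0 2; 2 0 0 1]
X_imputed = Float32[0 1 2 0; 1 2 0 2; 2 0 0 0]   # two of twelve entries differ

println("error_rate = ", genotype_error_rate(X_imputed, X_true))
```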


In [3]:
# keep best pair only (8 threads)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:06
Merging breakpoints...100%|█████████████████████████████| Time: 0:00:11
Writing to file...100%|█████████████████████████████████| Time: 0:00:06


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 8.06139 seconds
    Computing haplotype pair        = 3.04445 seconds
        BLAS3 mul! to get M and N      = 0.0601239 seconds per thread
        haplopair search               = 2.45815 seconds per thread
        supplying constant terms       = 0.00534291 seconds per thread
        finding redundant happairs     = 0.155477 seconds per thread
    Phasing by dynamic programming  = 11.9017 seconds
    Imputation                      = 7.78007 seconds

 30.787638 seconds (77.18 M allocations: 7.555 GiB, 3.16% gc time)
error_rate = 0.0003560219323123464
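
To put the 1-thread and 8-thread runs side by side, the sketch below computes speedup and parallel efficiency from the timings reported above. The numbers are copied from the two outputs; the field names are made up for this example.

```julia
# Wall-clock times (seconds) copied from the two dynamic programming runs
t1 = (total = 99.605137, haplopair = 11.4215, dp = 72.6589)   # 1 thread
t8 = (total = 30.787638, haplopair = 3.04445, dp = 11.9017)   # 8 threads
nthreads = 8

for key in keys(t1)
    speedup = t1[key] / t8[key]
    efficiency = speedup / nthreads
    println(rpad(string(key), 10), " speedup = ", round(speedup, digits = 2),
            ", parallel efficiency = ", round(efficiency, digits = 2))
end
```

Data import and imputation take roughly the same time in both runs, which limits the overall speedup relative to the speedup of the multithreaded steps.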


# MendelImpute with intersecting haplotype sets

In [29]:
# keep best pair only (1 thread)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:07
Computing optimal haplotype pairs...100%|███████████████| Time: 0:00:10
Writing to file...100%|█████████████████████████████████| Time: 0:00:06


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 8.93587 seconds
    Computing haplotype pair        = 10.8287 seconds
        BLAS3 mul! to get M and N      = 0.205239 seconds per thread
        haplopair search               = 9.21358 seconds per thread
        supplying constant terms       = 0.0362449 seconds per thread
        finding redundant happairs     = 0.14647 seconds per thread
    Phasing by dynamic programming  = 0.285193 seconds
    Imputation                      = 7.447 seconds

 27.497075 seconds (76.86 M allocations: 7.590 GiB, 4.95% gc time)
error_rate = 8.32693826254268e-5


In [4]:
# keep best pair only (8 threads)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:07
Writing to file...100%|█████████████████████████████████| Time: 0:00:06


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 8.4682 seconds
    Computing haplotype pair        = 3.15799 seconds
        BLAS3 mul! to get M and N      = 0.0703812 seconds per thread
        haplopair search               = 2.43136 seconds per thread
        supplying constant terms       = 0.00515782 seconds per thread
        finding redundant happairs     = 0.0605905 seconds per thread
    Phasing by dynamic programming  = 1.03799 seconds
    Imputation                      = 7.96755 seconds

 20.656982 seconds (78.63 M allocations: 7.666 GiB, 6.22% gc time)
error_rate = 8.32693826254268e-5


In [3]:
# keep best pair only (8 threads)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:07
Intersecting haplotypes...100%|█████████████████████████| Time: 0:00:06
Writing to file...100%|█████████████████████████████████| Time: 0:00:06


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 8.2824 seconds
    Computing haplotype pair        = 3.04796 seconds
        BLAS3 mul! to get M and N      = 0.0692387 seconds per thread
        haplopair search               = 2.42519 seconds per thread
        finding redundant happairs     = 0.230196 seconds per thread
    Phasing by win-win intersection = 6.7157 seconds
    Imputation                      = 7.6443 seconds

 25.690553 seconds (78.93 M allocations: 8.117 GiB, 4.05% gc time)
error_rate = 8.32693826254268e-5


# Haplotype thinning

In [3]:
# do not scale by allele frequency (8 threads, BLAS3)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false, thinning_factor=2000, thinning_scale_allelefreq=false);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:06
Computing optimal haplotype pairs...100%|███████████████| Time: 0:00:18
Writing to file...100%|█████████████████████████████████| Time: 0:00:06


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 8.09351 seconds
    Computing haplotype pair        = 19.4003 seconds
        computing dist(X, H)           = 0.0835071 seconds per thread
        BLAS3 mul! to get M and N      = 0.0698497 seconds per thread
        haplopair search               = 16.578 seconds per thread
        finding redundant happairs     = 0.0334275 seconds per thread
    Phasing by win-win intersection = 1.08949 seconds
    Imputation                      = 7.86786 seconds

 36.514919 seconds (83.09 M allocations: 8.048 GiB, 2.99% gc time)
error_rate = 8.47485903039605e-5


In [7]:
# account for allele frequency (8 threads, BLAS3)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false, thinning_factor=2000, thinning_scale_allelefreq=true);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:07
Computing optimal haplotype pairs...100%|███████████████| Time: 0:00:09
Writing to file...100%|█████████████████████████████████| Time: 0:00:06


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 8.2541 seconds
    Computing haplotype pair        = 9.62084 seconds
        computing dist(X, H)           = 0.0777321 seconds per thread
        BLAS3 mul! to get M and N      = 0.0672944 seconds per thread
        haplopair search               = 8.29248 seconds per thread
        finding redundant happairs     = 0.0243294 seconds per thread
    Phasing by win-win intersection = 0.199977 seconds
    Imputation                      = 7.78355 seconds

 25.858466 seconds (77.08 M allocations: 7.910 GiB, 4.31% gc time)
error_rate = 8.35029417325637e-5


In [6]:
# do not scale by allele frequency (8 threads, repeated BLAS2)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false, thinning_factor=2000, thinning_scale_allelefreq=false);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:07
Computing optimal haplotype pairs...100%|███████████████| Time: 0:02:40
Writing to file...100%|█████████████████████████████████| Time: 0:00:11


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 8.45103 seconds
    Computing haplotype pair        = 161.011 seconds
        computing dist(X, H)           = 0.30734 seconds per thread
        BLAS3 mul! to get M and N      = 138.637 seconds per thread
        haplopair search               = 11.381 seconds per thread
        finding redundant happairs     = 0.0473707 seconds per thread
    Phasing by win-win intersection = 0.437003 seconds
    Imputation                      = 14.5971 seconds

184.494531 seconds (77.52 M allocations: 7.704 GiB, 0.62% gc time)
error_rate = 8.523795224272351e-5


In [3]:
# account for allele frequency (8 threads, repeated BLAS2)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false, thinning_factor=2000, thinning_scale_allelefreq=true);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:07
Computing optimal haplotype pairs...100%|███████████████| Time: 0:02:16
Writing to file...100%|█████████████████████████████████| Time: 0:00:07


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 11.1718 seconds
    Computing haplotype pair        = 137.322 seconds
        computing dist(X, H)           = 0.20913 seconds per thread
        BLAS3 mul! to get M and N      = 120.249 seconds per thread
        haplopair search               = 9.6079 seconds per thread
        finding redundant happairs     = 0.0608349 seconds per thread
    Phasing by win-win intersection = 2.07374 seconds
    Imputation                      = 9.9719 seconds

160.622111 seconds (100.26 M allocations: 8.957 GiB, 1.00% gc time)
error_rate = 8.346957614582985e-5


# Try Lasso

In [3]:
# keep best pair only, lasso = 20 (8 threads)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    lasso = 20, dynamic_programming=false);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
# X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:08
Computing optimal haplotype pairs...100%|███████████████| Time: 0:00:05
Writing to file...100%|█████████████████████████████████| Time: 0:00:07


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 9.14087 seconds
    Computing haplotype pair        = 6.42583 seconds
        BLAS3 mul! to get M and N      = 5.02618 seconds per thread
        haplopair search               = 0.327736 seconds per thread
        finding redundant happairs     = 0.0284033 seconds per thread
    Phasing by win-win intersection = 0.233125 seconds
    Imputation                      = 8.88681 seconds

 24.848782 seconds (77.85 M allocations: 7.633 GiB, 5.19% gc time)
error_rate = 8.673940364574645e-5


In [3]:
# keep best pair only, lasso = 100 (8 threads)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    lasso = 100, dynamic_programming=false);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:08
Computing optimal haplotype pairs...100%|███████████████| Time: 0:00:07
Writing to file...100%|█████████████████████████████████| Time: 0:00:08


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 10.3558 seconds
    Computing haplotype pair        = 8.18339 seconds
        BLAS3 mul! to get M and N      = 5.77199 seconds per thread
        haplopair search               = 1.31318 seconds per thread
        finding redundant happairs     = 0.0313764 seconds per thread
    Phasing by win-win intersection = 0.244059 seconds
    Imputation                      = 9.67417 seconds

 28.457500 seconds (77.15 M allocations: 7.597 GiB, 5.51% gc time)
error_rate = 8.536029272741427e-5


# Beagle 5.1 error

In [None]:
# convert to bref3 (run in terminal)
java -jar ../bref3.18May20.d20.jar ref.excludeTarget.vcf.gz > ref.excludeTarget.bref3 

In [31]:
# run beagle 5 (8 threads)
run(`java -jar beagle.18May20.d20.jar gt=compare2/target.typedOnly.maf0.01.masked.vcf.gz ref=compare2/ref.excludeTarget.bref3 out=compare2/beagle.result nthreads=8`)

# beagle 5 error rate
X_complete = convert_gt(Float32, "compare2/target.full.vcf.gz")
n, p = size(X_complete)
X_beagle = convert_gt(Float32, "compare2/beagle.result.vcf.gz")
error_rate = sum(X_beagle .!= X_complete) / n / p

beagle.18May20.d20.jar (version 5.1)
Copyright (C) 2014-2018 Brian L. Browning
Enter "java -jar beagle.18May20.d20.jar" to list command line argument
Start time: 01:01 AM PDT on 30 Jun 2020

Command line: java -Xmx3641m -jar beagle.18May20.d20.jar
  gt=compare2/target.typedOnly.maf0.01.masked.vcf.gz
  ref=compare2/ref.excludeTarget.bref3
  out=compare2/beagle.result
  nthreads=8

No genetic map is specified: using 1 cM = 1 Mb

Reference samples:      19,000
Study samples:           1,000

Window 1 (1:34-9999816)
Reference markers:      89,913
Study markers:          36,874

Burnin  iteration 1:           36 seconds
Burnin  iteration 2:           32 seconds
Burnin  iteration 3:           35 seconds
Burnin  iteration 4:           43 seconds
Burnin  iteration 5:           37 seconds
Burnin  iteration 6:           1 minute 16 seconds

Phasing iteration 1:           2 minutes 0 seconds
Phasing iteration 2:           41 seconds
Phasing iteration 3:           38 seconds
Phasing iteration 4:  

1.7794979591382782e-5
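
For a side-by-side view, the sketch below collects the error rates reported so far in this notebook and prints them relative to the lowest one. All values are copied from the outputs above; nothing is recomputed, and the labels are informal.

```julia
# Error rates copied from the outputs above (label, error rate)
errors = [
    ("MendelImpute, dynamic programming",    0.0003560219323123464),
    ("MendelImpute, haplotype intersection", 8.32693826254268e-5),
    ("Beagle 5.1",                           1.7794979591382782e-5),
]

best = minimum(last, errors)
for (name, err) in sort(errors, by = last)
    println(rpad(name, 40), err, "  (", round(err / best, digits = 1), "x lowest)")
end
```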

# Minimac4 error

We first need to convert the reference VCF file to m3vcf format using Minimac3 (run on Hoffman):

```Julia
minimac3 = "/u/home/b/biona001/haplotype_comparisons/minimac3/Minimac3/bin/Minimac3"
@time run(`$minimac3 --refHaps haplo_ref.vcf.gz --processReference --prefix haplo_ref`)
```

In [None]:
# use eagle 2.4 for prephasing

In [None]:
# run minimac 4
minimac4 = "/Users/biona001/Benjamin_Folder/UCLA/research/softwares/Minimac4/build/minimac4"
run(`$minimac4 --refHaps haplo_ref.m3vcf.gz --haps target_masked.vcf.gz --prefix minimac4.result`)
    
X_minimac = convert_gt(Float32, "minimac4.result.dose.vcf.gz", as_minorallele=false)
error_rate = sum(X_minimac .!= X_complete) / n / p

# BLAS 3

In [2]:
# do not scale by allele frequency (8 threads, BLAS3)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false, thinning_factor=2000, thinning_scale_allelefreq=false);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:07
Computing optimal haplotype pairs...100%|███████████████| Time: 0:00:19
Writing to file...100%|█████████████████████████████████| Time: 0:00:06


Total windows = 72, averaging ~ 627 unique haplotypes per window.

Timings: 
    Data import                     = 11.2299 seconds
    Computing haplotype pair        = 20.639 seconds
        computing dist(X, H)           = 0.10026 seconds per thread
        BLAS3 mul! to get M and N      = 0.0936877 seconds per thread
        haplopair search               = 13.9145 seconds per thread
        finding redundant happairs     = 0.0360827 seconds per thread
    Phasing by win-win intersection = 1.08454 seconds
    Imputation                      = 8.11408 seconds

 59.805070 seconds (161.01 M allocations: 12.053 GiB, 4.43% gc time)
error_rate = 8.47485903039605e-5
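
The "BLAS3" and "BLAS2" labels presumably refer to computing a matrix product with one matrix-matrix multiply versus a loop of matrix-vector multiplies. The sketch below is a generic illustration of that difference using `LinearAlgebra.mul!`; it is not MendelImpute's internal code, and the matrix names and sizes are hypothetical.

```julia
using LinearAlgebra, Random

Random.seed!(2020)
p, n, d = 64, 10, 20            # toy sizes: SNPs per window, samples, haplotypes
X = rand(Float32, p, n)         # stand-in for a window of target genotypes
H = rand(Float32, p, d)         # stand-in for a window of reference haplotypes

# BLAS level 3: one matrix-matrix multiply computes all of X' * H at once
N3 = zeros(Float32, n, d)
mul!(N3, transpose(X), H)

# BLAS level 2: one matrix-vector multiply per column of H
N2 = zeros(Float32, n, d)
for j in 1:d
    mul!(view(N2, :, j), transpose(X), view(H, :, j))
end

println("results agree: ", N3 ≈ N2)
```

Both variants compute the same product; the level 3 call hands the whole problem to the BLAS library at once, which typically allows better cache reuse than many small level 2 calls.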


# BLAS 2

In [None]:
# do not scale by allele frequency (8 threads, BLAS2)
Random.seed!(2020)
width   = 512
tgtfile = "./compare2/target.typedOnly.maf0.01.masked.vcf.gz"
reffile = "./compare2/ref.excludeTarget.w$width.jlso"
outfile = "./compare2/mendel.imputed.vcf.gz"
@time hs, ph = phase(tgtfile, reffile, outfile = outfile, width = width,
    dynamic_programming = false, thinning_factor=2000, thinning_scale_allelefreq=false);

# import imputed result and compare with true
X_mendel = convert_gt(Float32, outfile)
X_complete = convert_gt(Float32, "./compare2/target.full.vcf.gz")
n, p = size(X_mendel)
println("error_rate = ", sum(X_mendel .!= X_complete) / n / p)
rm(outfile, force=true)

Importing genotype file...100%|█████████████████████████| Time: 0:00:07
Computing optimal haplotype pairs... 71%|██████████▋    |  ETA: 0:01:21