# Generate HMM knockoffs

This tutorial closely follows the [knockoffgwas tutorial](https://msesia.github.io/knockoffgwas/tutorial.html). We go over how to generate (SHAPEIT) HMM knockoffs given (simulated) binary PLINK formatted data. 

## Installation

1. Install [knockoffgwas](https://github.com/msesia/knockoffgwas) and its dependencies
2. Install [qctools](https://www.well.ox.ac.uk/~gav/qctool_v2/documentation/download.html) for converting between VCF and BGEN formats
3. Install [RaPID](https://github.com/ZhiGroup/RaPID) for detecting IBD segments (**this only run on linux**)
4. Install the following Julia packages. Within julia, type
```julia
]add SnpArrays Distributions ProgressMeter MendelIHT VCFTools StatsBase
]add https://github.com/biona001/Knockoffs.jl
```
5. **Modify these executable path to the ones installed on your local computer.** Here the `partition_script` is located under the path you installed knockoffgwas: `knockoffgwas/knockoffgwas/utils/partition.R`.

In [None]:
qctools_exe = "/scratch/users/bbchu/qctool/build/release/qctool_v2.0.7"
snpknock2_exe = "/scratch/users/bbchu/knockoffgwas/snpknock2/bin/snpknock2"
rapid_exe = "/scratch/users/bbchu/RaPID/RaPID_v.1.7"
partition_script = "/scratch/users/bbchu/knockoffgwas/knockoffgwas/utils/partition.R";

## Required inputs

We need multiple input files to generate knockoffs

+ Unphased genotypes in binary PLINK format
+ Phased genotypes in VCF and BGEN format: we will simulate haplotypes, store in VCF format, and convert to BGEN using [qctools](https://www.well.ox.ac.uk/~gav/qctool_v2/documentation/examples/converting.html)
    + Note: knockoffgwas requires only BGEN format, but RaPID requires VCF formats. Hence, the extra conversion
+ Map file (providing different group resolution): since data is simulated, we will generate fake map file where every snp is 1cM apart
+ IBD segment file (generated by [RaPID](https://github.com/ZhiGroup/RaPID) which requires VCF inputs)
+ Variant partition files (generated by snpknock2, i.e. module 2 of [knockoffgwas](https://github.com/msesia/knockoffgwas)) 
+ Sample and variant QC files

## Simulate genotypes


Our simulation will try to follow
+ Adding population structure (admixture & cryptic relatedness): [New approaches to population stratification in genome-wide association studies](https://www.nature.com/articles/nrg2813)
+ How to simulate siblings: [Using Extended Genealogy to Estimate Components of Heritability for 23 Quantitative and Dichotomous Traits
](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003520)

### Population structure

Specifically, lets simulate genotypes with 2 populations. We simulate 49950 normally differentiated markers and 50 unusually differentiated markers based on allele frequency difference equal to 0.6. Let $x_{ij}$ be the number of alternate allele count for sample $i$ at SNP $j$ with allele frequency $p_j$. Also let $h_{ij, 1}$ denotype haplotype 1 of sample $i$ at SNP $j$ and $h_{ij, 2}$ the second haplotype. Our simulation model is

$$h_{ij, 1} \sim Bernoulli(p_j), \quad h_{ij, 2} \sim Bernoulli(p_j), \quad x_{ij} = h_{ij, 1} + h_{ij, 2}$$

which is equivalent to  

$$x_{ij} \sim Binomial(2, p_j)$$

for unphased data. The allele frequency is $p_{j} = Uniform(0, 1)$ for normally differentiated markers, and 

$$p_{pop1, j} \sim Uniform(0, 0.4), \quad p_{pop2, j} = p_{pop1, j} + 0.6$$

for abnormally differentiated markers. Each sample is randomly assigned to population 1 or 2. 

### Sibling pairs

Based on the simulated data above, we can randomly sample pairs of individuals and have them produce offspring. Here, half of all offsprings will be siblings with the other half. This is done by first randomly sampling 2 samples from to represent parent. Assume they have 2 children. Then generate offspring individuals by copying segments of one parent haplotype directly to the corresponding haplotype of the offspring. This recombination event will produce IBD segments. The number of recombination is 1 or 2 per chromosome, and is chosen uniformly across the chromosome. 

In [1]:
# load Julia packages and some helper functions
using SnpArrays
using Knockoffs
using DelimitedFiles
using Random
using LinearAlgebra
using Distributions
using ProgressMeter
using MendelIHT
using VCFTools
using StatsBase

These are helper functions needed for this tutorial (essentially data simulation + some glue code). It is not crucial users understand what they are doing.  

In [3]:
"""
    simulate_pop_structure(n, p)

Simulate genotypes with K = 2 populations. 300 SNPs will have different allele 
frequencies between the populations, where 50 of them will be causal

# Inputs
- `plinkfile`: Output plink file name. 
- `n`: Number of samples
- `p`: Number of SNPs

# Output
- `x1`: n×p matrix of the 1st haplotype for each sample. Each row is a haplotype
- `x2`: n×p matrix of the 2nd haplotype for each sample. `x = x1 + x2`
- `populations`: Vector of length `n` indicating population membership for eachsample. 
- `diff_markers`: Indices of the differentially expressed alleles.

# Reference
https://www.nature.com/articles/nrg2813
"""
function simulate_pop_structure(n::Int, p::Int)
    # first simulate genotypes treating all samples equally
    x1 = BitMatrix(undef, n, p)
    x2 = BitMatrix(undef, n, p)
    pmeter = Progress(p, 0.1, "Simulating genotypes...")
    @inbounds for j in 1:p
        d = Bernoulli(rand())
        for i in 1:n
            x1[i, j] = rand(d)
            x2[i, j] = rand(d)
        end
        next!(pmeter)
    end
    # assign populations and simulate 300 unually differentiated markers
    populations = rand(1:2, n)
    diff_markers = sample(1:p, 300, replace=false)
    @inbounds for j in diff_markers
        pop1_allele_freq = 0.4rand()
        pop2_allele_freq = pop1_allele_freq + 0.6
        pop1_dist = Bernoulli(pop1_allele_freq)
        pop2_dist = Bernoulli(pop2_allele_freq)
        for i in 1:n
            d = isone(populations[i]) ? pop1_dist : pop2_dist
            x1[i, j] = rand(d)
            x2[i, j] = rand(d)
        end
    end
    return x1, x2, populations, diff_markers
end

"""
    simulate_IBD(parent_plinkfile, offsprings)

Simulate recombination events. Half of all offsprings will be siblings with the other half.
This is done by first randomly sampling 2 samples from to represent parent. Assume they have
2 children. Then generate offspring individuals by copying segments of the parents haplotype
directly to the offspring to represent IBD segments. The number of segments (i.e. places of
recombination) is 1 or 2 per chromosome, and is chosen uniformly across the chromosome. 

# Inputs
- `h1`: `n × p` matrix of the 1st haplotype for each parent. Each row is a haplotype
- `h2`: `n × p` matrix of the 2nd haplotype for each parent. `H = h1 + h2`
- `offsprings`: Total number of offsprings

# Output
- `x1`: `offsprings × p` matrix of the 1st haplotype for each offspring. Each row is a haplotype
- `x2`: `offsprings × p` matrix of the 2nd haplotype for each offspring. `x = x1 + x2`

# References
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003520
"""
function simulate_IBD(h1::AbstractMatrix, h2::AbstractMatrix, offsprings::Int)
    # first handle errors
    n, p = size(h1)
    iseven(offsprings) || error("number of offsprings should be even")
    # randomly designate gender for parents
    sex = bitrand(n)
    male_idx = findall(x -> x == true, sex)
    female_idx = findall(x -> x == false, sex)
    # simulate new samples
    x1 = falses(n, p)
    x2 = falses(n, p)
    fathers = Int[]
    mothers = Int[]
    pmeter = Progress(offsprings >> 1, 0.1, "Simulating IBD segments...")
    for i in 1:(offsprings >> 1)
        # assign parents
        dad = rand(male_idx)
        mom = rand(female_idx)
        push!(fathers, dad)
        push!(mothers, mom)
        # child 1
        recombinations = rand(1:2)
        breakpoints = sort!(sample(1:p, recombinations, replace=false))
        parent1, parent2 = rand() < 0.5 ? (dad, mom) : (mom, dad)
        segments = recombination_segments(breakpoints, p)
        for j in 1:length(segments)
            parent = isodd(j) ? parent1 : parent2
            segment = segments[j]
            # perform recombination
            parent_hap = rand() < 0.5 ? h1 : h2
            child_hap = rand() < 0.5 ? x1 : x2
            copyto!(@view(child_hap[2i - 1, segment]), @view(parent_hap[parent, segment]))
        end
        # child 2
        recombinations = rand(1:2)
        breakpoints = sort!(sample(1:p, recombinations, replace=false))
        parent1, parent2 = rand() < 0.5 ? (dad, mom) : (mom, dad)
        segments = recombination_segments(breakpoints, p)
        for j in 1:length(segments)
            parent = isodd(j) ? parent1 : parent2
            segment = segments[j]
            # perform recombination
            parent_hap = rand() < 0.5 ? h1 : h2
            child_hap = rand() < 0.5 ? x1 : x2
            copyto!(@view(child_hap[2i - 1, segment]), @view(parent_hap[parent, segment]))
        end
        # update progress
        next!(pmeter)
    end
    return x1, x2, fathers, mothers
end

function recombination_segments(breakpoints::Vector{Int}, snps::Int)
    start = 1
    result = UnitRange{Int}[]
    for bkpt in breakpoints
        push!(result, start:bkpt)
        start = bkpt + 1
    end
    push!(result, breakpoints[end]+1:snps)
    return result
end

function write_plink(outfile::AbstractString, x1::AbstractMatrix, x2::AbstractMatrix)
    n, p = size(x1)
    x = SnpArray(outfile * ".bed", n, p)
    for j in 1:p, i in 1:n
        c = x1[i, j] + x2[i, j]
        if c == 0
            x[i, j] = 0x00
        elseif c == 1
            x[i, j] = 0x02
        elseif c == 2
            x[i, j] = 0x03
        else
            error("matrix entries should be 0, 1, or 2 but was $c!")
        end
    end
    # create .bim file structure: https://www.cog-genomics.org/plink2/formats#bim
    open(outfile * ".bim", "w") do f
        for i in 1:p
            println(f, "1\tsnp$i\t0\t$(100i)\t1\t2")
        end
    end
    # create .fam file structure: https://www.cog-genomics.org/plink2/formats#fam
    open(outfile * ".fam", "w") do f
        for i in 1:n
            println(f, "$i\t1\t0\t0\t1\t-9")
        end
    end
    return nothing
end

function make_fake_mapfile(filename, p::Int)
    open(filename, "w") do io
        println(io, "Chromosome\tPosition(bp)\tRate(cM/Mb)\tMap(cM)")
        for i in 1:p
            println(io, "chr1\t", 100i, "\t1.0\t1.0")
        end
    end
end

make_fake_mapfile (generic function with 1 method)

Simulate phased data with 2 populations, 49700 usually differentiated markers, and 300 unusually differentiated markers. Then simulate mating, which generates IBD segments. Finally, make unphased data from offspring haplotypes. 

In [4]:
# simulate phased genotypes
Random.seed!(2021)
outfile = "sim"
n = 2000
p = 50000
h1, h2, populations, diff_markers = simulate_pop_structure(n, p)

# simulate random mating to get IBD segments
offsprings = 2000
x1, x2, fathers, mothers = simulate_IBD(h1, h2, offsprings)

# write phased genotypes to VCF format
write_vcf("sim.phased.vcf.gz", x1, x2)

# write unphased genotypes to PLINK binary format
write_plink(outfile, x1, x2)

# save pop1/pop2 index and unually differentiated marker indices
writedlm("populations.txt", populations)
writedlm("diff_markers.txt", diff_markers)

[32mSimulating genotypes...100%|████████████████████████████| Time: 0:00:02[39m
[32mSimulating IBD segments...100%|█████████████████████████| Time: 0:00:03[39m
[32mWriting VCF...100%|█████████████████████████████████████| Time: 0:00:21[39m


## Step 1: Partitions

We need
+ Map file (in particular the (cM) field will determine group resolution)
+ PLINK's bim file
+ QC file (all SNP names that pass QC)
+ output file name

Since data is simulated, there are no genomic map file. Let us generate a fake one. 

In [5]:
# generate fake map file
make_fake_mapfile("sim.map", p)

# also generate QC file that contains all SNPs and all samples
snpdata = SnpData("sim")
snpIDs = snpdata.snp_info[!, :snpid]
sampleIDs = Matrix(snpdata.person_info[!, 1:2])
writedlm("variants_qc.txt", snpIDs)
writedlm("samples_qc.txt", sampleIDs)

Now we run the partition script

In [6]:
plinkfile = "sim"
mapfile = "sim.map"
qc_variants = "variants_qc.txt"
outfile = "sim.partition.txt"
partition(partition_script, plinkfile, mapfile, qc_variants, outfile)

1: Quick-TRANSfer stage steps exceeded maximum (= 2500000) 
2: Quick-TRANSfer stage steps exceeded maximum (= 2500000) 
3: Quick-TRANSfer stage steps exceeded maximum (= 2500000) 


Mean group sizes: 
res_7 res_6 res_5 res_4 res_3 res_2 res_1 
    1 10000 10000 10000 10000 10000 10000 
Partitions written to: sim.partition.txt


Process(`[4mRscript[24m [4m--vanilla[24m [4m/scratch/users/bbchu/knockoffgwas/knockoffgwas/utils/partition.R[24m [4msim.map[24m [4msim.bim[24m [4mvariants_qc.txt[24m [4msim.partition.txt[24m`, ProcessExited(0))

## Step 2: Generate Knockoffs

First generate IBD segment files. This requires postprocessing the output of RapID, as [described here](https://github.com/msesia/knockoffgwas/issues/2). 

In [12]:
vcffile = "sim.phased.vcf.gz"
mapfile = "sim.map"
min_length = 1    # minimum IBD length in cM
outfolder = "rapid"
window_size = 5     # number of SNPs per window
r = 1              # number of runs
s = 10               # number of successes
rapid(rapid_exe, vcffile, mapfile, min_length, outfolder, window_size, r, s)

Create sub-samples..
Done!
0 %100 %

Process(`[4m/scratch/users/bbchu/RaPID/RaPID_v.1.7[24m [4m-i[24m [4msim.phased.vcf.gz[24m [4m-g[24m [4msim.map[24m [4m-d[24m [4m1[24m [4m-o[24m [4mrapid[24m [4m-w[24m [4m5[24m [4m-r[24m [4m1[24m [4m-s[24m [4m10[24m`, ProcessExited(0))

Next convert VCF file to BGEN format (note: sample file must be saved separately)

In [13]:
# convert VCF to BGEN format
outfile = "sim.bgen"
run(`$qctools_exe -g $vcffile -og $outfile`)

# then save sample file separately
open("sim.sample", "w") do io
    println(io, "ID_1 ID_2 missing sex")
    println(io, "0 0 0 D")
    for i in 1:n
        println(io, "$i 1 0 1")
    end
end


Welcome to qctool
(version: 2.0.7, revision )

(C) 2009-2017 University of Oxford

Opening genotype files                                      : [******************************] (1/1,0.0s,41.3/s)

Input SAMPLE file(s):         Output SAMPLE file:             "(n/a)".
Sample exclusion output file:   "(n/a)".

Input GEN file(s):
                                                    (not computed)  "sim.phased.vcf.gz"
                                         (total 1 sources, number of snps not computed).
                      Number of samples: 2000
Output GEN file(s):             "sim.bgen"
Output SNP position file(s):    (n/a)
Sample filter:                  .
# of samples in input files:    2000.
# of samples after filtering:   2000 (0 filtered out).


Processing SNPs                                             :  (50000/?,142.7s,350.4/s)39809/?,119.8s,332.3/s):  (42338/?,126.9s,333.7/s)
Total: 50000SNPs.

Number of SNPs:
                     -- in input file(s):                 (not c

The first few lines of the BGEN `.sample` file looks like:

In [18]:
;head sim.sample

ID_1 ID_2 missing sex
0 0 0 D
1 1 0 1
2 1 0 1
3 1 0 1
4 1 0 1
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 1


Finally, generate HMM knockoffs by running the following code in the command line directly. You may need to adjust file directories and change parameters. 

In [19]:
bgenfile = "sim"
sample_qc = "samples_qc.txt"
variant_qc = "variants_qc.txt"
mapfile = "sim.map"
partfile = "sim.partition.txt"
ibdfile = "rapid/results.max"
K = 10
cluster_size_min = 1000 
cluster_size_max = 10000 
hmm_rho = 1
hmm_lambda = 1e-3 
windows = 0
n_threads = 1
seed = 2020
compute_references = true
generate_knockoffs = true
outfile = "sim.knockoffs"

@time snpknock2(snpknock2_exe, bgenfile, sample_qc, variant_qc, mapfile, partfile, ibdfile, 
    K, cluster_size_min, cluster_size_max, hmm_rho, hmm_lambda, windows, n_threads, 
    seed, compute_references, generate_knockoffs, outfile)

	+----------------------+
	|                      |
	|  SNPKNOCK2, v0.3     |
	|  July 21, 2020       |
	|  Matteo Sesia        |
	|                      |
	+----------------------+

Copyright (C) 2020 Stanford University.
Distributed under the GNU GPLv3 open source license.

Use --help for more information.

Command line arguments:
  --bgen sim
  --keep samples_qc.txt
  --extract variants_qc.txt
  --map sim.map
  --part sim.partition.txt
  --ibd rapid/results.max
  --K 10
  --cluster_size_min 1000
  --cluster_size_max 10000
  --hmm-rho 1
  --hmm-lambda 0.001
  --windows 0
  --n_threads 1
  --seed 2020
  --compute-references
  --generate-knockoffs
  --out ./knockoffs/sim.knockoffs

Requested operations:
  --compute-references
  --generate_knockoffs


--------------------------------------------------------------------------------
Loading metadata
--------------------------------------------------------------------------------
Loading sample information from:
  sim.sample
Loading legend

┌ Info: snpknock2 command:
│ `/scratch/users/bbchu/knockoffgwas/snpknock2/bin/snpknock2 --bgen sim --keep samples_qc.txt --extract variants_qc.txt --map sim.map --part sim.partition.txt --ibd rapid/results.max --K 10 --cluster_size_min 1000 --cluster_size_max 10000 --hmm-rho 1 --hmm-lambda 0.001 --windows 0 --n_threads 1 --seed 2020 --compute-references --generate-knockoffs --out ./knockoffs/sim.knockoffs`
└ @ Knockoffs /home/users/bbchu/.julia/packages/Knockoffs/A1mdh/src/hmm_wrapper.jl:66
┌ Info: Output directory: /home/users/bbchu/hmm/knockoffs
└ @ Knockoffs /home/users/bbchu/.julia/packages/Knockoffs/A1mdh/src/hmm_wrapper.jl:67


Loading partitions from:
  sim.partition.txt
Loading IBD segments from:
  rapid/results.max
Loaded 0 IBD segments.

Printing summary of 1 windows:
     0: 0--50000
Summary of metadata for chromosome 1:
  number of samples (after | before filtering) : 2000 | 2000
  number of SNPs (after | before filtering)    : 50000 | 50000
  number of variant partitions                 : 7
  size of genomic windows                      : whole-chromosome
  number of IBD segments                       : 0


--------------------------------------------------------------------------------
Kinship (using only haplotype data)
--------------------------------------------------------------------------------
Reached 1
Chromosome 1 will be loaded from:
  haplotype file            : sim.bgen
  sample file               : sim.sample
  legend file               : sim.bim
  thinning factor           : 10
  sample filter file        : samples_qc.txt
  variant filter file       : variants_qc.txt
  number of SNPs    


Bifurcating K-means
Smallest allowed cluster size: 1000
 step	 cluster	    size	    left	   right	accepted
Bifurcating K-means completed after 0 steps.
Number of clusters: 1.




Reached 2
Individual global references written to:
  ./knockoffs/sim.knockoffs_lref.txt

Individual local references written to:
  ./knockoffs/sim.knockoffs_ref.txt


--------------------------------------------------------------------------------
Knockoffs for chromosome 1
--------------------------------------------------------------------------------
Chromosome 1 will be loaded from:
  haplotype file            : sim
  haplotype file format     : bgen
  sample file               : sim.sample
  legend file               : sim.bim
  map file                  : sim.map
  sample filter file        : samples_qc.txt
  variant filter file       : variants_qc.txt
  number of SNPs            : 50000
  number of windows         : 1
  number of haplotypes      : 4000

Loading data for chromosome 1
Reading BGEN file using 1 thread:
|....................................................................................................|

Initializing HMM with user-supplied hyperparameters: rho = 1

Process(`[4m/scratch/users/bbchu/knockoffgwas/snpknock2/bin/snpknock2[24m [4m--bgen[24m [4msim[24m [4m--keep[24m [4msamples_qc.txt[24m [4m--extract[24m [4mvariants_qc.txt[24m [4m--map[24m [4msim.map[24m [4m--part[24m [4msim.partition.txt[24m [4m--ibd[24m [4mrapid/results.max[24m [4m--K[24m [4m10[24m [4m--cluster_size_min[24m [4m1000[24m [4m--cluster_size_max[24m [4m10000[24m [4m--hmm-rho[24m [4m1[24m [4m--hmm-lambda[24m [4m0.001[24m [4m--windows[24m [4m0[24m [4m--n_threads[24m [4m1[24m [4m--seed[24m [4m2020[24m [4m--compute-references[24m [4m--generate-knockoffs[24m [4m--out[24m [4m./knockoffs/sim.knockoffs[24m`, ProcessExited(0))

## Examine Generated Knockoffs

The generated knockoffs are saved in binary PLINK format, we can import it using SnpArrays

In [20]:
x = SnpArray("knockoffs/sim.knockoffs_res0.bed")

2000×100000 SnpArray:
 0x02  0x02  0x02  0x02  0x02  0x02  …  0x02  0x02  0x02  0x02  0x02  0x02
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x02  0x02  0x00  0x02  0x02  0x02     0x00  0x00  0x03  0x03  0x03  0x03
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x02  0x02     0x00  0x00  0x00  0x02  0x02  0x02
 0x00  0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x02  0x00  0x02  0x02     0x02  0x02  0x00  0x02  0x02  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x02  0x00  0x00     0x02  0x02  0x02  0x02  0x03  0x03
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x02  0x02  0x02  0x02  0x02  0x02  …  0x02  0x02  0x03  0x03  0x03  0x03
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x02  0x02  0x02  0x03  0x03     0x02  0x02  0x02  0x02  0x02  0x02
   

Notice there are 100k SNPs: the original 50k SNPs and their knockoffs. Reading the SNP names will tell us which are the originals:

In [22]:
snpid = SnpData("knockoffs/sim.knockoffs_res0").snp_info.snpid

100000-element Vector{String}:
 "snp1.k"
 "snp1"
 "snp2"
 "snp2.k"
 "snp3.k"
 "snp3"
 "snp4.k"
 "snp4"
 "snp5.k"
 "snp5"
 "snp6"
 "snp6.k"
 "snp7"
 ⋮
 "snp49995"
 "snp49995.k"
 "snp49996.k"
 "snp49996"
 "snp49997.k"
 "snp49997"
 "snp49998.k"
 "snp49998"
 "snp49999"
 "snp49999.k"
 "snp50000.k"
 "snp50000"

## Step 3: Model selection with knockoffs

Tutorial for this part coming soon! Basically, one constructs a `SnpLinAlg`, feed that into [MendelIHT.jl](https://github.com/OpenMendel/MendelIHT.jl), and calculate knockoff statistics afterwards using built-in functions like `coefficient_diff` and `threshold`.

`SnpLinAlg` performs compressed linear algebra (often faster than double precision BLAS) and `MendelIHT.jl` is a very efficient implementation of the iterative hard thresholding algorithm. For model selection, IHT is known to be superior to standard LASSO, elastic net, and MCP solvers. 

In [23]:
xla = SnpLinAlg{Float64}(x, center=true, scale=true, impute=true)

2000×100000 SnpLinAlg{Float64}:
  1.53523    1.40781    0.908525  …   0.892035   0.37139   0.376085
 -0.552097  -0.587668  -0.772373     -0.780008  -1.09078  -1.08728
  1.53523    1.40781   -0.772373      2.56408    1.83356   1.83945
 -0.552097  -0.587668  -0.772373     -0.780008  -1.09078  -1.08728
 -0.552097  -0.587668  -0.772373      0.892035   0.37139   0.376085
 -0.552097  -0.587668  -0.772373  …  -0.780008  -1.09078  -1.08728
 -0.552097  -0.587668   0.908525      0.892035   0.37139  -1.08728
 -0.552097  -0.587668  -0.772373     -0.780008  -1.09078  -1.08728
 -0.552097  -0.587668  -0.772373      0.892035   1.83356   1.83945
 -0.552097  -0.587668  -0.772373     -0.780008  -1.09078  -1.08728
  1.53523    1.40781    0.908525  …   2.56408    1.83356   1.83945
 -0.552097  -0.587668  -0.772373     -0.780008  -1.09078  -1.08728
 -0.552097   1.40781    0.908525      0.892035   0.37139   0.376085
  ⋮                               ⋱                       
  1.53523   -0.587668   2.58942    