# Generate HMM knockoffs

This tutorial closely follows the [knockoffgwas tutorial](https://msesia.github.io/knockoffgwas/tutorial.html). We go over how to generate (SHAPEIT) HMM knockoffs given (simulated) binary PLINK formatted data. 

## Installation

1. Install [knockoffgwas](https://github.com/msesia/knockoffgwas) and its dependencies
2. Install [qctool](https://www.well.ox.ac.uk/~gav/qctool_v2/documentation/download.html) for converting between VCF and BGEN formats
3. Install [RaPID](https://github.com/ZhiGroup/RaPID) for detecting IBD segments
4. Install the following Julia packages. Within Julia, type
```julia
]add SnpArrays Distributions ProgressMeter MendelIHT VCFTools
```

## Required inputs

We need multiple input files to generate knockoffs:

+ Unphased genotypes in binary PLINK format
+ Phased genotypes in VCF and BGEN format: we will simulate haplotypes, store them in VCF format, and convert to BGEN using [qctool](https://www.well.ox.ac.uk/~gav/qctool_v2/documentation/examples/converting.html)
    + Note: knockoffgwas requires only the BGEN format, but RaPID requires the VCF format, hence the extra conversion
+ Map file (the genetic distances determine the group resolution): since the data are simulated, we will generate a fake map file where every SNP is 1 cM apart
+ IBD segment file (generated by [RaPID](https://github.com/ZhiGroup/RaPID))
+ Variant partition files (generated by snpknock2, i.e. module 2 of [knockoffgwas](https://github.com/msesia/knockoffgwas)) 
+ Sample and variant QC files

## Simulate genotypes


Our simulation tries to follow:
+ Adding population structure (admixture & cryptic relatedness): [New approaches to population stratification in genome-wide association studies](https://www.nature.com/articles/nrg2813)
+ How to simulate siblings: [Using Extended Genealogy to Estimate Components of Heritability for 23 Quantitative and Dichotomous Traits](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003520)

### Population structure

Specifically, let's simulate genotypes with 2 populations. We simulate 49950 normally differentiated markers and 50 unusually differentiated markers whose allele frequencies differ by 0.6 between the two populations. Let $x_{ij}$ be the alternate allele count for sample $i$ at SNP $j$ with allele frequency $p_j$. Also let $h_{ij, 1}$ denote haplotype 1 of sample $i$ at SNP $j$ and $h_{ij, 2}$ the second haplotype. Our simulation model is

$$h_{ij, 1} \sim Bernoulli(p_j), \quad h_{ij, 2} \sim Bernoulli(p_j), \quad x_{ij} = h_{ij, 1} + h_{ij, 2}$$

which is equivalent to  

$$x_{ij} \sim Binomial(2, p_j)$$

for unphased data. The allele frequency is $p_{j} \sim Uniform(0, 1)$ for normally differentiated markers, and

$$p_{pop1, j} \sim Uniform(0, 0.4), \quad p_{pop2, j} = p_{pop1, j} + 0.6$$

for unusually differentiated markers. Each sample is randomly assigned to population 1 or 2.
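The equivalence between the two-haplotype model and the binomial model can be checked empirically. The following is a minimal sketch that uses only Julia's standard library; the helper name `genotype_proportions` is hypothetical and not part of this tutorial's code.

```julia
using Random

# Hypothetical helper (not from the tutorial): draw two Bernoulli(p) haplotypes
# per sample and return the observed proportions of genotypes 0, 1 and 2.
function genotype_proportions(rng::AbstractRNG, p::Real, n::Int)
    counts = zeros(Int, 3)
    for _ in 1:n
        h1 = rand(rng) < p          # haplotype 1 ~ Bernoulli(p)
        h2 = rand(rng) < p          # haplotype 2 ~ Bernoulli(p)
        counts[h1 + h2 + 1] += 1    # x = h1 + h2 takes values 0, 1, 2
    end
    return counts ./ n
end

p = 0.3
observed = genotype_proportions(MersenneTwister(2021), p, 100_000)
expected = [(1 - p)^2, 2p * (1 - p), p^2]   # Binomial(2, p) probabilities
println("observed: ", observed)
println("expected: ", expected)
```

With a large number of samples, the observed proportions should be close to the `Binomial(2, p)` probabilities, which is why the unphased genotypes can be obtained by simply summing the two simulated haplotypes.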

### Sibling pairs

**Note: sibling pairs are not included yet, but will be easily added soon**.

```julia
# load Julia packages and some helper functions
using SnpArrays
using DelimitedFiles
using Random
using LinearAlgebra
using Distributions
using ProgressMeter
using MendelIHT
using VCFTools

"""
    simulate_pop_structure(n, p)

Simulate phased genotypes with K = 2 populations. 50 SNPs will have different
allele frequencies between the populations.

# Inputs
- `n`: Number of samples
- `p`: Number of SNPs

# Output
- `x1`, `x2`: `n × p` `BitMatrix` holding the first and second haplotype of
    each sample
- `populations`: Vector of length `n` indicating population membership for each
    sample
- `diff_markers`: Indices of the 50 unusually differentiated markers
"""
function simulate_pop_structure(n::Int, p::Int)
    # first simulate genotypes treating all samples equally
    x1 = BitMatrix(undef, n, p)
    x2 = BitMatrix(undef, n, p)
    pmeter = Progress(p, 0.1, "Simulating genotypes...")
    @inbounds for j in 1:p
        d = Bernoulli(rand())
        for i in 1:n
            x1[i, j] = rand(d)
            x2[i, j] = rand(d)
        end
        next!(pmeter)
    end
    # assign populations and simulate 50 unusually differentiated markers
    populations = rand(1:2, n)
    diff_markers = sample(1:p, 50, replace=false)
    @inbounds for j in diff_markers
        pop1_allele_freq = 0.4rand()
        pop2_allele_freq = pop1_allele_freq + 0.6
        pop1_dist = Bernoulli(pop1_allele_freq)
        pop2_dist = Bernoulli(pop2_allele_freq)
        for i in 1:n
            d = isone(populations[i]) ? pop1_dist : pop2_dist
            x1[i, j] = rand(d)
            x2[i, j] = rand(d)
        end
    end
    return x1, x2, populations, diff_markers
end

"""
Simplified VCF writer, taken from https://github.com/OpenMendel/MendelImpute.jl/blob/master/src/impute.jl#L23
"""
function write_vcf(outfile::AbstractString, x1::AbstractMatrix, x2::AbstractMatrix)
    n, p = size(x1)
    chr = ["1"     for i in 1:p]
    pos = [100i    for i in 1:p]
    ids = ["snp$i" for i in 1:p]
    ref = ["A"     for i in 1:p]
    alt = ["T"     for i in 1:p]
    io = openvcf(outfile * ".vcf.gz", "w")
    pb = PipeBuffer()
    
    # minimum meta info
    print(pb, "##fileformat=VCFv4.2\n")
    print(pb, "##source=Knockoff.jl\n")
    print(pb, "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">\n")

    # header line (includes sample ID)
    sampleID = [string(i) for i in 1:n]
    print(pb, "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT")
    for id in sampleID
        print(pb, "\t", id)
    end
    print(pb, "\n")
    bytesavailable(pb) > 1048576 && write(io, take!(pb))
    
    # loop over genotypes
    pmeter = Progress(p, 1, "Writing VCF...")
    @inbounds for i in 1:p
        # write meta info (chrom/pos/snpid/ref/alt/imputation-quality)
        print(pb, chr[i], "\t", string(pos[i]), "\t", ids[i], "\t", 
            ref[i], "\t", alt[i], "\t.\tPASS\t.\tGT")
        # print ith record
        write_snp!(pb, x1, x2, i) 
        bytesavailable(pb) > 1048576 && write(io, take!(pb))
        next!(pmeter)
    end
    write(io, take!(pb))
    close(io); close(pb) # close io and buffer
end

"""
Helper function for saving the `i`th record (SNP), tracking phase information.
Here `X = X1 + X2`.
"""
function write_snp!(pb::IO, X1::AbstractMatrix, X2::AbstractMatrix, i::Int)
    x1 = @view(X1[:, i])
    x2 = @view(X2[:, i])
    n = length(x1)
    @assert n == length(x2)
    @inbounds for j in 1:n
        if x1[j] == x2[j] == 0
            print(pb, "\t0|0")
        elseif x1[j] == 0 && x2[j] == 1
            print(pb, "\t0|1")
        elseif x1[j] == 1 && x2[j] == 0
            print(pb, "\t1|0")
        elseif x1[j] == 1 && x2[j] == 1
            print(pb, "\t1|1")
        else
            error("phased genotypes can only be 0|0, 0|1, 1|0 or 1|1 but " *
                "got $(x1[j])|$(x2[j])")
        end
    end
    print(pb, "\n")
    return nothing
end

# Write unphased genotypes (x1 + x2) as binary PLINK files (.bed/.bim/.fam)
function write_plink(outfile::AbstractString, x1::AbstractMatrix, x2::AbstractMatrix)
    n, p = size(x1)
    x = SnpArray(outfile * ".bed", n, p)
    # PLINK 2-bit codes: 0x00 = homozygous for the first allele,
    # 0x02 = heterozygous, 0x03 = homozygous for the second allele
    for j in 1:p, i in 1:n
        c = x1[i, j] + x2[i, j]
        if c == 0
            x[i, j] = 0x00
        elseif c == 1
            x[i, j] = 0x02
        elseif c == 2
            x[i, j] = 0x03
        else
            error("matrix entries should be 0, 1, or 2 but was $c!")
        end
    end
    # create .bim file structure: https://www.cog-genomics.org/plink2/formats#bim
    open(outfile * ".bim", "w") do f
        for i in 1:p
            println(f, "1\tsnp$i\t0\t$(100i)\t1\t2")
        end
    end
    # create .fam file structure: https://www.cog-genomics.org/plink2/formats#fam
    open(outfile * ".fam", "w") do f
        for i in 1:n
            println(f, "$i\t1\t0\t0\t1\t-9")
        end
    end
    return nothing
end

# Write a fake genetic map file with one row per SNP
function make_fake_mapfile(filename, p::Int)
    open(filename, "w") do io
        println(io, "Chromosome\tPosition(bp)\tRate(cM/Mb)\tMap(cM)")
        for i in 1:p
            println(io, "chr1\t", 100i, "\t1.0\t1.0")
        end
    end
end

# Call the knockoffgwas partition R script on the given map, bim and QC files
function partition(plinkfile, rscriptPath, mapfile, qc_variants, outfile)
    bimfile = plinkfile * ".bim"
    run(`Rscript --vanilla $rscriptPath $mapfile $bimfile $qc_variants $outfile`)
end

# Call the RaPID executable, creating the output folder if it does not exist
function RaPID(executable_path, vcffile, mapfile, min_length, outfolder, window_size, r, s)
    isdir(outfolder) || mkdir(outfolder)
    run(`$executable_path -i $vcffile -g $mapfile -d $min_length -o $outfolder -w $window_size -r $r -s $s`)
end
```

Simulate phased and unphased data

```julia
# simulate
Random.seed!(2021)
outfile = "sim"
n = 2000
p = 50000
x1, x2, populations, diff_markers = simulate_pop_structure(n, p)

# write phased genotypes to VCF format
write_vcf(outfile * ".phased", x1, x2)

# write unphased genotypes to PLINK binary format
write_plink(outfile, x1, x2)

# save pop1/pop2 index and unusually differentiated marker indices
writedlm("populations.txt", populations)
writedlm("diff_markers.txt", diff_markers)
```
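One quick sanity check on simulated haplotypes is whether the unusually differentiated markers really show an allele frequency gap of roughly 0.6 between the two populations. Below is a sketch built around a hypothetical helper, `allele_freq_gap` (not part of the original notebook), demonstrated on a tiny hand-made example so that it runs without any packages.

```julia
# Hypothetical helper (not from the tutorial): per-SNP difference in alternate
# allele frequency between population 2 and population 1.
function allele_freq_gap(x1::AbstractMatrix, x2::AbstractMatrix, populations::AbstractVector)
    pop1 = findall(==(1), populations)
    pop2 = findall(==(2), populations)
    # each sample carries two haplotypes, so divide allele counts by 2 * (number of samples)
    freq(rows) = vec(sum(x1[rows, :] .+ x2[rows, :], dims=1)) ./ (2 * length(rows))
    return freq(pop2) .- freq(pop1)
end

# toy data: 4 samples, 2 SNPs; SNP 1 is fixed for opposite alleles in the two
# populations, SNP 2 has the same frequency in both
toy_x1 = Bool[0 1; 0 0; 1 1; 1 0]
toy_x2 = Bool[0 0; 0 1; 1 0; 1 1]
toy_pops = [1, 1, 2, 2]
gap = allele_freq_gap(toy_x1, toy_x2, toy_pops)
println(gap)
```

In the notebook, one could call `allele_freq_gap(x1, x2, populations)` and compare the entries indexed by `diff_markers` against the remaining entries; the former should cluster near 0.6 and the latter near 0.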



```julia
;ls
```

```
diff_markers.txt
hmm.ipynb
populations.txt
qc_variants.txt
sim
sim.bed
sim.bim
sim.fam
sim.map
sim.partition.txt
sim.phased.vcf.gz
```


## Step 1: Partitions

We need
+ Map file (in particular, the cM field determines the group resolution)
+ PLINK's bim file
+ QC file (names of all SNPs that pass QC)
+ Output file name

Since the data are simulated, there is no genetic map file, so let us generate a fake one.

```julia
# generate fake map file
make_fake_mapfile("sim.map", p)

# also generate QC file that contains all SNPs and all samples
snpdata = SnpData("sim")
snpIDs = snpdata.snp_info[!, :snpid]
sampleIDs = Matrix(snpdata.person_info[!, 1:2])
writedlm("variants_qc.txt", snpIDs)
writedlm("samples_qc.txt", sampleIDs)
```
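Note that `make_fake_mapfile` writes the same value, 1.0, in the `Map(cM)` column for every SNP. If positions that increase by 1 cM per SNP are wanted instead, a hypothetical variant might look like the sketch below. This assumes the `Map(cM)` column is read as a cumulative genetic position, as in HapMap-style genetic maps; `make_spaced_mapfile` is not part of the original notebook and did not produce the outputs shown in this tutorial. The `Rate(cM/Mb)` column is left at the same placeholder value.

```julia
# Hypothetical variant of make_fake_mapfile: SNP i sits at i cM,
# so consecutive SNPs are exactly 1 cM apart.
function make_spaced_mapfile(filename::AbstractString, p::Int)
    open(filename, "w") do io
        println(io, "Chromosome\tPosition(bp)\tRate(cM/Mb)\tMap(cM)")
        for i in 1:p
            println(io, "chr1\t", 100i, "\t1.0\t", float(i))
        end
    end
    return filename
end

# write a 3-SNP example to a temporary file and read it back
demo_map = make_spaced_mapfile(tempname(), 3)
demo_lines = readlines(demo_map)
foreach(println, demo_lines)
```

Because the partition script derives group resolution from the cM field, switching to a map like this would likely change the resulting group sizes.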

Now we run the partition script

```julia
plinkfile = "sim"
rscriptPath = "/scratch/users/bbchu/knockoffgwas/knockoffgwas/utils/partition.R"
# rscriptPath = "/Users/biona001/Desktop/knockoffgwas/knockoffgwas/utils/partition.R"
mapfile = "sim.map"
qc_variants = "variants_qc.txt"
outfile = "sim.partition.txt"
partition(plinkfile, rscriptPath, mapfile, qc_variants, outfile)
```

```
1: Quick-TRANSfer stage steps exceeded maximum (= 2500000)
2: Quick-TRANSfer stage steps exceeded maximum (= 2500000)
3: Quick-TRANSfer stage steps exceeded maximum (= 2500000)
4: Quick-TRANSfer stage steps exceeded maximum (= 2500000)
5: Quick-TRANSfer stage steps exceeded maximum (= 2500000)

Mean group sizes:
res_7 res_6 res_5 res_4 res_3 res_2 res_1
    1 10000 10000 10000 10000 10000 10000
Partitions written to: sim.partition.txt

Process(`Rscript --vanilla /Users/biona001/Desktop/knockoffgwas/knockoffgwas/utils/partition.R sim.map sim.bim variants_qc.txt sim.partition.txt`, ProcessExited(0))
```

## Step 2: Generate Knockoffs

We need:
+ Phased genotypes in VCF and BGEN format: we will simulate haplotypes, store them in VCF format, and convert to BGEN using qctool
    + Note: knockoffgwas requires only the BGEN format, but RaPID requires the VCF format, hence the extra conversion
+ IBD segment file generated by RaPID, which takes the VCF file as input (**this seems to only run on Linux**)
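The VCF-to-BGEN conversion mentioned in the list is not run anywhere in this notebook. As a rough sketch, the command could be assembled in Julia as follows. The executable name `qctool` and the output name `sim.phased.bgen` are assumptions (your installed binary may carry a version suffix), and the command is only built and printed here, not executed.

```julia
# Assumed file names: the VCF written earlier and a BGEN file name of our choosing.
vcf_in = "sim.phased.vcf.gz"
bgen_out = "sim.phased.bgen"

# Build (but do not run) the conversion command; `-g` names the input genotype
# file and `-og` the output file. File formats should be inferred from the extensions.
convert_cmd = `qctool -g $vcf_in -og $bgen_out`
println(convert_cmd)

# Uncomment once qctool is installed and available on the PATH:
# run(convert_cmd)
```

Building the command as a `Cmd` object first makes it easy to inspect the exact arguments before anything is written to disk.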

First generate IBD segment files

```julia
# executable = "/scratch/users/bbchu/RaPID/RaPID_v.1.7"
executable = "/Users/biona001/Desktop/RaPID/RaPID_v.1.7"
vcffile = "sim.phased.vcf.gz"
mapfile = "sim.map"
min_length = 5
outfolder = "rapid"
window_size = 250
r = 10
s = 2
RaPID(executable, vcffile, mapfile, min_length, outfolder, window_size, r, s)
```

```
/Users/biona001/Desktop/RaPID/RaPID_v.1.7: /Users/biona001/Desktop/RaPID/RaPID_v.1.7: cannot execute binary file

LoadError: failed process: Process(`/Users/biona001/Desktop/RaPID/RaPID_v.1.7 -i sim.phased.vcf.gz -g sim.map -d 5 -o rapid -w 250 -r 10 -s 2`, ProcessExited(126)) [126]
```
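The exit code 126 above comes from trying to run the RaPID binary on a non-Linux machine. A small guard can avoid the failed process. The sketch below uses a hypothetical helper, `can_run_rapid`, and a placeholder path; neither is part of the original notebook.

```julia
# Hypothetical guard (not from the tutorial): only attempt to run RaPID when
# the operating system is Linux and the executable file exists.
function can_run_rapid(executable_path::AbstractString)
    return Sys.islinux() && isfile(executable_path)
end

rapid_path = "/path/to/RaPID_v.1.7"   # placeholder path, adjust to your install
if can_run_rapid(rapid_path)
    println("RaPID looks runnable at ", rapid_path)
else
    println("Skipping RaPID: needs Linux and an existing executable")
end
```

In the notebook, the call to `RaPID(...)` could be placed inside the first branch.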


Run the following code in the command line directly. You may need to adjust file directories and change parameters. 

```bash
# Path to the snpknock2 executable (part of knockoffgwas, see Installation)
SNPKNOCK2="/scratch/users/bbchu/knockoffgwas/snpknock2/bin/snpknock2"

# Create directory for output
mkdir -p "./tmp/knockoffs"

# Run snpknock2
$SNPKNOCK2 \
  --bgen "./data/haplotypes/example_chr{21:22}" \
  --keep "./samples_qc.txt" \
  --extract "./data/qc/qc_chr{21:22}.txt" \
  --map "./data/maps/genetic_map_chr{21:22}.txt" \
  --part "./data/partitions/example_chr{21:22}.txt" \
  --ibd "./data/ibd/example_chr{21:22}.txt" \
  --K 10 \
  --cluster_size_min 1000 \
  --cluster_size_max 10000 \
  --hmm-rho 1 \
  --hmm-lambda 1e-3 \
  --windows 0 \
  --n_threads 10 \
  --seed 2020 \
  --compute-references \
  --generate-knockoffs \
  --out "./tmp/knockoffs/knockoffs_chr{21:22}"
```