# LMM/GLMM analyses for UK Biobank data

This notebook implements pipelines for association analysis of binary and quantitative traits using BOLT-LMM (version 2.3.4), fastGWA, REGENIE, SAIGE and GMMAT.

## Aim

This pipeline was initially developed to perform genetic association analysis using various LMM methods on UK Biobank imputed data of ~500K individuals, although it can be used to analyze other studies.

## Input data

1. Genotype file for constructing the GRM (genetic relationship matrix), formatted as a PLINK binary fileset (`.bed/.bim/.fam`)
    - `--bfile=prefix`
2. Imputed genotype dosages in `bgen` format (`.bgen`, `.bgi`, `.sample`)
    - `--genoFile` and `--sampleFile`
3. Phenotype file: a whitespace-delimited file with column headers, whose first two columns should be FID and IID. Specify the file with `--phenoFile` and the phenotype to be analyzed with `--phenoCol`.
4. Covariates file (same format as the phenotype file), specified with `--covarFile`. Use `--covarCol` for qualitative covariates and `--qCovarCol` for quantitative ones. If `--covarFile` is not specified, the phenotype file is used as the covariate file. To specify a range of covariates you can use the `{i:j}` expansion, e.g. `--qCovarCol PC{1:20}`.
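
The file layout described above can be checked before launching a workflow. The following is a minimal sketch using only the Python standard library; `check_header` is a hypothetical helper written for illustration and is not part of the pipeline, and the toy values are made up.

```python
import io

def check_header(handle, required_cols=()):
    """Return the header columns of a whitespace-delimited file.

    Raises ValueError if the first two columns are not FID and IID,
    or if any of `required_cols` is missing from the header.
    """
    header = handle.readline().split()
    if header[:2] != ["FID", "IID"]:
        raise ValueError(f"first two columns must be FID and IID, got {header[:2]}")
    missing = [c for c in required_cols if c not in header]
    if missing:
        raise ValueError(f"columns not found in header: {missing}")
    return header

# Toy phenotype file (made-up values) standing in for a real phenotype file
toy = io.StringIO("FID IID BMI SEX AGE\n1 1 23.4 1 54\n2 2 27.1 2 61\n")
columns = check_header(toy, required_cols=["BMI", "SEX", "AGE"])
print(columns)
```

For a file on disk, the same function can be called on an open file object, e.g. `check_header(open("data/phenotypes.txt"), ["BMI"])`.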

Note: the reference genome used is **GRCh37/hg19**.

## Software specific inputs

### BoltLMM additional input

- Reference genetic maps, provided on BoltLMM website
    - `--geneticMapFile=tables/genetic_map_hg##.txt.gz`
- Reference LD scores, provided on BoltLMM website
- Use `--covarMaxLevels` to specify the maximum number of categories allowed for a qualitative covariate.

## Output

Our pipeline generates 

1. Summary statistics file for each variant analyzed
2. QQ and Manhattan plots for these summary statistics
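
To make the QQ plot output concrete, here is a small sketch of how QQ plot coordinates can be derived from a list of p-values. It is an illustration only: the plotting code used by the pipeline is not shown in this notebook, and the `(i - 0.5) / n` plotting position and the function name `qq_points` are assumptions chosen for this example.

```python
import math

def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs for a QQ plot.

    Under the null hypothesis the i-th smallest of n p-values is
    expected to be close to (i - 0.5) / n (an assumed convention).
    """
    observed = sorted(p for p in pvalues if 0 < p <= 1)
    n = len(observed)
    return [(-math.log10((i - 0.5) / n), -math.log10(p))
            for i, p in enumerate(observed, start=1)]

# Made-up p-values, only to show the shape of the result
points = qq_points([0.5, 0.01, 0.2, 0.9])
for expected, obs in points:
    print(f"{expected:.3f}\t{obs:.3f}")
```

Points lying above the diagonal (observed larger than expected) indicate an excess of small p-values.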

## References

To learn more about each of the specific methods applied in this pipeline please refer to the published papers and software documentation:

1. [Bolt-LMM](http://dx.doi.org/10.1038/ng.3190) and [documentation](https://alkesgroup.broadinstitute.org/BOLT-LMM)
2. [FastGWA](http://dx.doi.org/10.1038/s41588-019-0530-8) and [documentation](https://cnsgenomics.com/software/gcta/#Overview)
3. [REGENIE](https://www.biorxiv.org/content/10.1101/2020.06.19.162354v2) and [documentation](https://rgcgithub.github.io/regenie/)
4. [SAIGE](http://dx.doi.org/10.1038/s415) and [documentation](https://github.com/weizhouUMICH/SAIGE)
5. [GMMAT](https://github.com/hanchenphd/GMMAT) and [documentation](https://github.com/hanchenphd/GMMAT/blob/master/inst/doc/GMMAT.pdf)


## Command interface

In [1]:
sos run LMM.ipynb -h

usage: sos run LMM.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  boltlmm
  gcta
  fastGWA
  PLINK_QC
  regenie
  regenie_burden
  SAIGE

Global Workflow Options:
  --cwd VAL (as path, required)
                        the output directory for generated files
  --sampleFile . (as path)
                        Path to sample file
  --bfile VAL (as path, required)
                        Genotype files in plink binary this is used for
                        computing the GRM
  --genoFile  paths

                        Path to bgen or bed files
  --phenoFile VAL (as path, required)
                        Phenotype file for quantitative trait (BMI)
  --phenoCol VAL VAL ... (as t

## Global parameter setting

In [2]:
[global]
# the output directory for generated files
parameter: cwd = path
# Path to sample file
parameter: sampleFile = path('.')
# Genotype files in plink binary this is used for computing the GRM
parameter: bfile = path
# Path to bgen or bed files 
parameter: genoFile = paths
# Phenotype file for quantitative trait (BMI)
parameter: phenoFile = path
# Phenotype to be analyzed (specify the column)
parameter: phenoCol = list
# Covariate file path. Will use phenoFile if empty
parameter: covarFile = path('.')
# Summary statistics format file path used for unifying output column names. Will not unify names if empty
parameter: formatFile = path('.')
# Qualitative covariates to be used in the analysis
parameter: covarCol = []
# Quantitative covariates to be used in the analysis
parameter: qCovarCol = []
# Specific number of threads to use
parameter: numThreads = 2
# Minimum MAF to be used
parameter: bgenMinMAF = 0.001
# Minimum info score to be used
parameter: bgenMinINFO = 0.8
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# The container with the lmm software. Can be either a dockerhub image or a singularity `sif` file.
# Default is set to using dockerhub image
parameter: container_lmm = 'statisticalgenetics/lmm:1.9'
parameter: container_marp = 'gaow/marp'
if not covarFile.is_file():
    covarFile = phenoFile
cwd = path(f"{cwd:a}")

## Illustration with minimal working examples

```
JOB_OPT='-j 2'
```

### BOLT-LMM example command

On a minimal working example (MWE) dataset (about 1min to complete the analysis),

```
sos run LMM.ipynb boltlmm \
    --cwd output \
    --bfile data/genotypes.bed \
    --sampleFile data/imputed_genotypes.sample \
    --genoFile data/imputed_genotypes_chr*.bgen \
    --phenoFile data/phenotypes.txt \
    --formatFile data/boltlmm_template.yml \
    --LDscoresFile data/LDSCORE.1000G_EUR.tab.gz \
    --geneticMapFile data/genetic_map_hg19_withX.txt.gz \
    --phenoCol BMI \
    --covarCol SEX \
    --covarMaxLevels 10 \
    --qCovarCol AGE \
    --numThreads 5 \
    --bgenMinMAF 0.001 \
    --bgenMinINFO 0.1 \
    --lmm-option none \
    --p-filter 1 \
    $JOB_OPT
```

Please note that the command above is only meant to demonstrate the usage of the pipeline. Output is written to a folder called `output`. We set `--lmm-option` to `none` so that no LMM is run on this minimal dataset; accordingly, the p-value column used for the QQ and Manhattan plots (`--pval`) is `P_LINREG`, the p-value from conventional linear regression. In practice you will want to use one of the LMM options in BOLT-LMM. If `--lmm-option` is not specified, the default is the `--lmm` switch of `bolt`.
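
The way `--lmm-option` is turned into a `bolt` switch can be sketched in plain Python. This mirrors the selection expression used in the `boltlmm_1` step of this notebook; the function name `bolt_lmm_flag` is hypothetical and exists only for this illustration.

```python
def bolt_lmm_flag(lmm_option):
    """Map the workflow's --lmm-option value to a BOLT-LMM switch.

    Any value outside the three recognised options (for example
    'none') yields an empty string, so no LMM switch is passed.
    """
    valid = ("lmm", "lmmInfOnly", "lmmForceNonInf")
    return "--" + lmm_option if lmm_option in valid else ""

print(repr(bolt_lmm_flag("lmm")))         # '--lmm'
print(repr(bolt_lmm_flag("lmmInfOnly")))  # '--lmmInfOnly'
print(repr(bolt_lmm_flag("none")))        # ''
```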

### fastGWA example command

On a minimal working example (MWE) dataset (analysis completes almost instantly),

```
sos run LMM.ipynb fastGWA \
    --cwd output \
    --bfile data/genotypes.bed \
    --sampleFile data/imputed_genotypes.sample \
    --genoFile data/imputed_genotypes_chr*.bgen \
    --phenoFile data/phenotypes.txt \
    --formatFile data/fastGWA_template.yml \
    --phenoCol BMI \
    --covarCol SEX \
    --qCovarCol AGE \
    --numThreads 1 \
    --bgenMinMAF 0.001 \
    --bgenMinINFO 0.1 \
    --parts 2 \
    --p-filter 1 \
    $JOB_OPT
```

### REGENIE example command

On a minimal working example (MWE) dataset,

```
sos run LMM.ipynb regenie \
    --cwd output \
    --bfile data/genotypes21_22.bed \
    --maf-filter 0.001 \
    --sampleFile data/imputed_genotypes.sample \
    --genoFile data/imputed_genotypes_chr*.bgen \
    --phenoFile data/phenotypes.txt \
    --formatFile data/regenie_template.yml \
    --phenoCol ASTHMA T2D \
    --covarCol SEX \
    --qCovarCol AGE \
    --numThreads 8 \
    --bsize 1000 \
    --trait bt \
    --minMAC 4 \
    --bgenMinMAF 0.05 \
    --bgenMinINFO 0.8 \
    --reverse_log_p \
    --p-filter 1 \
    $JOB_OPT
```

### REGENIE burden example command

On a minimal working example (MWE) dataset,
```
sos run LMM.ipynb regenie_burden \
    --cwd output \
    --bfile genotypes_21_22_plink.exome.bed \
    --genoFile ukb23155_c22_b0_v1.plink.exome.filtered.bed \
    --phenoFile phenotype_burden.txt \
    --phenoCol ASTHMA T2D \
    --formatFile data/regenie_template.yml \
    --covarCol SEX \
    --qCovarCol AGE \
    --numThreads 8 \
    --bsize 1000 \
    --anno_file annotation_file.txt \
    --set_list set_list_file_chr22.txt \
    --mask_file mask_file.txt \
    --keep_gene keep_gene.txt \
    --aaf_bins 0.05 \
    --trait bt \
    --build_mask max
```

### SAIGE example command

On a minimal working example (MWE) dataset,

```
sos run LMM.ipynb SAIGE \
    --cwd output \
    --bfile data/genotypes.bed \
    --sampleFile data/imputed_genotypes.sample \
    --genoFile data/imputed_genotypes_chr*.bgen \
    --phenoFile data/phenotypes.txt \
    --phenoCol BMI \
    --covarCol SEX \
    --qCovarCol AGE \
    --numThreads 4 \
    --bgenMinMAF 0.001 \
    --bgenMinINFO 0.1 \
    --trait_type quantitative \
    --pval p.val \
    --bp POS \
    --p-filter 1 \
    $JOB_OPT
```

### Input using `bed` format

```
sos run LMM.ipynb fastGWA \
    --cwd output \
    --bfile data/genotypes.bed \
    --genoFile data/genotypes21_22.bed \
    --phenoFile data/phenotypes.txt \
    --formatFile data/fastGWA_template.yml \
    --phenoCol BMI \
    --covarCol SEX \
    --qCovarCol AGE \
    --numThreads 1 \
    --bgenMinMAF 0.001 \
    --bgenMinINFO 0.1 \
    --parts 2 \
    --p-filter 1 \
    $JOB_OPT
```

### GMMAT example command

```
sos run LMM.ipynb GMMAT \
    --cwd gemma_output \
    --bfile 100K_chr22.bed \
    --genoFile 100K_chr22.bed \
    --phenoFile MWE_pheno_new1.txt \
    --formatFile gmmat_template.yml \
    --phenoCol AD \
    --covarCol SEX \
    --covarMaxLevels 10 \
    --qCovarCol AGE \
    --numThreads 5 \
    --bgenMinMAF 0.001 \
    --bgenMinINFO 0.1 \
    --lmm-option none \
    --p-filter 1 \
    --geno_filter 0.0005 \
    --nperbatch 10 \
    $JOB_OPT
```

### Run workflow on a cluster

The shell variable `JOB_OPT` was set to `-j 2`. That is, run 2 jobs in parallel on a local computer (each using 5 threads due to `--numThreads 5`).

On a cluster we use a job template, and configure `JOB_OPT` as follows:

```
JOB_OPT="-c farnam.yml -q farnam -J 40"
```

Here we use task queue `farnam` configured in file `farnam.yml`. We allow for at most 40 jobs in the cluster job queue.

## BoltLMM workflow implementation

To install from source code, follow the instructions here: https://data.broadinstitute.org/alkesgroup/BOLT-LMM/#x1-70002.2

- On a Linux machine, a binary executable is provided and can be used.
- Supporting files such as the LD score file and the genetic map file can be found [in the installation bundle](https://data.broadinstitute.org/alkesgroup/BOLT-LMM/downloads/BOLT-LMM_v2.3.4.tar.gz).
- For a complete description of the bolt commands, go to: http://manpages.ubuntu.com/manpages/eoan/en/man1/bolt.1.html.

**A note for developers**: it is important to have input and output for each step. Input files and output files are best derived from one another.

BOLT-LMM computes statistics for testing association between phenotypes and genotypes using a linear mixed model.


```
--bfile = accepts genotype files in PLINK binary format (.fam, .bed, .bim)
--geneticMapFile = Oxford-format file for interpolating genetic distances: tables/genetic_map_hg##.txt.gz
--phenoFile = phenotype file (header required; FID and IID must be first two columns)
--phenoCol = phenotype columns header
--covarFile = covariate file (header required; FID and IID must be first two columns)
--covarCol = categorical covariate column(s); for >1, use multiple --covarCol and/or {i:j} expansion
--qCovarCol = quantitative covariate column(s); for >1, use multiple --qCovarCol and/or {i:j} expansion
--lmm = compute assoc stats under the inf model and with Bayesian non-inf prior (VB approx), if power gain expected
--modelSnps = file(s) listing SNPs to use in model (i.e., GRM) (default: use all non-excluded SNPs)
--LDscoresFile = LD Scores for calibration of Bayesian assoc stats: tables/LDSCORE.1000G_EUR.tab.g
--numThreads = number of computational threads
--statsFile = output file for assoc stats at PLINK genotypes
--bgenFile = file(s) containing Oxford BGEN-format genotypes to test for association
--sampleFile = file containing Oxford sample file corresponding to BGEN file(s)
--bgenMinMAF = MAF threshold on Oxford BGEN-format genotypes; lower-MAF SNPs will be ignored
--bgenMinINFO = INFO threshold on Oxford BGEN-format genotypes; lower-INFO SNPs will be ignored
--statsFileBgenSnps = output file for assoc stats at BGEN-format genotypes
```
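
The `{i:j}` expansion mentioned for `--covarCol` and `--qCovarCol` stands for a run of consecutively numbered columns. The sketch below illustrates what such a pattern denotes; it is not BOLT-LMM's own parser, whose handling of edge cases may differ, and `expand_range` is a hypothetical helper.

```python
import re

def expand_range(spec):
    """Expand 'PC{1:20}' into ['PC1', ..., 'PC20'].

    A specification without an {i:j} range is returned unchanged
    as a single-element list.
    """
    match = re.fullmatch(r"(.*)\{(\d+):(\d+)\}(.*)", spec)
    if match is None:
        return [spec]
    prefix, start, stop, suffix = match.groups()
    return [f"{prefix}{i}{suffix}" for i in range(int(start), int(stop) + 1)]

print(expand_range("PC{1:3}"))  # ['PC1', 'PC2', 'PC3']
print(expand_range("AGE"))      # ['AGE']
```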

Note that BOLT-LMM v2.3.4 accepts BGEN v1.2 files only with 8-bit encoding, as stated in the warning quoted below:

*WARNING: The BGEN format comprises a few sub-formats; we have only implemented support for the versions (and specific data layouts) used in the UK Biobank N=150K and N=500K releases. In particular, for BGEN v1.2, BOLT-LMM currently only supports the 8-bit encoding used for the UK Biobank N=500K data. (Starting with BOLT-LMM v2.3.3, missing values in BGEN v1.2 data are now allowed.)*

In [None]:
# Run BOLT analysis
[boltlmm_1]
# Maximum categories of covariates allowed 
parameter: covarMaxLevels = int
# Path to LDscore file for reference population
parameter: LDscoresFile = path
# Path to genetic map file used to interpolate genetic map coordinates from SNP physical (base pair) positions
parameter: geneticMapFile = path
# LMM option: lmm, lmmInfOnly, and lmmForceNonInf
parameter: lmm_option = 'lmm'
depends: LDscoresFile, geneticMapFile
input: genoFile, group_by = 1
output: f'{cwd}/cache/{_input:bn}.{phenoFile:bn}_{phenoCol[0]}.boltlmm.snp_stats.gz'
file_options=f"--bfile {bfile:n} --bgenFile={_input} --bgenMinMAF={bgenMinMAF} --bgenMinINFO={bgenMinINFO} --sampleFile={sampleFile} --statsFileBgenSnps={_output} --statsFile={_output:nn}.ref_stats.gz " if _input.suffix == ".bgen" else f"--bfile={_input:n} --statsFile={_output} "
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', volumes = [f"{cwd:a}:{cwd:a}"]
    bolt \
    --phenoFile=${phenoFile} \
    --phenoCol=${phenoCol[0]} \
    --covarFile=${covarFile} \
    ${' '.join(['--covarCol=%s ' % x for x in covarCol if x is not None])} \
    --covarMaxLevels=${covarMaxLevels} \
    ${' '.join(['--qCovarCol=%s ' % x for x in qCovarCol if x is not None])} \
    --LDscoresFile=${LDscoresFile} \
    --geneticMapFile=${geneticMapFile} \
    ${('--' + lmm_option) if lmm_option in ['lmm', 'lmmInfOnly', 'lmmForceNonInf'] else ''} \
    ${file_options} \
    --numThreads=${numThreads} \
    --verboseStats 

bash: expand = "${ }", active = (_index != 0)
    # remove redundant reference summary stats file
    rm -f ${_output:nn}.ref_stats.gz

bash: expand = "${ }", active = (_index == 0)
    # rename reference summary stats file
    if [ -f ${_output:nn}.ref_stats.gz ]; then
      mv ${_output:nn}.ref_stats.gz ${cwd}/${phenoFile:bn}_${phenoCol[0]}.boltlmm.ref_stats.gz
    else
       echo "File does not exist."
    fi

## fastGWA workflow implementation

Installation instructions can be found at https://cnsgenomics.com/software/gcta/#Download. On a Linux machine, a binary executable is provided and can be used.

Documentation: https://cnsgenomics.com/software/gcta/#fastGWA

### Step 1: Creation of the GRM
The GRM only needs to be created once for all the phenotypes to analyze with the same genotypic data. In this step the GRM calculation is divided into multiple parts for a faster computational time.
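
Each part of the GRM gets a zero-padded label that is used in the output file names. The sketch below reproduces the labelling expression from the `gcta_1` step in plain Python so it can be inspected on its own; `part_labels` is a hypothetical name used only here.

```python
def part_labels(parts):
    """Return labels such as '100_001' ... '100_100' for a GRM split into `parts` pieces.

    The part index is zero-padded to the number of digits in `parts`,
    so the labels sort in numerical order.
    """
    width = len(str(parts))
    return [f"{parts}_{i:0{width}d}" for i in range(1, parts + 1)]

labels = part_labels(100)
print(labels[0], labels[-1])  # 100_001 100_100
print(part_labels(2))         # ['2_1', '2_2']
```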

In [None]:
# Partition the GRM into 100 parts and allocate 48G of memory to each job
[gcta_1]
# Number of parts the GRM calculation is to be partitioned
parameter: parts = 100
part_number = [f'{parts}_{format(x+1, "0" + str(len(str(parts))))}' for x in range(parts)]
input: bfile, for_each = 'part_number'
output: f'{cwd}/cache/{_input:bn}.part_{_part_number}.grm.bin', 
        f'{cwd}/cache/{_input:bn}.part_{_part_number}.grm.N.bin', 
        f'{cwd}/cache/{_input:bn}.part_{_part_number}.grm.id'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '48G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container=container_lmm, expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    gcta64 \
    --bfile ${_input[0]:n} \
    --make-grm-part ${parts} ${_index+1} \
    --thread-num ${numThreads} \
    --out ${_output[0]:nnn}

### Step 2: Combine all the GRM parts into one file

In [None]:
# Merge all the parts together (Linux, Mac)
[gcta_2]
input: group_by = 'all'
output: f'{cwd}/{bfile:bn}.grm.bin', 
        f'{cwd}/{bfile:bn}.grm.N.bin', 
        f'{cwd}/{bfile:bn}.grm.id' 
task: trunk_workers = 1, trunk_size = job_size, walltime = '2h', mem = '6G', cores = 1, tags = f'{step_name}_{_output[0]:bn}'
bash: container=container_lmm, expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    # here input is results all parts each having 3 items. We need to get the corresponding every other 3 items
    cat ${paths(_input[::3])} > ${_output[0]}
    cat ${paths(_input[1::3])} > ${_output[1]}
    cat ${paths(_input[2::3])} > ${_output[2]}
    #rm ${paths(_input)}
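
The stride-3 slicing used in the merge step relies on every part contributing its three files in the same order. A small sketch with hypothetical file names (three parts, made-up `test` prefix) shows which items each slice picks up:

```python
# Hypothetical file names for a GRM split into 3 parts; each part
# contributes three files in a fixed order, as in the gcta_1 outputs.
parts = 3
flat = []
for i in range(1, parts + 1):
    prefix = f"test.part_{parts}_{i}"
    flat += [f"{prefix}.grm.bin", f"{prefix}.grm.N.bin", f"{prefix}.grm.id"]

bin_files = flat[::3]   # items 0, 3, 6: the .grm.bin files
n_files = flat[1::3]    # items 1, 4, 7: the .grm.N.bin files
id_files = flat[2::3]   # items 2, 5, 8: the .grm.id files

print(bin_files)
print(n_files)
print(id_files)
```

Because the slices preserve the order of the parts, concatenating each group with `cat` keeps the parts in sequence.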

### Step 3: Make a sparse GRM to be used in the association analyses

In [None]:
# Make a sparse GRM from the merged full-dense GRM
[gcta_3]
output: f'{cwd}/{bfile:bn}.grm.sp' 
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '48G', cores = 1, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    gcta64 --grm ${_output:nn} --make-bK-sparse 0.05 --out ${_output:nn}

### Step 4: Run the single variant association analysis using FastGWA

In [None]:
# fastGWA mixed model (based on the sparse GRM generated above)
[fastGWA_1]
parameter: grmFile = path(f'{cwd}/{bfile:bn}.grm.sp')
depends: grmFile
# extract and prepare phenotype & covariate files
import pandas as pd
dat = pd.read_csv(phenoFile, header=0, delim_whitespace=True)
if len(phenoCol) > 0:
    dat.to_csv(f"{cwd}/{phenoFile:bn}.fastGWA_phenotype", sep=' ', index=False, columns = ['FID', 'IID'] + phenoCol)
dat = pd.read_csv(covarFile, header=0, delim_whitespace=True)
if len(covarCol) > 0:
    dat.to_csv(f"{cwd}/{phenoFile:bn}.fastGWA_covar", sep=' ', index=False, columns = ['FID', 'IID'] + covarCol)
if len(qCovarCol) > 0:
    dat.to_csv(f"{cwd}/{phenoFile:bn}.fastGWA_qcovar", sep=' ', index=False, columns = ['FID', 'IID'] + qCovarCol)

input: genoFile, group_by = 1
input_options = f"--bgen {_input} --info {bgenMinINFO} --sample {sampleFile}" if _input.suffix == ".bgen" else f"--bfile {_input:n}"
output: f'{cwd}/cache/{_input:bnn}.{phenoFile:bn}.fastGWA.gz'
fail_if(_input.suffix == '.bgen' and not path(f'{_input}.bgi').is_file(), msg = f'Cannot find file ``{_input}.bgi``. Please generate it using command ``bgenix -g {_input} -index``.')
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '5G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    gcta64 \
    ${input_options} \
    --grm-sparse ${grmFile:nn} \
    --maf ${bgenMinMAF} \
    --fastGWA-mlm \
    --pheno ${cwd}/${phenoFile:bn}.fastGWA_phenotype \
    --qcovar ${cwd}/${phenoFile:bn}.fastGWA_qcovar \
    --covar ${cwd}/${phenoFile:bn}.fastGWA_covar \
    --threads ${numThreads} \
    --out ${_output:nn}\
    && gzip -f --best ${_output:n}

Output from each step:

1. **gcta_1 for x number of parts (in the example above x=100, so this step will create 400 files):**
* test.part_{_part_number}.grm.bin
* test.part_{_part_number}.grm.N.bin 
* test.part_{_part_number}.grm.id
* test.part_{_part_number}.log (the program creates the log file so there is no need for .stderr and .stdout)

2. **gcta_2 this step creates 5 output files:**
* test.grm.bin (it is a binary file which contains the lower triangle elements of the GRM)
* test.grm.N.bin (it is a binary file which contains the number of SNPs used to calculate the GRM)
* test.grm.id (no header line; columns are family ID and individual ID, see above)
* test.grm.stderr
* test.grm.stdout

3. **gcta_3 this step creates 3 output files:**
* test.grm.sp (sparse GRM made from the dense GRM)
* test.grm.sp.stderr
* test.grm.sp.stdout

4. **fastGWA this step creates 2 output files per chromosome:**
* test{chr1:22}.fastGWA
* test{chr1:22}.fastGWA.log

## REGENIE workflow implementation

Documentation can be found [here](https://rgcgithub.github.io/regenie/). Binary and quantitative traits should be analyzed separately. 
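
Since binary and quantitative traits are analyzed separately, it can help to check how a phenotype column is coded before choosing the trait type. The sketch below assumes the binary coding described for the `trait` parameter in this notebook (0 = control, 1 = case, NA = missing); `is_binary_trait` is a hypothetical helper, not part of REGENIE or of the pipeline.

```python
def is_binary_trait(values, missing=("NA",)):
    """Return True if all non-missing values are '0' or '1'.

    `values` are the raw strings of one phenotype column; entries
    listed in `missing` are ignored. A column with no observed
    values at all is reported as not binary.
    """
    observed = {v for v in values if v not in missing}
    return bool(observed) and observed <= {"0", "1"}

print(is_binary_trait(["0", "1", "NA", "1"]))   # True
print(is_binary_trait(["23.4", "27.1", "NA"]))  # False
```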

In [None]:
# Select the SNPs and samples to be used based on maf, geno, hwe and mind options
[PLINK_QC]
parameter: maf_filter = 0.0
parameter: geno_filter = 0.0
parameter: hwe_filter = 0.0
parameter: mind_filter = 0.0
input: bfile
output: f'{cwd}/cache/{bfile:bn}.qc_pass.id', f'{cwd}/cache/{bfile:bn}.qc_pass.snplist' 
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout' 
    plink2 \
      --bfile ${bfile:n} --mac 1 \
      ${('--maf %s' % maf_filter) if maf_filter > 0 else ''} ${('--geno %s' % geno_filter) if geno_filter > 0 else ''} ${('--hwe %s' % hwe_filter) if hwe_filter > 0 else ''} ${('--mind %s' % mind_filter) if mind_filter > 0 else ''} \
      --write-snplist --write-samples --no-id-header \
      --threads ${numThreads} \
      --out ${_output[0]:n} 
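
In the `PLINK_QC` step a filter is passed to `plink2` only when its threshold is greater than zero. The following sketch restates that logic as a standalone function so the resulting flags can be inspected; `plink_qc_flags` is a hypothetical name used for illustration.

```python
def plink_qc_flags(maf_filter=0.0, geno_filter=0.0, hwe_filter=0.0, mind_filter=0.0):
    """Build the optional plink2 filter flags.

    A threshold of 0 (the default) means the corresponding
    filter is left out of the command line entirely.
    """
    filters = [("--maf", maf_filter), ("--geno", geno_filter),
               ("--hwe", hwe_filter), ("--mind", mind_filter)]
    return " ".join(f"{flag} {value}" for flag, value in filters if value > 0)

print(repr(plink_qc_flags()))                                   # ''
print(repr(plink_qc_flags(maf_filter=0.001)))                   # '--maf 0.001'
print(repr(plink_qc_flags(maf_filter=0.01, mind_filter=0.1)))   # '--maf 0.01 --mind 0.1'
```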

### Step 1: fitting the null

In [None]:
# Run REGENIE step 1: fitting the null
[regenie_1,regenie_burden_1]
# Size of the genotype blocks to be used 
parameter: bsize = 400
# Path to temporarily store block predictions
parameter: lowmem_dir = cwd
# Specify that traits are binary with 0=control,1=case,NA=missing (default is quantitative)
parameter: trait = 'bt'
# extract and prepare phenotype & covariate files
import pandas as pd
import numpy as np
dat = pd.read_csv(phenoFile, header=0, delim_whitespace=True, dtype=str)
dat = dat.replace(to_replace =np.nan, value ="NA")
if len(phenoCol) > 0:    
    dat.to_csv(f"{cwd}/{phenoFile:bn}.regenie_phenotype", sep=' ', index=False, columns = ['FID', 'IID'] + phenoCol)
dat = pd.read_csv(covarFile, header=0, delim_whitespace=True)
if len(covarCol) > 0 or len(qCovarCol) > 0:
    dat = dat.dropna(subset=covarCol)
    dat = dat.dropna(subset=qCovarCol)
    dat = dat.replace(to_replace=np.nan, value="NA")
    dat1 = pd.DataFrame(dat, columns = ['FID','IID'] + covarCol)
    dat1 = dat1.astype(int)
    dat2 = pd.DataFrame(dat, columns = ['IID'] + qCovarCol)
    merged_left = pd.merge(left=dat1, right=dat2, how='left', left_on='IID', right_on='IID')
    merged_left.to_csv(f"{cwd}/{phenoFile:bn}.regenie_covar", sep=' ', index=False)
depends: f'{cwd}/cache/{bfile:bn}.qc_pass.snplist', f'{cwd}/cache/{bfile:bn}.qc_pass.id'
input: geno = bfile, pheno = f"{cwd}/{phenoFile:bn}.regenie_phenotype", covar = f"{cwd}/{phenoFile:bn}.regenie_covar", qc = output_from("PLINK_QC")
output: f'{cwd}/{phenoFile:bn}_' + "_".join([x for x in phenoCol]) + '.regenie_pred.list'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h', mem = '15G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container=container_lmm, expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', volumes = [f"{lowmem_dir:a}:{lowmem_dir:a}"]
    regenie \
      --step 1 \
      --bed ${_input["geno"]:n} \
      --phenoFile ${_input["pheno"]} \
      --covarFile ${_input["covar"]} \
      --keep ${_input["qc"][0]} \
      --extract ${_input["qc"][1]} \
      ${('--' + trait) if trait in ['bt'] else ''} \
      --bsize ${bsize} \
      --lowmem --lowmem-prefix ${lowmem_dir:a}/${_output:bn} \
      --threads ${numThreads} \
      --out ${_output:nn}.regenie

### Step 2: association analysis

In [None]:
# Run REGENIE step 2: association analysis
[regenie_2]
# Size of the genotype blocks to be used 
parameter: bsize = 400
# Minimum allele count to be used
parameter: minMAC = int
parameter: trait = 'bt'
input: genoFile, group_by = 1, group_with = dict(info=[(path(f'{cwd}/{phenoFile:bn}_' + "_".join([x for x in phenoCol]) + '.regenie_pred.list'))] * len(genoFile))
input_options = f"--bgen {_input} --sample {sampleFile}" if _input.suffix == ".bgen" else f"--bed {_input:n}"
output: [f'{cwd}/cache/{_input:bn}_'+ str(phenoCol[i]) + '.regenie.gz' for i in range(len(phenoCol))]
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h', mem = '15G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash:container=container_lmm, expand = "${ }", stderr = f'{cwd}/cache/{_input:bn}.stderr', stdout = f'{cwd}/cache/{_input:bn}.stdout', volumes = [f"{cwd:a}:{cwd:a}"]
    set -e
    regenie \
     --step 2 \
     ${input_options} \
     --phenoFile ${cwd}/${phenoFile:bn}.regenie_phenotype \
     --covarFile ${cwd}/${phenoFile:bn}.regenie_covar \
     --phenoColList ${','.join(phenoCol)} \
     ${('--' + trait) if trait in ['bt'] else ''} \
     --firth 0.01 --approx \
     --pred ${_input.info} \
     --bsize ${bsize} \
     --minMAC ${minMAC} \
     --minINFO ${bgenMinINFO}\
     --split \
     --threads ${numThreads} \
     --out ${cwd}/cache/${_input:bn} && \
     gzip -f --best ${_output:n}

### REGENIE burden test

In [1]:
# Run regenie for burden tests
[regenie_burden_2]
# Specify that traits are binary with 0=control,1=case,NA=missing (default is quantitative)
parameter: trait = 'bt'
# Size of the genotype blocks to be used 
parameter: bsize = 400
# Annotation file format: variantID, gene and functional annotation (space/tab delimited)
parameter: anno_file = path
# This file lists variants within each set/gene to use when building masks. Format: set/gene name, chromosome, physical position of the set/gene, followed by a comma-separated list of variants included in the set/gene.
parameter: set_list = path
# Select specific genes/sets to test
parameter: keep_gene = path
# Allele frequency file. format: variantId, alternative allele frequency
#parameter: aaf_file = path
# Select the annotations to be used in the mask file. Format: mask# annotation type
parameter: mask_file = path
# Select the upper MAF to generate masks
parameter: aaf_bins = 0.05
# The way in which the alternative alleles are counted
parameter: build_mask = 'max'
input: genoFile, group_by = 1, group_with = dict(info=[(path(f'{cwd}/{phenoFile:bn}_' + "_".join([x for x in phenoCol]) + '.regenie_pred.list'))] * len(genoFile))
input_options = f"--bgen {_input} --sample {sampleFile}" if _input.suffix == ".bgen" else f"--bed {_input:n}"
output: [f'{cwd}/cache/{_input:bn}_burden_'+ str(phenoCol[i]) + '.regenie.gz' for i in range(len(phenoCol))]
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '15G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash:container=container_lmm, expand = "${ }", stderr = f'{cwd}/cache/{_input:bn}.stderr', stdout = f'{cwd}/cache/{_input:bn}.stdout', volumes = [f"{cwd:a}:{cwd:a}"]
    set -e
    regenie \
      --step 2 \
      ${input_options} \
      --phenoFile ${cwd}/${phenoFile:bn}.regenie_phenotype \
      --covarFile ${cwd}/${phenoFile:bn}.regenie_covar \
      --phenoColList ${','.join(phenoCol)} \
      ${('--' + trait) if trait in ['bt'] else ''} \
      --firth --approx \
      --pred ${_input.info} \
      --anno-file ${anno_file} \
      --set-list ${set_list} \
      --extract-sets ${keep_gene} \
      --mask-def ${mask_file} \
      --aaf-bins ${aaf_bins} \
      --write-mask \
      ${('--build-mask ' + build_mask) if build_mask in ['max','sum','comphet'] else ''} \
      --bsize ${bsize} \
      --check-burden-files \
      --out  ${cwd}/cache/${_input:bn}_burden && \
      gzip -f --best ${_output:n}

## SAIGE workflow implementation

We need to create a conda environment to install SAIGE on Yale's HRC cluster. Instructions are available at https://github.com/weizhouUMICH/SAIGE

### Step 1: fitting the null

In [None]:
# Fit SAIGE null model
[SAIGE_1]
# trait type, eg 'binary' or 'quantitative'
parameter: trait_type = str
# Whether to use LOCO or not
parameter: loco = 'TRUE'
# Name of the sample column
parameter: sampleCol='IID'
#Path specific to SAIGE script
parameter: script_path = path('~/software/bin/step1_fitNULLGLMM.R')
# Inverse normalization only for non-normal quantitative traits
parameter: invNormalize = 'FALSE'
input: bfile, phenoFile
output: f'{cwd}/{bfile:bn}.{phenoFile:bn}.SAIGE.rda', f'{cwd}/{bfile:bn}.{phenoFile:bn}.SAIGE.varianceRatio.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', template_name='conda', env_name='RSAIGE'
    Rscript ${script_path} \
        --plinkFile=${_input[0]:n} \
        --phenoFile=${_input[1]} \
        --phenoCol=${phenoCol[0]} \
        ${('--covarColList=' + ','.join(covarCol + qCovarCol)) if len(covarCol + qCovarCol) else ''} \
        --sampleIDColinphenoFile=${sampleCol} \
        --traitType=${trait_type} \
        --outputPrefix=${_output[0]:n} \
        --nThreads=${numThreads} \
        --LOCO=${loco} \
        --invNormalize=${invNormalize} \
        --IsOverwriteVarianceRatioFile=TRUE

### Step 2: perform single variant association test

In [None]:
# Compute SAIGE statistics
[SAIGE_2]
# Minimum allele count to be used
parameter: bgenMinMAC = 4
#Specify whether to output allele frequencies in cases and controls
parameter: af_caco = 'TRUE'
#Path specific to SAIGE script
parameter: script_path = path('~/software/bin/step2_SPAtests.R')
# Fix SAIGE non-standard sample file input
import pandas as pd
dat = pd.read_csv(sampleFile, header=0, skiprows=lambda x: x == 1, delim_whitespace=True)
dat.to_csv(f"{cwd}/{sampleFile:bn}.SAIGE_sample", sep=' ', index=False, header=False, columns = [dat.columns[0]])

input: for_each='genoFile'
# Plain {...} is used here because this is a Python f-string, not the bash script with ${ } expansion
input_options = f"--bgenFile={_genoFile} --bgenFileIndex={_genoFile}.bgi --sampleFile={cwd}/{sampleFile:bn}.SAIGE_sample --minInfo={bgenMinINFO}" if _genoFile.suffix == ".bgen" else f"--plinkFile={_genoFile:n}"
output: f'{cwd}/cache/{_genoFile:bn}.{phenoFile:bn}.SAIGE.gz'
fail_if(_genoFile.suffix == '.bgen' and not path(f'{_genoFile}.bgi').is_file(), msg = f'Cannot find file ``{_genoFile}.bgi``. Please generate it using command ``bgenix -g {_genoFile} -index``.')
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', template_name='conda', env_name='RSAIGE'
    Rscript ${script_path} \
        ${input_options}\
        --minMAF=${bgenMinMAF} \
        --minMAC=${bgenMinMAC} \
        --GMMATmodelFile=${_input[0]} \
        --varianceRatioFile=${_input[1]} \
        --SAIGEOutputFile=${_output:n} \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=${af_caco} \
        && sed '1 s/rsid //' -i ${_output:n} \
        && gzip -f --best ${_output:n} \
        && mv ${_output:n}.bgen.txt.gz ${_output}
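
The sample-file conversion at the start of this step skips the second line of the `.sample` file and writes only the first column, without a header. A standard-library sketch of the same idea follows; it assumes the usual two header lines of an Oxford sample file (column names, then column types), the toy IDs are made up, and `sample_ids` is a hypothetical helper.

```python
import io

def sample_ids(handle):
    """Return the first-column IDs of an Oxford-style sample file.

    Line 0 holds the column names and line 1 the column types;
    sample records start on line 2.
    """
    lines = handle.read().splitlines()
    return [line.split()[0] for line in lines[2:] if line.strip()]

# Toy sample file with made-up IDs
toy = io.StringIO("ID_1 ID_2 missing sex\n0 0 0 D\n1001 1001 0 1\n1002 1002 0 2\n")
ids = sample_ids(toy)
print("\n".join(ids))
```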

Output from each step:

**From step 1**

1. Model file: `${_output}.rda`

2. Association result file for the subset of randomly selected markers: `${_output}.results.txt`

3. Variance ratio file: `${_output}.varianceRatio.txt`

**From step 2**

1. A file with association results for each chromosome (Note: these are given with respect to Allele 2)

## GMMAT workflow implementation
Documentation can be found [here](https://github.com/hanchenphd/GMMAT/blob/master/inst/doc/GMMAT.pdf)

### Step 1: Creation of the GRM 

In [None]:
#Calculate standardized GRM using GEMMA
[gemma_grm]
parameter: grmFile = path(f'{cwd}/{bfile:bn}.sXX.txt')
input: bfile
output: grmFile
task: trunk_workers = 1, trunk_size = job_size, walltime = '2h', mem = '6G', cores = 1, tags = f'{step_name}_{_output[0]:bn}'
bash: container='/mnt/mfs/statgen/containers/lmm.sif', expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout' 
    gemma \
    -bfile ${_input:n} \
    -gk 2 \
    -o ${_output:bnn} \
    -outdir ${grmFile:d}

### Step 2: Fitting the null

In [None]:
# Run GMMAT step1: fit a GLMM with covariate adjustment and random effects to account for population structure and family or cryptic relatedness
[GMMAT_null]
#use the standardized GRM file generated in the gemma_grm step
parameter: grmFile = path(f'{cwd}/{bfile:bn}.sXX.txt')
#name of the phenotype column
parameter: phenoCol = 'AD'
#a column in the phenotype data frame indicating the IDs of the samples
parameter: idCol = 'IID'
input: phenoFile, f'{bfile:n}.fam', grmFile
output: f'{cwd}/{bfile:bn}.{phenoFile:bn}.GMMAT.rds'
R: container='/mnt/mfs/statgen/containers/lmm.sif',expand='${ }', stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    library('GMMAT')
    library('data.table')
    library('dplyr')
   
    #Prepare phenotype and covariates in an R data frame
    pheno = fread(${_input[0]:r}, header = TRUE)
    #Phenotypes are currently coded as 1, 2 and -9; recode them as 0, 1 and NA
    pheno$${phenoCol} = recode(pheno$${phenoCol}, `2` = 1, `1` = 0, `-9` = NA_real_)
    #Prepare GRM file in a R data frame
    GRM = as.matrix(fread(${_input[2]:r}, header = FALSE))
    #Extract IIDs from .fam file
    id_vector = fread(${_input[1]:r}, header = FALSE)[,2]
    #make the GRM colnames and rownames using the actual IID
    colnames(GRM) = t(id_vector)
    rownames(GRM) = t(id_vector)

    fit_null = glmmkin(${phenoCol} ~ ${"+".join(covarCol + qCovarCol)} , 
                           data = pheno, 
                           kins = GRM, 
                           id = '${idCol}',
                           family = binomial(link = "logit"))
    saveRDS(fit_null, ${_output:r})
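
The phenotype recoding performed before fitting the null model (PLINK-style 1 = control, 2 = case, -9 = missing, mapped to 0, 1 and missing) can be sketched as follows. The function name `recode_case_control` is hypothetical, and raising an error on any other code is a design choice made for this illustration only.

```python
def recode_case_control(values, missing_code=-9):
    """Map 1 -> 0 (control), 2 -> 1 (case), missing_code -> None.

    Assumption: any other value is treated as a data error.
    """
    mapping = {1: 0, 2: 1, missing_code: None}
    recoded = []
    for value in values:
        if value not in mapping:
            raise ValueError(f"Unexpected phenotype code: {value!r}")
        recoded.append(mapping[value])
    return recoded


print(recode_case_control([1, 2, -9, 2]))
```

Failing loudly on unexpected codes makes it harder for a quantitative trait to be passed to the binomial model by mistake.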

### Step 3: Perform single variant score test for common variants

In [None]:
#Run GMMAT step2: single variant score test(based on the null model built above)
[GMMAT]
#the maximum missing rate allowed for a variant to be included
parameter: geno_filter = 0.01
#how many SNPs should be tested in a batch
parameter: nperbatch = 100
depends:  f'{cwd}/{bfile:bn}.{phenoFile:bn}.GMMAT.rds'
input: genoFile, group_by = 1
output: f'{cwd}/{_input:bn}.{phenoFile:bn}.gmmat.score.txt.gz'
R:  container='/mnt/mfs/statgen/containers/lmm.sif', expand='${ }', stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    library('GMMAT')
    null_model = readRDS(${_depends:r})
    glmm.score(null_model, 
               infile = ${_input:nr}, 
               outfile = ${_output:nr}, 
               MAF.range = c(${bgenMinMAF},1), 
               miss.cutoff = ${geno_filter},
               nperbatch = ${nperbatch})
bash: container=container_lmm,expand='${ }', stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    gzip ${cwd}/${_input:bn}.${phenoFile:bn}.gmmat.score.txt
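
The effect of the `MAF.range` and `miss.cutoff` arguments can be illustrated with a small per-variant filter. This is a sketch under stated assumptions rather than GMMAT's code: the name `passes_filters` is hypothetical, genotypes are allele counts with `None` for missing calls, and the inclusive handling of the boundaries is assumed.

```python
def passes_filters(genotypes, maf_range=(0.001, 0.5), miss_cutoff=0.01):
    """Return True if a variant would be kept for testing.

    genotypes: allele counts (0/1/2) per sample, None for a missing call.
    Assumption: both the MAF bounds and the missing-rate cutoff are inclusive.
    """
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    miss_rate = 1 - len(called) / len(genotypes)
    if miss_rate > miss_cutoff:
        return False
    allele_freq = sum(called) / (2 * len(called))
    maf = min(allele_freq, 1 - allele_freq)
    return maf_range[0] <= maf <= maf_range[1]


print(passes_filters([0, 1, 2, 0]))      # common variant, no missing calls
print(passes_filters([0, 0, 0, 0]))      # monomorphic
print(passes_filters([0, 1, None, 2]))   # 25% missing
```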

## SMMAT workflow implementation
Documentation can be found [here](https://github.com/hanchenphd/GMMAT/blob/master/inst/doc/GMMAT.pdf)

### Step 1: Creation of the GRM 

### Step 2: Fitting the null

In [None]:
# Run SMMAT step1: fit a GLMM with covariate adjustment and random effects to account for population structure and family or cryptic relatedness
[SMMAT_null]
#use the standardized GRM file generated in the gemma_grm step
parameter: grmFile = path(f'{cwd}/{bfile:bn}.sXX.txt')
#name of the phenotype column
parameter: phenoCol = 'AD'
#a column in the phenotype data frame indicating the IDs of the samples
parameter: idCol = 'IID'
input: phenoFile, f'{bfile:n}.fam', grmFile
output: f'{cwd}/{bfile:bn}.{phenoFile:bn}.GMMAT.rds'
R: container='/mnt/mfs/statgen/containers/lmm.sif',expand='${ }', stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    library('GMMAT')
    library('data.table')
    library('dplyr')
   
    #Prepare phenotype and covariates in an R data frame
    pheno = fread(${_input[0]:r}, header = TRUE)
    #Phenotypes are currently coded as 1, 2 and -9; recode them as 0, 1 and NA
    pheno$${phenoCol} = recode(pheno$${phenoCol}, `2` = 1, `1` = 0, `-9` = NA_real_)
    #Prepare GRM file in a R data frame
    GRM = as.matrix(fread(${_input[2]:r}, header = FALSE))
    #Extract IIDs from .fam file
    id_vector = fread(${_input[1]:r}, header = FALSE)[,2]
    #make the GRM colnames and rownames using the actual IID
    colnames(GRM) = t(id_vector)
    rownames(GRM) = t(id_vector)

    fit_null = glmmkin(${phenoCol} ~ ${"+".join(covarCol + qCovarCol)} , 
                           data = pheno, 
                           kins = GRM, 
                           id = '${idCol}',
                           family = binomial(link = "logit"))
    saveRDS(fit_null, ${_output:r})

### Step 3: Perform variant set burden test for rare variants

In [None]:
# Run SMMAT step 2: variant set burden tests (based on the null model built above)
[SMMAT]
input: f'{bfile:n}.bed', f'{bfile:n}.fam', f'{bfile:n}.bim', groupFile
depends: f'{cwd}/{bfile:bn}.{phenoFile:bn}.GMMAT.rds'
output: f'{cwd}/{_input[0]:bn}.{phenoFile:bn}.smmat.burden.txt.gz'
R:  container='/mnt/mfs/statgen/containers/lmm.sif', expand='${ }', stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    library('GMMAT')
    library('SNPRelate')
    # Convert the PLINK files (bed, fam, bim) to GDS and pass the result to SMMAT as geno.file
    gds_file = '${cwd}/${_input[0]:bn}.gds'
    snpgdsBED2GDS(${_input[0]:r}, ${_input[1]:r}, ${_input[2]:r}, gds_file)
    null_model = readRDS(${_depends:r})
    burden.test = SMMAT(null_model,
                        geno.file = gds_file,
                        group.file = ${_input[3]:r},
                        MAF.range = c(1e-7, ${maf_max_filter}),
                        miss.cutoff = ${geno_filter},
                        method = 'davies',
                        tests = 'B')
    write.table(burden.test, ${_output:nr}, sep = '\t', quote = F, col.names = T, row.names = F)
bash: container=container_lmm,expand='${ }', stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    gzip -f ${_output:n}
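
To make the idea of a variant set burden concrete, the sketch below collapses the variants of each set into one weighted score per sample. It is an illustration only: SMMAT's burden test is a score test built on the fitted null model, which is not reproduced here. The helper names `read_groups` and `burden_scores` are hypothetical, and the group file is assumed to be a header-less, tab-delimited file with six columns (set name, chromosome, position, reference allele, alternate allele, weight), as described in the GMMAT manual.

```python
from collections import defaultdict


def read_groups(lines, sep="\t"):
    """Parse group-file lines into {set name: [(variant key, weight), ...]}."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        name, chrom, pos, ref, alt, weight = line.rstrip("\n").split(sep)
        groups[name].append(((chrom, int(pos), ref, alt), float(weight)))
    return dict(groups)


def burden_scores(groups, dosages, n_samples):
    """Weighted sum of alternate-allele dosages per sample for each set.

    dosages: {variant key: list of dosages, one per sample}.
    Variants absent from `dosages` contribute nothing.
    """
    scores = {}
    for name, variants in groups.items():
        total = [0.0] * n_samples
        for key, weight in variants:
            for i, dosage in enumerate(dosages.get(key, [0] * n_samples)):
                total[i] += weight * dosage
        scores[name] = total
    return scores


group_lines = [
    "GENE1\t1\t100\tA\tG\t1",
    "GENE1\t1\t200\tC\tT\t2",
    "GENE2\t2\t300\tG\tA\t1",
]
toy_dosages = {
    ("1", 100, "A", "G"): [0, 1, 2],
    ("1", 200, "C", "T"): [1, 0, 0],
    ("2", 300, "G", "A"): [0, 0, 1],
}
scores = burden_scores(read_groups(group_lines), toy_dosages, n_samples=3)
print(scores)
```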

Output from each step:

**From step 1**

Model file: `${_output}` (the fitted null model, saved as an `.rds` file)


**From step 2**

A file with association results (score statistics, variance of the score, score test p-value) for each chromosome (Note: these are given with respect to Allele 2)

## Merge results

In [1]:
# Merge results and log files
[boltlmm_2, fastGWA_2, SAIGE_3, regenie_3, regenie_burden_3, GMMAT_1]
parameter:reverse_log_p = False
depends: formatFile
input: group_by = lambda x: [x[i::len(phenoCol)] for i in range(len(phenoCol))], group_with='phenoCol'
output: f'{cwd}/{phenoFile:bn}_{_phenoCol}.{step_name.rsplit("_",1)[0]}.snp_stats.gz',
        f'{cwd}/{phenoFile:bn}_{_phenoCol}.{step_name.rsplit("_",1)[0]}.snp_counts.txt'
task: trunk_workers = 1, trunk_size = 1, walltime = '1h', mem = '20G', cores = 1, tags = f'{step_name}_{_output[0]:bn}'
python: container=container_lmm, expand ='${ }'
    import gzip
    import pandas as pd
    if ${formatFile.is_file()}:
        output = '${_output[0]:n}' + '_original_columns' + '${_output[0]:x}'
    else:
        output = '${_output[0]}'
   
    data = pd.concat([pd.read_csv(f, compression='gzip', header=0, delim_whitespace=True, quotechar='"', comment='#') for f in [${_input:r,}]], ignore_index=True)
    data.to_csv(output, compression='gzip', sep='\t', header = True, index = False)
    # unify output format
    if ${formatFile.is_file()} or ${reverse_log_p}:
        sumstats = pd.read_csv(output, compression='gzip', header=0, delim_whitespace=True, quotechar='"')  
        if ${formatFile.is_file()}:
            import yaml
            with open(${formatFile:r}, 'r') as f:
                config = yaml.safe_load(f)
            # keep only the columns listed in the format file, then rename them
            try:
                sumstats = sumstats.loc[:,list(config.values())]
            except KeyError:
                raise ValueError(f'According to ${formatFile}, input summary statistics should have the following columns: {list(config.values())}.')
            sumstats.columns = list(config.keys())
        if ${reverse_log_p}:
            # convert -log10(p) back to p
            sumstats['P'] = sumstats['P'].apply(lambda row: 10**-row)
        sumstats.to_csv(${_output[0]:r}, compression='gzip', sep='\t', header = True, index = False)        

bash: container=container_lmm, expand="$( )"
    # count result SNPs
    for f in $(_input); do echo "$f: `zcat $f | wc -l`"; done > $(_output[1])
    # merge stderr and stdout files
    for f in $(_input); do 
        for ext in stderr stdout log; do
            echo "$f $ext:"
            cat ${f%.gz}.$ext 2>/dev/null || true
            rm -f ${f%.gz}.$ext 
        done
    done > $(_output[0]:n).log
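
The `group_by` lambda in the merge step regroups the incoming per-chromosome result files so that each group holds all files for one phenotype. The sketch below reproduces that slicing on hypothetical file names and also shows the `reverse_log_p` back-transformation. It assumes the inputs arrive ordered by genotype file, with the phenotypes cycling in the same order within each genotype file.

```python
def group_by_phenotype(files, pheno_cols):
    """Split an interleaved file list into one list per phenotype."""
    k = len(pheno_cols)
    # Every k-th file, starting at offset i, belongs to phenotype i
    return [files[i::k] for i in range(k)]


def from_neg_log10(neg_log10_p):
    """Convert a -log10(p) value back to a p-value."""
    return 10 ** -neg_log10_p


# Hypothetical inputs: two chromosomes, two phenotypes
files = ["chr1.BMI.gz", "chr1.HEIGHT.gz", "chr2.BMI.gz", "chr2.HEIGHT.gz"]
groups = group_by_phenotype(files, ["BMI", "HEIGHT"])
print(groups)
print(from_neg_log10(2.0))
```

If the upstream steps emit files in a different order, the slices mix phenotypes, so the ordering assumption is worth checking whenever a new step is added to the merge list.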

## Manhattan and QQ plots

Before running the pipeline, make sure the necessary packages are installed. The plots are drawn with the R package `qqman`; see https://www.r-graph-gallery.com/101_Manhattan_plot.html for an example.


In [3]:
# Manhattan and QQ plots using `qqman`
[boltlmm_3, fastGWA_3, SAIGE_4, regenie_4, regenie_burden_4, GMMAT_2]
# Column name for BP
parameter: bp = 'POS'
# Column name for p-value
parameter: pval = 'P'
# Column name for SNP
parameter: snp = 'SNP'
# Plot only on p-values smaller than this
parameter: p_filter = '0.05'
# ylim set to 0 to use maximum -log10(p) in data
parameter: ylim = 0
sep = '\n\n---\n'
if 'fastGWA' in step_name:
    heritability = get_output(f'grep Heritability {_input[0]:n}.log | head -1').strip()
else:
    heritability = None
depends: phenoFile
input: group_by = 2, group_with = 'phenoCol'
output: manhattan = f'{_input[0]:nn}.manhattan.png',
        qq = f'{_input[0]:nn}.qq.png',
        annotated_manhattan = f'{_input[0]:nn}.manhattan_annotated.png',
        analysis_summary = f'{_input[0]:nn}.analysis_summary.md',
        plot_data = f'{_input[0]:nn}.plot_data.rds'       
task: trunk_workers = 1, trunk_size = job_size, walltime = '3h', mem = '48G', tags = f'{step_name}_{_output[0]:bn}'    
bash: container=container_lmm, expand = "${ }"
    echo '''---
    theme: base-theme
    style: |
      img {
        height: 80%;
        display: block;
        margin-left: auto;
        margin-right: auto;
      }
    ---    
    ''' > ${_output[3]}
    
R: container=container_lmm, expand='${ }', stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    # some summary statistics for phenotype
    pheno = read.table(${phenoFile:r}, header=T, sep = '\t')$${_phenoCol}
    if (length(unique(pheno))>2) {
      out = capture.output(summary(pheno))
    } else {
      out = as.data.frame(table(pheno))
      rownames(out) = c('n_ctrl', 'n_case')
      out = out[,2,drop=F]
    }
    write('# ${_phenoCol} result summary\n## Phenotype summary:\n```', ${_output[3]:r}, append = T)
    write.table(out, ${_output[3]:r}, append = T)
    write('${(" Heritability is %s" % heritability) if heritability is not None else ''}', ${_output[3]:r}, append = T)
    write("```", ${_output[3]:r}, append = T)

R: container=container_lmm, expand='${ }', stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    library('qqman')
    data <- read.table(gzfile('${_input[0]}'), sep='\t', header=T)
    lambda <- median(qchisq(1-data$${pval},1))/qchisq(0.5,1)
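    min_p <- min(data$${pval}, na.rm=TRUE)
    if (${ylim} == 0) {
      # ylim = 0 means "use the maximum -log10(p) in the data"; fall back to the smallest positive double when the minimum p-value is 0
      ylim <- abs(floor(log10(if (min_p != 0) min_p else 2.225074e-308)))
    } else {
      ylim <- ${ylim}
    }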
    # Creating manhattan plot
    png('${_output[0]}', width = 6, height = 4, unit='in', res=300)
    manhattan_plot <- manhattan(data, chr='CHR', bp='${bp}', snp='${snp}', p='${pval}', main = 'Manhattan plot for ${_phenoCol} (${step_name.rsplit("_",1)[0]})', ylim = c(0, ylim), cex = 0.6, 
    cex.axis = 0.9, col = c("blue4", "orange3"), chrlabs = as.character(c(1:22)))
    dev.off()
    # Creating qqplot
    png('${_output[1]}', width = 5, height = 5, unit='in', res=300)
    qq_plot <- qq(data$${pval}, main = 'QQ Plot for ${_phenoCol} (${step_name.rsplit("_",1)[0]})', xlim = c(0, 8), ylim = c(0, ylim), pch = 18, col = "blue4", cex = 1.5, las = 1)
    dev.off()
    write('## p-value summary:', ${_output[3]:r}, append=T)
    write(paste("Genomic inflation factor is", round(lambda,3), "for", nrow(data), "variants analyzed.${sep}"), ${_output[3]:r}, append=T)
    
  
R: container=container_lmm, expand='${ }', stderr = f'{_output[2]:n}.stderr', stdout = f'{_output[2]:n}.stdout'
    library('dplyr')
    library('ggrepel')
    #Load your data
    data <- read.table(gzfile('${_input[0]}'),sep='\t', header=T)
    # Create a subset of the data with variants with P< 0.05 and arrange by chromosome number
    # https://danielroelfs.com/blog/how-i-create-manhattan-plots-using-ggplot/
    sig.dat <- data %>% 
      subset(${pval} < ${p_filter}) %>%
      arrange (CHR, .by_group=TRUE)
    # Add highlight and annotation information
    #mutate( is_highlight=ifelse(SNP %in% index_snps, "yes", "no")) %>%
    #mutate( is_annotate=ifelse(-log10(P_BOLT_LMM)>6, "yes", "no")) 
    # Check the list of chromosomes (make sure the sex chr are at the end of the list)
    # Get the cumulative base pair position for each variant
    nCHR <- length(unique(sig.dat$CHR))
    sig.dat$BPcum <- NA
    s <- 0
    nbp <- c()
    for (i in unique(sig.dat$CHR)){
      nbp[i] <- max(sig.dat[sig.dat$CHR == i,]$${bp})
      sig.dat[sig.dat$CHR == i,"BPcum"] <- sig.dat[sig.dat$CHR == i,"${bp}"] + s
      s <- s + nbp[i]
    }

    # Calculate the mid point for each chromosome for plotting the x-axis
    # Calculate the y-lim 

    axis.set <- sig.dat %>% 
      group_by(CHR) %>% 
      summarize(center = (max(BPcum) + min(BPcum)) / 2)
    ylim <- if (${ylim} == 0) abs(floor(log10(min(sig.dat$${pval})))) + 2 else ${ylim}
    sig <- 5e-8

    # Now time to draw the manhattan plot without filtering the most significant signals
    manhplot <- ggplot(sig.dat, aes(x = BPcum, y = -log10(${pval}), 
                                 color = as.factor(CHR), size = -log10(${pval}))) +
      geom_point(alpha = 0.75) +
      geom_hline(yintercept = -log10(sig), color = "red1", linetype = "dashed") + 
      scale_x_continuous(label = axis.set$CHR, breaks = axis.set$center) +
      scale_y_continuous(expand = c(0,0), limits = c(0, ylim)) +
      scale_color_manual(values = rep(c("#276FBF", "#183059"), nCHR)) +
      scale_size_continuous(range = c(0.5,3)) +
      # Add highlighted points
      # geom_point(data=subset(sig.dat, is_highlight=="yes"), color="orange", alpha=0.75) +
      labs(x = "Chromosome", 
           y = "-log10(p)",
           title ='Manhattan plot for ${_phenoCol} (${step_name.rsplit("_",1)[0]})') + 
      theme_classic() +
      theme( 
        legend.position = "none",
        panel.border = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        axis.text.x = element_text(angle = 90, size = 8, vjust = 0.5)
      )

    # To save a plot created with ggplot2 you have to use the print() function

    png('${_output[2]}', width = 6, height = 4, unit='in', res=300)
    print(manhplot)
    dev.off()
  
    # save significant data to a file for further evaluations
    tmp = sig.dat[,c('CHR', '${bp}', 'BPcum', '${snp}', '${pval}')]
    colnames(tmp) = c('CHR', 'POS', 'POScum', 'SNP', 'pvalue')
    saveRDS(list(data = tmp, 
                 ylim = abs(floor(log10(min(sig.dat$${pval})))) + 2,
                 axis.set = axis.set), ${_output[4]:r})

bash: container=container_lmm, expand = True
  set -e
  echo -e "# QQ plot for {_phenoCol}\n" >> {_output[3]}
  echo -e "![]({_output[1]:bn}.png){sep}" >> {_output[3]}
  echo -e "# Manhattan plot for {_phenoCol}\n" >> {_output[3]}
  echo -e "![]({_output[0]:bn}.png){sep}" >> {_output[3]}
  echo -e "# Annotated Manhattan plot for {_phenoCol}\n" >> {_output[3]}
  echo -e "![]({_output[2]:bn}.png){sep}" >> {_output[3]}
  echo -e "# Result files\n\`\`\`" >> {_output[3]}
  ls {_input[0]:nn}.* | grep -vP 'stderr|stdout'>> {_output[3]}
  echo -e "\`\`\`" >> {_output[3]}
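
The genomic inflation factor written to the analysis summary is the median 1-df chi-square statistic implied by the p-values, divided by the expected median under the null (`qchisq(0.5, 1)`, about 0.455). A minimal standard-library sketch of the same calculation follows. The function names are hypothetical, and p-values outside the open interval (0, 1) are dropped here, which is an assumption of this sketch rather than the behavior of the R code above.

```python
from statistics import NormalDist, median


def chisq1_from_p(p):
    """1-df chi-square statistic with upper-tail probability p.

    Equivalent to R's qchisq(1 - p, 1); computed from the lower normal
    tail so that very small p-values do not round to 1 - p == 1.
    """
    z = NormalDist().inv_cdf(p / 2)
    return z * z


def genomic_inflation(pvalues):
    """Median observed statistic over the expected null median."""
    stats = [chisq1_from_p(p) for p in pvalues if 0 < p < 1]
    return median(stats) / chisq1_from_p(0.5)


# Evenly spread p-values mimic a well-calibrated test, so lambda is close to 1
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
print(round(genomic_inflation(uniform_p), 3))
```

Values well above 1 usually point to residual population structure or relatedness that the mixed model did not absorb.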

## Create analysis report

To install `marp`: 
```bash 
npm install -g @marp-team/marp-cli
```

In [2]:
# Generate analysis report: HTML file, and optionally PPTX file
[boltlmm_4, fastGWA_4, SAIGE_5, regenie_5, regenie_burden_5, GMMAT_3]
input: group_by = 5, group_with='phenoCol'
output: f"{_input['analysis_summary']:n}.html"
sh: container=container_marp, expand = True, stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    node /opt/marp/.cli/marp-cli.js {_input['analysis_summary']} -o {_output:a} \
        --title '{_phenoCol} {step_name.rsplit("_",1)[0]} analysis' \
        --allow-local-files
    node /opt/marp/.cli/marp-cli.js {_input['analysis_summary']} -o {_output:an}.pptx \
        --title '{_phenoCol} {step_name.rsplit("_",1)[0]} analysis' \
        --allow-local-files

## Results

Taking BoltLMM as an example, the analysis produces the following files:

In [None]:
%preview output/phenotypes_BMI.boltlmm.snp_stats.gz

In [None]:
%preview output/phenotypes_BMI.boltlmm.ref_stats.gz

In [None]:
%preview output/phenotypes_BMI.boltlmm.snp_counts.txt

In [None]:
%preview output/phenotypes_BMI.boltlmm.log

The results of the analysis are summarized in a `PPTX` file:

In [None]:
ls output/*.boltlmm.analysis_summary.pptx