# LMM/GLMM analyses for UK Biobank data

This notebook implements pipelines for association analysis of binary and quantitative traits using BOLT-LMM (version 2.3.4), fastGWA, REGENIE and SAIGE.

## Aim

This pipeline was initially developed to perform genetic association analysis using various LMM methods on UK Biobank imputed data of ~500K individuals, although it can be used to analyze other studies.

## Input data

1. Genotype file for constructing the GRM (genetic relationship matrix), formatted as a PLINK binary file set (`.bed`/`.bim`/`.fam`)
    - `--bfile=prefix`
2. Imputed genotype dosages in `bgen` format (`.bgen`, `.bgi`, `.sample`)
    - `--bgenFile` and `--sampleFile`
3. Phenotype file: a white-space-delimited file with column headers, whose first two columns are `FID` and `IID`
    - `--phenoFile` specifies the file and `--phenoCol` the phenotype to be analyzed
4. Covariate file, in the same format as the phenotype file
    - `--covarFile` specifies the file, `--covarCol` the qualitative covariates and `--qCovarCol` the quantitative covariates
    - If `--covarFile` is not specified, the phenotype file is used as the covariate file
    - To specify an array of covariates you can use bash tricks, e.g. `--qCovarCol PC{1:20}`

Note: the reference genome used is **GRCh37/hg19**.
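
Because every workflow assumes the phenotype and covariate files follow the layout described above, it can help to check the header before launching a long run. The following is a minimal sketch in plain Python, assuming a white-space-delimited header line; the helper name `check_pheno_header` and the example header are hypothetical and not part of the pipeline.

```python
def check_pheno_header(header_line, required_cols):
    """Return the required columns that are missing from a phenotype/covariate header.

    Raises ValueError if the first two columns are not FID and IID.
    """
    cols = header_line.split()  # split on any run of white space
    if cols[:2] != ["FID", "IID"]:
        raise ValueError(f"first two columns must be FID and IID, got {cols[:2]}")
    return [c for c in required_cols if c not in cols]


# Example header from a hypothetical phenotype file
header = "FID IID BMI SEX AGE"
missing = check_pheno_header(header, ["BMI", "SEX", "AGE", "PC1"])
print(missing)
```

An empty list means every column named in `--phenoCol`, `--covarCol` and `--qCovarCol` is present.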

## Software specific inputs

### BOLT-LMM additional input

- Reference genetic maps, provided on the BOLT-LMM website
    - `--geneticMapFile=tables/genetic_map_hg##.txt.gz`
- Reference LD scores, provided on the BOLT-LMM website
    - `--LDscoresFile=tables/LDSCORE.1000G_EUR.tab.gz`
- Use `--covarMaxLevels` to specify the maximum number of categories allowed for a qualitative covariate.

## Output

Our pipeline generates 

1. Summary statistics file for each variant analyzed
2. QQ and Manhattan plots for these summary statistics

## Command interface

In [1]:
sos run LMM.ipynb -h

usage: sos run LMM.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  boltlmm
  gcta
  fastGWA
  regenie
  SAIGE

Global Workflow Options:
  --cwd VAL (as path, required)
                        the output directory for generated files
  --sampleFile VAL (as path, required)
                        Path to sample file
  --bfile VAL (as path, required)
                        Genotype files in plink binary this is used for
                        computing the GRM
  --bgenFile  paths

                        Path to bgen files
  --phenoFile VAL (as path, required)
                        Phenotype file for quantitative trait (BMI)
  --phenoCol VAL (as str, required)
                 

## Global parameter setting

In [1]:
[global]
# the output directory for generated files
parameter: cwd = path
# Path to sample file
parameter: sampleFile = path
# Genotype files in plink binary this is used for computing the GRM
parameter: bfile = path
# Path to bgen files 
parameter: bgenFile = paths
# Phenotype file for quantitative trait (BMI)
parameter: phenoFile = path
# Phenotype to be analyzed (specify the column)
parameter: phenoCol = []
# Covariate file path. Will use phenoFile if empty
parameter: covarFile = path('.')
# Summary statistics format file path used for unifying output column names. Will not unify names if empty
parameter: formatFile = path('.')
# Qualitative covariates to be used in the analysis
parameter: covarCol = []
# Quantitative covariates to be used in the analysis
parameter: qCovarCol = []
# Specific number of threads to use
parameter: numThreads = int
# Minimum MAF to be used
parameter: bgenMinMAF = float
# Minimum info score to be used
parameter: bgenMinINFO = float
# For cluster jobs, number of commands to run per job
parameter: job_size = 1

if not covarFile.is_file():
    covarFile = phenoFile
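
The default `path('.')` acts as a sentinel: it points to a directory rather than a file, so the `is_file()` test fails and the phenotype file is used instead. A minimal sketch of the same fallback using the standard `pathlib` module rather than the SoS `path` type; the function name `resolve_covar_file` is hypothetical.

```python
from pathlib import Path


def resolve_covar_file(covar_file, pheno_file):
    """Mimic the fallback: use the phenotype file when no covariate file is given."""
    covar_file = Path(covar_file)
    # Path('.') is a directory, so is_file() is False and the fallback applies
    return covar_file if covar_file.is_file() else Path(pheno_file)


print(resolve_covar_file(".", "data/phenotypes.txt"))
```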

## Illustration with minimal working examples

```
JOB_OPT='-j 2'
```

### BOLT-LMM example command

On a minimal working example (MWE) dataset (about 1min to complete the analysis),

```
sos run LMM.ipynb boltlmm \
    --cwd output \
    --bfile data/genotypes.bed \
    --sampleFile data/imputed_genotypes.sample \
    --bgenFile data/imputed_genotypes_chr*.bgen \
    --phenoFile data/phenotypes.txt \
    --formatFile data/boltlmm_template.yml \
    --LDscoresFile BOLT-LMM_v2.3.4/tables/LDSCORE.1000G_EUR.tab.gz \
    --geneticMapFile BOLT-LMM_v2.3.4/tables/genetic_map_hg19_withX.txt.gz \
    --phenoCol BMI \
    --covarCol SEX \
    --covarMaxLevels 10 \
    --qCovarCol AGE \
    --numThreads 5 \
    --bgenMinMAF 0.001 \
    --bgenMinINFO 0.1 \
    --lmm-option none \
    --p-filter 1 \
    $JOB_OPT
```

Please note that the command above is only meant to demonstrate the usage of the pipeline. Results are written to a folder called `output`. We set `--lmm-option` to `none` so that no LMM is run on this minimal dataset; the p-value column used for the QQ and Manhattan plots (`--pval`), `P_LINREG`, therefore comes from conventional linear regression. In practice you will want to use one of the LMM options in BOLT-LMM. If `--lmm-option` is not specified, the default is the `--lmm` switch of `bolt`.

### fastGWA example command

On a minimal working example (MWE) dataset (analysis completes almost instantly),

```
sos run LMM.ipynb fastGWA \
    --cwd output \
    --bfile data/genotypes.bed \
    --sampleFile data/imputed_genotypes.sample \
    --bgenFile data/imputed_genotypes_chr*.bgen \
    --phenoFile data/phenotypes.txt \
    --formatFile data/fastGWA_template.yml \
    --phenoCol BMI \
    --covarCol SEX \
    --qCovarCol AGE \
    --numThreads 1 \
    --bgenMinMAF 0.001 \
    --bgenMinINFO 0.1 \
    --parts 2 \
    --p-filter 1 \
    $JOB_OPT
```

### REGENIE example command

On a minimal working example (MWE) dataset,

```
sos run ~/project/bioworkflows/GWAS/LMM.ipynb regenie \
    --cwd output \
    --bfile data/genotypes21_22.bed \
    --sampleFile data/imputed_genotypes.sample \
    --bgenFile data/imputed_genotypes_chr*.bgen \
    --phenoFile data/phenotypes.txt \
    --formatFile data/regenie_template.yml \
    --phenoCol ASTHMA T2D \
    --covarCol SEX \
    --qCovarCol AGE \
    --numThreads 8 \
    --bsize 1000 \
    --lowmem output \
    --trait bt \
    --minMAC 4 \
    --bgenMinMAF 0.05 \
    --bgenMinINFO 0.8 \
    --reverse_log_p \
    $JOB_OPT
```

### SAIGE example command

On a minimal working example (MWE) dataset,

```
sos run LMM.ipynb SAIGE \
    --cwd output \
    --bfile data/genotypes.bed \
    --sampleFile data/imputed_genotypes.sample \
    --bgenFile data/imputed_genotypes_chr*.bgen \
    --phenoFile data/phenotypes.txt \
    --phenoCol BMI \
    --covarCol SEX \
    --qCovarCol AGE \
    --numThreads 4 \
    --bgenMinMAF 0.001 \
    --bgenMinINFO 0.1 \
    --trait_type quantitative \
    --pval p.val \
    --bp POS \
    --p-filter 1 \
    $JOB_OPT
```

### Run workflow on a cluster

In the examples above, the shell variable `JOB_OPT` was set to `-j 2`, that is, run 2 jobs in parallel on a local computer (each using 5 threads in the BOLT-LMM example, due to `--numThreads 5`).

On a cluster, we use a job template and configure `JOB_OPT` as follows:

```
JOB_OPT="-c farnam.yml -q farnam -J 40"
```

Here we use task queue `farnam` configured in file `farnam.yml`. We allow for at most 40 jobs in the cluster job queue.

## BOLT-LMM workflow implementation

To install from source code, follow the instructions here: https://data.broadinstitute.org/alkesgroup/BOLT-LMM/#x1-70002.2

- On Linux machines, a binary executable is provided and can be used.
- Supporting files such as the LD score file and genetic map file can be found [in the installation bundle](https://data.broadinstitute.org/alkesgroup/BOLT-LMM/downloads/BOLT-LMM_v2.3.4.tar.gz).
- For a complete description of the `bolt` commands, see: http://manpages.ubuntu.com/manpages/eoan/en/man1/bolt.1.html.

**A note for developers**: it is important to have input and output for each step. Input files and output files are best derived from one another.

The BOLT-LMM software computes statistics for testing association between phenotypes and genotypes using a linear mixed model. Relevant options include:


```
--bfile = accepts genotype files in PLINK binary format (.fam, .bed, .bim)
--geneticMapFile = Oxford-format file for interpolating genetic distances: tables/genetic_map_hg##.txt.gz
--phenoFile = phenotype file (header required; FID and IID must be first two columns)
--phenoCol = phenotype columns header
--covarFile = covariate file (header required; FID and IID must be first two columns)
--covarCol = categorical covariate column(s); for >1, use multiple --covarCol and/or {i:j} expansion
--qCovarCol = quantitative covariate column(s); for  >1, use multiple --qCovarCol and/or {i:j} expansion
--lmm = compute assoc stats under the inf model and with Bayesian non-inf prior (VB approx), if power gain expected
--modelSnps = file(s) listing SNPs to use in model (i.e., GRM) (default: use all non-excluded SNPs)
--LDscoresFile = LD Scores for calibration of Bayesian assoc stats: tables/LDSCORE.1000G_EUR.tab.gz
--numThreads = number of computational threads
--statsFile = output file for assoc stats at PLINK genotypes
--bgenFile = file(s) containing Oxford BGEN-format genotypes to test for association
--sampleFile = file containing Oxford sample file corresponding to BGEN file(s)
--bgenMinMAF = MAF threshold on Oxford BGEN-format genotypes; lower-MAF SNPs will be ignored
--bgenMinINFO = INFO threshold on Oxford BGEN-format genotypes; lower-INFO SNPs will be ignored
--statsFileBgenSnps = output file for assoc stats at BGEN-format genotypes
```

It is important to know that, for BGEN v1.2 files, BOLT-LMM v2.3.4 only accepts the 8-bit encoding, as stated below:

*WARNING: The BGEN format comprises a few sub-formats; we have only implemented support for the versions (and specific data layouts) used in the UK Biobank N=150K and N=500K releases. In particular, for BGEN v1.2, BOLT-LMM currently only supports the 8-bit encoding used for the UK Biobank N=500K data. (Starting with BOLT-LMM v2.3.3, missing values in BGEN v1.2 data are now allowed.)*

In [2]:
# Run BOLT analysis
[boltlmm_1]
# Maximum categories of covariates allowed 
parameter: covarMaxLevels = int
# Path to LDscore file for reference population
parameter: LDscoresFile = path
# Path to genetic map file used to interpolate genetic map coordinates from SNP physical (base pair) positions
parameter: geneticMapFile = path
# LMM option: lmm, lmmInfOnly, and lmmForceNonInf
parameter: lmm_option = 'lmm'
depends: executable("bolt"), LDscoresFile, geneticMapFile
input: bgenFile, group_by = 1
output: f'{cwd}/cache/{_input:bn}.{phenoFile:bn}_{phenoCol}.boltlmm.snp_stats.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    bolt \
    --bfile=${bfile:n} \
    --phenoFile=${phenoFile} \
    --phenoCol=${phenoCol} \
    --covarFile=${covarFile} \
    ${' '.join(['--covarCol=%s ' % x for x in covarCol if x is not None])} \
    --covarMaxLevels=${covarMaxLevels} \
    ${' '.join(['--qCovarCol=%s ' % x for x in qCovarCol if x is not None])} \
    --LDscoresFile=${LDscoresFile} \
    --geneticMapFile=${geneticMapFile} \
    ${('--' + lmm_option) if lmm_option in ['lmm', 'lmmInfOnly', 'lmmForceNonInf'] else ''} \
    --statsFile=${_output:nn}.ref_stats.gz \
    --numThreads=${numThreads} \
    --bgenFile=${_input} \
    --bgenMinMAF=${bgenMinMAF} \
    --bgenMinINFO=${bgenMinINFO} \
    --sampleFile=${sampleFile} \
    --statsFileBgenSnps=${_output} \
    --verboseStats

bash: expand = "${ }", active = (_index != 0)
    # remove redundant reference summary stats file
    rm -f ${_output:nn}.ref_stats.gz

bash: expand = "${ }", active = (_index == 0)
    # rename reference summary stats file
    mv ${_output:nn}.ref_stats.gz ${cwd}/${phenoFile:bn}_${phenoCol}.boltlmm.ref_stats.gz
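
The `bolt` command above expands the covariate lists into one flag per column and adds an LMM switch only when `lmm_option` is one of the three recognised values. The sketch below mirrors that logic in plain Python; the function name `bolt_flags` is hypothetical, and the flags are returned in a simplified order rather than the exact order used in the workflow.

```python
def bolt_flags(covar_cols, qcovar_cols, lmm_option):
    """Build the repeated covariate flags and the optional LMM switch as a list."""
    flags = [f"--covarCol={c}" for c in covar_cols if c is not None]
    flags += [f"--qCovarCol={c}" for c in qcovar_cols if c is not None]
    # Only the three recognised values become a switch; anything else (e.g. 'none') adds nothing
    if lmm_option in ("lmm", "lmmInfOnly", "lmmForceNonInf"):
        flags.append("--" + lmm_option)
    return flags


print(" ".join(bolt_flags(["SEX"], ["AGE"], "none")))
print(" ".join(bolt_flags(["SEX"], ["AGE"], "lmm")))
```

This is why passing `--lmm-option none` in the minimal example leaves `bolt` without any LMM switch.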

## fastGWA workflow implementation

Installation instructions can be found at https://cnsgenomics.com/software/gcta/#Download. On Linux machines, a binary executable is provided and can be used.

Documentation: https://cnsgenomics.com/software/gcta/#fastGWA

### Step 1: Creation of the GRM
The GRM only needs to be created once for all the phenotypes to be analyzed with the same genotype data. In this step the GRM calculation is divided into multiple parts to reduce computation time.

In [None]:
# Partition the GRM computation into `parts` jobs (100 by default); each job requests 48G of memory
[gcta_1]
depends: executable("gcta64")
# Number of parts the GRM calculation is to be partitioned
parameter: parts = 100
part_number = [f'{parts}_{format(x+1, "0" + str(len(str(parts))))}' for x in range(parts)]
input: bfile, for_each = 'part_number'
output: f'{cwd}/cache/{_input:bn}.part_{_part_number}.grm.bin', 
        f'{cwd}/cache/{_input:bn}.part_{_part_number}.grm.N.bin', 
        f'{cwd}/cache/{_input:bn}.part_{_part_number}.grm.id'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '48G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    gcta64 \
    --bfile ${_input[0]:n} \
    --make-grm-part ${parts} ${_index+1} \
    --thread-num ${numThreads} \
    --out ${_output[0]:nnn}
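
The `part_number` expression in this step builds one zero-padded label per GRM part, so that the part files sort in the same order as their numeric index. A minimal sketch reproducing that expression in plain Python; the function name `part_labels` is hypothetical.

```python
def part_labels(parts):
    """Reproduce the zero-padded '<parts>_<index>' labels used for the GRM part files."""
    width = len(str(parts))  # pad the index to the number of digits in `parts`
    return [f"{parts}_{format(i + 1, '0' + str(width))}" for i in range(parts)]


labels = part_labels(100)
print(labels[0], labels[-1])  # 100_001 100_100
```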

### Step 2: Combine all the GRM parts into one file

In [None]:
# Merge all the parts together (Linux, Mac)
[gcta_2]
input: group_by = 'all'
output: f'{cwd}/{bfile:bn}.grm.bin', 
        f'{cwd}/{bfile:bn}.grm.N.bin', 
        f'{cwd}/{bfile:bn}.grm.id' 
task: trunk_workers = 1, trunk_size = job_size, walltime = '2h', mem = '6G', cores = 1, tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    # here input is results all parts each having 3 items. We need to get the corresponding every other 3 items
    cat ${paths(_input[::3])} > ${_output[0]}
    cat ${paths(_input[1::3])} > ${_output[1]}
    cat ${paths(_input[2::3])} > ${_output[2]}
    #rm ${paths(_input)}
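
The merge relies on the inputs arriving as a flat list of triples (`.grm.bin`, `.grm.N.bin`, `.grm.id` for part 1, then for part 2, and so on), so a stride of 3 picks out all files of one kind. A small sketch of that slicing in plain Python; the file names and the function name `split_by_kind` are made up for illustration.

```python
def split_by_kind(files):
    """Split a flat [bin, N.bin, id, bin, N.bin, id, ...] list into three lists."""
    return files[0::3], files[1::3], files[2::3]


# Hypothetical file names for a 2-part run
files = [
    "test.part_2_1.grm.bin", "test.part_2_1.grm.N.bin", "test.part_2_1.grm.id",
    "test.part_2_2.grm.bin", "test.part_2_2.grm.N.bin", "test.part_2_2.grm.id",
]
bins, n_bins, ids = split_by_kind(files)
print(bins)
```

Each of the three lists is then concatenated, in part order, into the corresponding merged file.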

### Step 3: Make a sparse GRM to be used in the association analyses

In [None]:
# Make a sparse GRM from the merged full-dense GRM
[gcta_3]
depends: executable("gcta64")
output: f'{cwd}/{bfile:bn}.grm.sp' 
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '48G', cores = 1, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    gcta64 --grm ${_output:nn} --make-bK-sparse 0.05 --out ${_output:nn}

### Step 4: Run the single variant association analysis using fastGWA

In [None]:
# fastGWA mixed model (based on the sparse GRM generated above)
[fastGWA_1]
parameter: grmFile = path(f'{cwd}/{bfile:bn}.grm.sp')
depends: executable("gcta64"), grmFile
# extract and prepare phenotype & covariate files
import pandas as pd
dat = pd.read_csv(phenoFile, header=0, delim_whitespace=True)
if len(phenoCol) == 1:
    dat.to_csv(f"{cwd}/{phenoFile:bn}.fastGWA_phenotype", sep=' ', index=False, columns = ['FID', 'IID'] + phenoCol)
dat = pd.read_csv(covarFile, header=0, delim_whitespace=True)
# file names are derived from covarFile so that they match the `input:` statement below
if len(covarCol) > 0:
    dat.to_csv(f"{cwd}/{covarFile:bn}.fastGWA_covar", sep=' ', index=False, columns = ['FID', 'IID'] + covarCol)
if len(qCovarCol) > 0:
    dat.to_csv(f"{cwd}/{covarFile:bn}.fastGWA_qcovar", sep=' ', index=False, columns = ['FID', 'IID'] + qCovarCol)

input: bgenFile, group_by = 1, group_with = dict(info=[(path(f"{cwd}/{phenoFile:bn}.fastGWA_phenotype"), sampleFile, grmFile,
                                                        path(f"{cwd}/{covarFile:bn}.fastGWA_qcovar"), path(f"{cwd}/{covarFile:bn}.fastGWA_covar"))] * len(bgenFile))
output: f'{cwd}/cache/{_input:bnn}.{phenoFile:bn}.fastGWA.gz'
fail_if(not path(f'{_input}.bgi').is_file(), msg = f'Cannot find file ``{_input}.bgi``. Please generate it using command ``bgenix -g {_input} -index``.')
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '48G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    gcta64 \
    --bgen ${_input} \
    --sample  ${_input.info[1]} \
    --grm-sparse ${_input.info[2]:nn} \
    --maf ${bgenMinMAF} \
    --info ${bgenMinINFO} \
    --fastGWA-mlm \
    --pheno ${_input.info[0]} \
    --qcovar ${_input.info[3]} \
    --covar ${_input.info[4]} \
    --threads ${numThreads} \
    --out ${_output:nn} \
    && gzip -f --best ${_output:n}
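
Before calling `gcta64`, this step writes separate phenotype, covariate and quantitative-covariate files, each holding `FID`, `IID` and the selected columns. The workflow does this with pandas; the sketch below shows the same column selection using only the standard library, on made-up data, with a hypothetical helper name `select_columns`.

```python
def select_columns(lines, columns):
    """Keep FID, IID and the requested columns from white-space-delimited lines."""
    rows = [line.split() for line in lines if line.strip()]
    header = rows[0]
    # header.index raises ValueError if a requested column is absent
    keep = [header.index(name) for name in ["FID", "IID"] + list(columns)]
    return [" ".join(row[i] for i in keep) for row in rows]


# Made-up phenotype table
pheno = [
    "FID IID BMI SEX AGE",
    "1 1 22.5 1 40",
    "2 2 27.1 2 55",
]
for line in select_columns(pheno, ["BMI"]):
    print(line)
```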

Output from each step:

1. **gcta_1 for x number of parts (in the example above x=100, so this step will create 400 files):**
* test.part_{_part_number}.grm.bin
* test.part_{_part_number}.grm.N.bin 
* test.part_{_part_number}.grm.id
* test.part_{_part_number}.log (the program creates the log file so there is no need for .stderr and .stdout)

2. **gcta_2 this step creates 5 output files:**
* test.grm.bin (it is a binary file which contains the lower triangle elements of the GRM)
* test.grm.N.bin (it is a binary file which contains the number of SNPs used to calculate the GRM)
* test.grm.id (no header line; columns are family ID and individual ID, see above)
* test.grm.stderr
* test.grm.stdout

3. **gcta_3 this step creates 3 output files:**
* test.grm.sp (sparse GRM made from the dense GRM)
* test.grm.sp.stderr
* test.grm.sp.stdout

4. **fastGWA this step creates 2 output files per chromosome**
* test{chr1:22}.fastGWA
* test{chr1:22}.fastGWA.log

## REGENIE workflow implementation

Documentation can be found [here](https://rgcgithub.github.io/regenie/). Binary and quantitative traits should be analyzed separately. 

In [None]:
# Select the SNPs and samples to be used based on maf, geno, hwe and mind options
[plink: provides = f'{cwd}/cache/{bfile:bn}.qc_pass.id']
parameter: plink2_module = '''
module load PLINK/2_x86_64_20180428
echo "Module plink2 loaded"
{cmd}
'''
input: bfile
output: f'{cwd}/cache/{bfile:bn}.qc_pass.id', f'{cwd}/cache/{bfile:bn}.qc_pass.snplist' 
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', template = '{cmd}' if executable('plink2').target_exists() else plink2_module    
    plink2 \
      --bfile ${bfile:n} \
      --maf 0.001 \
      --geno 0.01 \
      --hwe 5e-08 \
      --mind 0.01 \
      --write-snplist --write-samples --no-id-header \
      --threads ${numThreads} \
      --out ${_output[0]:n} 

### Step 1: fitting the null

In [1]:
# Run REGENIE step 1: fitting the null
[regenie_1]
# Size of the genotype blocks to be used 
parameter: bsize = int
# Path to temporarily store block predictions
parameter: lowmem = path('.')
# Specify that traits are binary with 0=control,1=case,NA=missing (default is quantitative)
parameter: trait = 'bt'
# extract and prepare phenotype & covariate files
import pandas as pd
import numpy as np
dat = pd.read_csv(phenoFile, header=0, delim_whitespace=True, dtype=str)
dat = dat.replace(to_replace =np.nan, value ="NA")
if len(phenoCol) > 0:    
    dat.to_csv(f"{cwd}/{phenoFile:bn}.regenie_phenotype", sep=' ', index=False, columns = ['FID', 'IID'] + phenoCol)
dat = pd.read_csv(covarFile, header=0, delim_whitespace=True)
if len(covarCol) > 0 or len(qCovarCol) > 0:
    # replace() returns a new data frame, so the result must be assigned back
    dat = dat.replace(to_replace =np.nan, value ="NA")
    dat.to_csv(f"{cwd}/{phenoFile:bn}.regenie_covar", sep=' ', index=False, columns = ['FID', 'IID'] + covarCol + qCovarCol)
depends: f'{cwd}/cache/{bfile:bn}.qc_pass.snplist', f'{cwd}/cache/{bfile:bn}.qc_pass.id'
input: bfile, f"{cwd}/{phenoFile:bn}.regenie_phenotype", f"{cwd}/{phenoFile:bn}.regenie_covar"
output: f'{cwd}/{phenoFile:bn}_' + "_".join([x for x in phenoCol]) + '.regenie_pred.list'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h', mem = '15G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    regenie \
      --step 1 \
      --bed ${_input[0]:n} \
      --phenoFile ${_input[1]} \
      --covarFile ${_input[2]} \
      --extract ${_depends[0]} \
      --keep ${_depends[1]} \
      ${('--' + trait) if trait in ['bt'] else ''} \
      --bsize ${bsize} \
      --lowmem ${lowmem} \
      --threads ${numThreads} \
      --out ${_output:nn}.regenie

### Step 2: association analysis

In [1]:
# Run REGENIE step 2: association analysis
[regenie_2]
# Minimum allele count to be used
parameter: minMAC = int
parameter: trait = 'bt'
depends: f'{cwd}/cache/{bfile:bn}.qc_pass.id'
input: bgenFile, group_by = 1, group_with = dict(info=[(path(f"{cwd}/{phenoFile:bn}.regenie_phenotype"),
                                                        path(f"{cwd}/{phenoFile:bn}.regenie_covar"),
                                                        path(f'{cwd}/{phenoFile:bn}_' + "_".join([x for x in phenoCol]) + '.regenie_pred.list'))] * len(bgenFile))
output: [f'{cwd}/{_input:bn}_'+ str(phenoCol[i]) + '.regenie' for i in range(len(phenoCol))]
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h', mem = '15G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    regenie \
     --step 2 \
     --bgen ${_input} \
     --keep ${_depends[0]} \
     --phenoFile ${_input.info[0]} \
     --covarFile ${_input.info[1]} \
     --phenoColList ${' '.join(phenoCol)} \
     ${('--' + trait) if trait in ['bt'] else ''} \
     --firth 0.01 --approx \
     --pred ${_input.info[2]} \
     --bsize 400 \
     --minMAC ${minMAC} \
     --split \
     --threads ${numThreads} \
     --out ${str(_output[0]).rsplit('_',1)[0]}
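
With `--split`, the step expects one result file per phenotype, named `<prefix>_<phenotype>.regenie`, and recovers the shared `--out` prefix by cutting the first output name at its last underscore. The sketch below illustrates that derivation in plain Python; the function name `regenie_outputs` and the phenotype `LDL_C` are hypothetical. It also shows an assumption worth keeping in mind: the cut only gives the intended prefix when the first phenotype name contains no underscore.

```python
def regenie_outputs(cwd, bgen_stem, pheno_cols):
    """List the per-phenotype result files and the shared --out prefix."""
    outputs = [f"{cwd}/{bgen_stem}_{pheno}.regenie" for pheno in pheno_cols]
    # Drop everything after the last underscore of the first output name
    prefix = outputs[0].rsplit("_", 1)[0]
    return outputs, prefix


outputs, prefix = regenie_outputs("output", "imputed_genotypes_chr21", ["ASTHMA", "T2D"])
print(prefix)

# A phenotype name that itself contains an underscore shortens the prefix at the wrong place
_, bad_prefix = regenie_outputs("output", "imputed_genotypes_chr21", ["LDL_C"])
print(bad_prefix)
```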

In [None]:
[regenie_3]
input: group_by = 1
output: f'{cwd}/{_input:bn}.regenie.gz'
bash: expand = "${ }"
 gzip -f --best ${_output:n}

## SAIGE workflow implementation

We need to create a conda environment to install SAIGE on Yale's HPC cluster. Installation instructions are available at https://github.com/weizhouUMICH/SAIGE

### Step 1: fitting the null

In [None]:
# Fit SAIGE null model
[SAIGE_1]
# trait type, eg 'binary' or 'quantitative'
parameter: trait_type = str
# Whether to use LOCO or not
parameter: loco = 'TRUE'
# Name of the sample column
parameter: sampleCol='IID'
#Path specific to SAIGE script
parameter: script_path = path('~/software/bin/step1_fitNULLGLMM.R')
# Inverse normalization only for non-normal quantitative traits
parameter: invNormalize = 'FALSE'
input: bfile, phenoFile
output: f'{cwd}/{bfile:bn}.{phenoFile:bn}.SAIGE.rda', f'{cwd}/{bfile:bn}.{phenoFile:bn}.SAIGE.varianceRatio.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', template_name='conda', env_name='RSAIGE'
    Rscript ${script_path} \
        --plinkFile=${_input[0]:n} \
        --phenoFile=${_input[1]} \
        --phenoCol=${phenoCol} \
        ${('--covarColList=' + ','.join(covarCol + qCovarCol)) if len(covarCol + qCovarCol) else ''} \
        --sampleIDColinphenoFile=${sampleCol} \
        --traitType=${trait_type} \
        --outputPrefix=${_output[0]:n} \
        --nThreads=${numThreads} \
        --LOCO=${loco} \
        --invNormalize=${invNormalize} \
        --IsOverwriteVarianceRatioFile=TRUE

### Step 2: perform single variant association test

In [None]:
# Compute SAIGE statistics
[SAIGE_2]
# Minimum allele count to be used
parameter: bgenMinMAC = 4
#Specify whether to output allele frequencies in cases and controls
parameter: af_caco = 'TRUE'
#Path specific to SAIGE script
parameter: script_path = path('~/software/bin/step2_SPAtests.R')
# Fix SAIGE non-standard sample file input
import pandas as pd
dat = pd.read_csv(sampleFile, header=0, skiprows=lambda x: x == 1, delim_whitespace=True)
dat.to_csv(f"{cwd}/{sampleFile:bn}.SAIGE_sample", sep=' ', index=False, header=False, columns = [dat.columns[0]])

input: for_each='bgenFile'
output: f'{cwd}/cache/{_bgenFile:bn}.{phenoFile:bn}.SAIGE.gz'
fail_if(not path(f'{_bgenFile}.bgi').is_file(), msg = f'Cannot find file ``{_bgenFile}.bgi``. Please generate it using command ``bgenix -g {_bgenFile} -index``.')
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', template_name='conda', env_name='RSAIGE'
    Rscript ${script_path} \
        --bgenFile=${_bgenFile} \
        --bgenFileIndex=${_bgenFile}.bgi \
        --minMAF=${bgenMinMAF} \
        --minMAC=${bgenMinMAC} \
        --minInfo=${bgenMinINFO} \
        --sampleFile=${cwd}/${sampleFile:bn}.SAIGE_sample \
        --GMMATmodelFile=${_input[0]} \
        --varianceRatioFile=${_input[1]} \
        --SAIGEOutputFile=${_output:n} \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=${af_caco} \
        && sed '1 s/rsid //' -i ${_output:n} \
        && gzip -f --best ${_output:n} \
        && mv ${_output:n}.bgen.txt.gz ${_output}
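
The sample-file fix at the top of this step turns an Oxford `.sample` file into a plain list of IDs: the header line and the second line (which holds column types) are dropped, and only the first column is kept. The workflow uses pandas for this; a minimal standard-library sketch of the same transformation, on a made-up sample file and with a hypothetical function name `sample_ids`, could look like this:

```python
def sample_ids(sample_lines):
    """Return the first-column IDs of an Oxford .sample file, one per individual.

    The first line (column names) and second line (column types) are dropped.
    """
    data_lines = [line for line in sample_lines[2:] if line.strip()]
    return [line.split()[0] for line in data_lines]


# Made-up .sample content
sample = [
    "ID_1 ID_2 missing sex",
    "0 0 0 D",
    "1001 1001 0 1",
    "1002 1002 0 2",
]
print("\n".join(sample_ids(sample)))
```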

Output from each step:

**From step 1**

1. Model file: `${_output}.rda`

2. Association result file for the subset of randomly selected markers: `${_output}.results.txt`

3. Variance ratio file: `${_output}.varianceRatio.txt`

**From step 2**

1. A file with association results for each chromosome (Note: these are given with respect to Allele 2)

## Merge results

In [None]:
# Merge results and log files
[boltlmm_2, fastGWA_2, SAIGE_3, regenie_4]
parameter: reverse_log_p = False
input: group_by = 'all'
output: f'{cwd}/{phenoFile:bn}_' + '_'.join(['%s' % x for x in phenoCol if x is not None]) + f'.{step_name.rsplit("_",1)[0]}' + '.snp_stats.gz', 
        f'{cwd}/{phenoFile:bn}_' + '_'.join(['%s' % x for x in phenoCol if x is not None]) + f'.{step_name.rsplit("_",1)[0]}' + '.snp_counts.txt'
task: trunk_workers = 1, trunk_size = 1, walltime = '30m', mem = '6G', cores = 1, tags = f'{step_name}_{_output[0]:bn}'
python: expand ='${ }'
    import gzip
    n_lines = -1
    if ${formatFile.is_file()}:
        output = '${_output[0]:n}' + '_original_columns' + '${_output[0]:x}'
    else:
        output = '${_output[0]}'
    with gzip.open(output, 'wt') as outfile:
        with gzip.open(${_input[0]:r}) as f:
            for line in f:
                outfile.write(line.decode('utf-8'))
            for files in [${_input:r,}][1:]:
                with gzip.open(files) as f:
                    i = 0
                    for line in f:
                        if i > 0:
                            outfile.write(line.decode('utf-8'))
                        i += 1
    # unify output format (only when a format file is provided)
    if ${formatFile.is_file()}:
        import yaml
        import pandas as pd
        sumstats = pd.read_csv(output, compression='gzip', header=0, delim_whitespace=True, quotechar='"')  
        config = yaml.safe_load(open(${formatFile:r}, 'r'))
        try:
            sumstats = sumstats.loc[:,list(config.values())]
        except KeyError:
            raise ValueError(f'According to ${formatFile}, input summary statistics should have the following columns: {list(config.values())}.')
        sumstats.columns = list(config.keys())
        if ${reverse_log_p}:
            # convert -log10(p) back to a p-value
            sumstats['P'] = sumstats['P'].apply(lambda row: 10**-row)
        # write the unified summary statistics whether or not p-values were converted
        sumstats.to_csv(${_output[0]:r}, compression='gzip', sep='\t', header = True, index = False)

bash: expand="$( )"
    # count result SNPs
    for f in $(_input); do echo "$f: `zcat $f | wc -l`"; done > $(_output[1])
    # merge stderr and stdout files
    for f in $(_input); do 
        for ext in stderr stdout log; do
            echo "$f $ext:"
            cat ${f%.gz}.$ext 2>/dev/null || true
            rm -f ${f%.gz}.$ext 
        done
    done > $(_output[0]:n).log
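
The Python part of the merge step concatenates the per-chromosome result files while keeping a single header line, taken from the first file. A self-contained sketch of that idea, run on two tiny made-up files; the function name `merge_gz` and the file contents are illustrative only.

```python
import gzip
import os
import tempfile


def merge_gz(inputs, output):
    """Concatenate gzipped text files, keeping the header of the first file only."""
    with gzip.open(output, "wt") as out:
        for k, name in enumerate(inputs):
            with gzip.open(name, "rt") as f:
                for i, line in enumerate(f):
                    if k == 0 or i > 0:  # skip the header of every file but the first
                        out.write(line)


# Demonstration on two tiny, made-up per-chromosome files
tmpdir = tempfile.mkdtemp()
parts = []
for chrom in (21, 22):
    name = os.path.join(tmpdir, f"chr{chrom}.gz")
    with gzip.open(name, "wt") as f:
        f.write("SNP CHR P\n")
        f.write(f"rs{chrom} {chrom} 0.5\n")
    parts.append(name)

merged = os.path.join(tmpdir, "merged.gz")
merge_gz(parts, merged)
with gzip.open(merged, "rt") as f:
    merged_lines = f.read().splitlines()
print(merged_lines)
```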

## Manhattan and QQ plots

Before running the pipeline make sure you have installed the necessary packages. We use the `qqman` package from R: https://www.r-graph-gallery.com/101_Manhattan_plot.html
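
The plotting step also reports a genomic inflation factor, computed in R as the median of the 1-df chi-square statistics implied by the p-values divided by the median of the chi-square(1) distribution. For readers who want to check that number outside R, here is a sketch using only the Python standard library. It relies on the identity that the 1-df chi-square quantile at `1 - p` equals the squared standard normal quantile at `1 - p/2`, and assumes all p-values are strictly greater than 0; the function name `genomic_inflation` is hypothetical.

```python
from statistics import NormalDist, median


def genomic_inflation(pvalues):
    """Genomic inflation factor: median 1-df chi-square statistic over its null median."""
    norm = NormalDist()
    # For 1 df, the chi-square quantile at 1 - p equals the squared normal quantile at 1 - p/2
    chisq = [norm.inv_cdf(1 - p / 2) ** 2 for p in pvalues]
    null_median = norm.inv_cdf(0.75) ** 2  # median of chi-square(1), about 0.455
    return median(chisq) / null_median


# Evenly spaced p-values mimic the null distribution, so the factor should be close to 1
null_p = [(i + 0.5) / 1000 for i in range(1000)]
print(round(genomic_inflation(null_p), 3))
```

Values well above 1 suggest inflated test statistics.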


In [None]:
# Manhattan and QQ plots using `qqman`
[boltlmm_3, fastGWA_3, SAIGE_4, regenie_5]
depends: R_library('qqman'), R_library('dplyr'), R_library('ggrepel'), R_library('ggplot2')
# Column name for BP
parameter: bp = 'POS'
# Column name for p-value
parameter: pval = 'P'
# Column name for SNP
parameter: snp = 'SNP'
# Plot only on p-values smaller than this
parameter: p_filter = '0.05'
# ylim set to 0 to use maximum -log10(p) in data
parameter: ylim = 0
sep = '\n\n---\n'
output: manhattan = f'{_input[0]:nn}.manhattan.png',
        qq = f'{_input[0]:nn}.qq.png',
        annotated_manhattan = f'{_input[0]:nn}.manhattan_annotated.png',
        analysis_summary = f'{_input[0]:nn}.analysis_summary.md',
        plot_data = f'{_input[0]:nn}.plot_data.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = '3h', mem = '48G', tags = f'{step_name}_{_output[0]:bn}'
    
bash: expand = "${ }"
    echo '''---
    theme: base-theme
    style: |
      img {
        height: 80%;
        display: block;
        margin-left: auto;
        margin-right: auto;
      }
    ---    
    ''' > ${_output[3]}
    
R: expand='${ }', stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    phenoCol = c(${", ".join([repr(x) for x in phenoCol])})
    len=1
    while (len<=length(phenoCol)) 
    {    # some summary statistics for phenotype
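        # select the column by its name stored in phenoCol[len]; `$phenoCol` would look for a column literally called "phenoCol"
        pheno = read.table(${phenoFile:r}, header=T)[[phenoCol[len]]]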
        if (length(unique(pheno))>2) {
          out = capture.output(summary(pheno))
        } else {
          out = as.data.frame(table(pheno))
          rownames(out) = c('n_ctrl', 'n_case')
          out = out[,2,drop=F]
        }
    write(paste('#', phenoCol[len], 'result summary\n## Phenotype summary:\n```'), ${_output[3]:r}, append = T)
    write.table(out, ${_output[3]:r}, append = T)
    write("```", ${_output[3]:r}, append = T)
    len=len+1}

R: expand='${ }', stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    library('qqman')
    data <- read.table(gzfile('${_input[0]}'), header=T)
    lambda <- median(qchisq(1-data$${pval},1))/qchisq(0.5,1)
    # Creating manhattan plot
    png('${_output[0]}', width = 6, height = 4, unit='in', res=300)
    manhattan_plot <- manhattan(data, chr='CHR', bp='${bp}', snp='${snp}', p='${pval}', main = 'Manhattan plot for ${phenoCol} (${step_name.rsplit("_",1)[0]})', ylim = c(0, 250), cex = 0.6, 
    cex.axis = 0.9, col = c("blue4", "orange3"), suggestiveline = T, genomewideline = T, chrlabs = as.character(c(1:22)))
    dev.off()
    # Creating qqplot
    png('${_output[1]}', width = 5, height = 5, unit='in', res=300)
    qq_plot <- qq(data$${pval}, main = 'QQ Plot for ${phenoCol} (${step_name.rsplit("_",1)[0]})', xlim = c(0, 8), ylim = c(0, 300), pch = 18, col = "blue4", cex = 1.5, las = 1)
    dev.off()
    write('## p-value summary:', ${_output[3]:r}, append=T)
    write(paste("Genomic inflation factor is", round(lambda,3), "for", nrow(data), "variants analyzed.${sep}"), ${_output[3]:r}, append=T)
    
  
R: expand='${ }', stderr = f'{_output[2]:n}.stderr', stdout = f'{_output[2]:n}.stdout'
    library('dplyr')
    library('ggrepel')
    #Load your data
    data <- read.table(gzfile('${_input[0]}'), header=T)
    # Create a subset of the data with variants whose p-value is below the p_filter threshold and arrange by chromosome number
    # https://danielroelfs.com/blog/how-i-create-manhattan-plots-using-ggplot/
    sig.dat <- data %>% 
      subset(${pval} < ${p_filter}) %>%
      arrange (CHR, .by_group=TRUE)
    # Add highlight and annotation information
    #mutate( is_highlight=ifelse(SNP %in% index_snps, "yes", "no")) %>%
    #mutate( is_annotate=ifelse(-log10(P_BOLT_LMM)>6, "yes", "no")) 
    # Check the list of chromosomes (make sure the sex chr are at the end of the list)
    # Get the cumulative base pair position for each variant
    nCHR <- length(unique(sig.dat$CHR))
    sig.dat$BPcum <- NA
    s <- 0
    nbp <- c()
    for (i in unique(sig.dat$CHR)){
      nbp[i] <- max(sig.dat[sig.dat$CHR == i,]$${bp})
      sig.dat[sig.dat$CHR == i,"BPcum"] <- sig.dat[sig.dat$CHR == i,"${bp}"] + s
      s <- s + nbp[i]
    }

    # Calculate the mid point for each chromosome for plotting the x-axis
    # Calculate the y-lim 

    axis.set <- sig.dat %>% 
      group_by(CHR) %>% 
      summarize(center = (max(BPcum) + min(BPcum)) / 2)
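    # Use the user-supplied y-axis limit; a value of 0 means derive it from the smallest p-value
    ylim <- ${ylim}
    if (ylim == 0) ylim <- abs(floor(log10(min(sig.dat$${pval})))) + 2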
    sig <- 5e-8

    # Now time to draw the manhattan plot without filtering the most significant signals
    manhplot <- ggplot(sig.dat, aes(x = BPcum, y = -log10(${pval}), 
                                 color = as.factor(CHR), size = -log10(${pval}))) +
      geom_point(alpha = 0.75) +
      geom_hline(yintercept = -log10(sig), color = "red1", linetype = "dashed") + 
      scale_x_continuous(label = axis.set$CHR, breaks = axis.set$center) +
      scale_y_continuous(expand = c(0,0), limits = c(0, ylim)) +
      scale_color_manual(values = rep(c("#276FBF", "#183059"), nCHR)) +
      scale_size_continuous(range = c(0.5,3)) +
      # Add highlighted points
      # geom_point(data=subset(sig.dat, is_highlight=="yes"), color="orange", alpha=0.75) +
      labs(x = "Chromosome", 
           y = "-log10(p)",
           title ='Manhattan plot for ${phenoCol} (${step_name.rsplit("_",1)[0]})') + 
      theme_classic() +
      theme( 
        legend.position = "none",
        panel.border = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        axis.text.x = element_text(angle = 90, size = 8, vjust = 0.5)
      )

    # To save a plot created with ggplot2 you have to use the print() function

    png('${_output[2]}', width = 6, height = 4, units='in', res=300)
    print(manhplot)
    dev.off()
  
    # save significant data to a file for further evaluations
    tmp = sig.dat[,c('CHR', '${bp}', 'BPcum', '${snp}', '${pval}')]
    colnames(tmp) = c('CHR', 'POS', 'POScum', 'SNP', 'pvalue')
    saveRDS(list(data = tmp, 
                 ylim = abs(floor(log10(min(sig.dat$${pval})))) + 2,
                 axis.set = axis.set), ${_output[4]:r})

bash: expand = True
  set -e
  echo -e "# QQ plot for {phenoCol}\n" >> {_output[3]}
  echo -e "![]({_output[1]:bn}.png){sep}" >> {_output[3]}
  echo -e "# Manhattan plot for {phenoCol}\n" >> {_output[3]}
  echo -e "![]({_output[0]:bn}.png){sep}" >> {_output[3]}
  echo -e "# Manhattan plot for {phenoCol}\n" >> {_output[3]}
  echo -e "![]({_output[2]:bn}.png){sep}" >> {_output[3]}
  echo -e "# Result files\n\`\`\`" >> {_output[3]}
  ls {_input[0]:nn}.* | grep -vP 'stderr|stdout'>> {_output[3]}
  echo -e "\`\`\`" >> {_output[3]}
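
The two calculations behind these plots, the genomic inflation factor written to the summary and the cumulative base-pair coordinate used for the x-axis of the filtered Manhattan plot, can be sketched outside the pipeline. The Python snippet below is a minimal, standard-library-only illustration and is not part of the workflow; the function names `genomic_inflation` and `cumulative_positions` are hypothetical. It relies on the fact that a chi-square quantile with 1 degree of freedom is a squared standard-normal quantile.

```python
from statistics import NormalDist, median


def genomic_inflation(pvalues):
    """Median 1-df chi-square statistic divided by its expected median under the null."""
    norm = NormalDist()
    # For 1 degree of freedom, the chi-square upper-tail quantile at p equals
    # the squared standard-normal quantile at p / 2.
    chisq = [norm.inv_cdf(p / 2) ** 2 for p in pvalues if 0 < p <= 1]
    expected_median = norm.inv_cdf(0.75) ** 2  # same quantity as qchisq(0.5, 1) in R
    return median(chisq) / expected_median


def cumulative_positions(variants):
    """Map (chromosome, bp) pairs to a single running coordinate.

    Each chromosome is shifted by the summed maximum positions of the
    chromosomes before it, mirroring the loop in the R code above.
    """
    max_bp = {}
    for chrom, bp in variants:
        max_bp[chrom] = max(bp, max_bp.get(chrom, 0))
    offsets, running = {}, 0
    for chrom in sorted(max_bp):
        offsets[chrom] = running
        running += max_bp[chrom]
    return [bp + offsets[chrom] for chrom, bp in variants]
```

With evenly spread p-values the inflation factor should be close to 1, and values well above 1 point to inflated test statistics. The cumulative coordinate only depends on the variants that pass the p-value filter, so the chromosome widths in the filtered plot reflect the retained variants rather than the full chromosome lengths.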

## Create analysis report

To install `marp`: 
```bash 
npm install -g @marp-team/marp-cli
```

In [None]:
# Generate analysis report: HTML file, and optionally PPTX file
[boltlmm_4, fastGWA_4, SAIGE_5, regenie_6]
depends: executable('marp')
output: f"{_input['analysis_summary']:n}.html"
bash: workdir = cwd, expand = True
    set -e
    marp {_input['analysis_summary']:b} -o {_output:a} \
        --title '{phenoCol} {step_name.rsplit("_",1)[0]} analysis' \
        --allow-local-files
    # The PPTX export is optional, so a failure here does not abort the step
    marp {_input['analysis_summary']:b} -o {_output:an}.pptx \
        --title '{phenoCol} {step_name.rsplit("_",1)[0]} analysis' \
        --allow-local-files || true

## Results

Taking BoltLMM as an example, the pipeline produces the following analysis files:

In [12]:
%preview output/phenotypes_BMI.boltlmm.snp_stats.gz

SNP	CHR	BP	GENPOS	ALLELE1	ALLELE0	A1FREQ	INFO	CHISQ_LINREG	P_LINREG
rs79945276	21	48096251	0.646473	T	G	0.0640784	0.96222	4.23679	4.0E-02
rs12481825	21	48096617	0.646484	A	C	0.0153529	0.977965	0.712511	4.0E-01
rs61504104	21	48096920	0.646493	C	T	0.0887647	0.975897	0.0796273	7.8E-01
rs55777714	21	48097101	0.646499	T	C	0.169882	0.959507	0.128473	7.2E-01
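
To pull genome-wide significant variants out of a summary statistics file like the one previewed above, something along these lines could be used. This is a sketch that assumes the file is tab-delimited with a header row, as the preview suggests. The function name `significant_variants` is hypothetical, the default column `P_LINREG` is taken from the preview, and the `5e-8` default mirrors the significance line drawn in the Manhattan plot.

```python
import csv
import gzip


def significant_variants(path, pval_col="P_LINREG", threshold=5e-8):
    """Return the rows (as dicts) whose p-value in `pval_col` is below `threshold`."""
    hits = []
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                pval = float(row[pval_col])
            except (TypeError, ValueError):
                # Skip rows with a missing or non-numeric p-value (assumption:
                # such rows should be ignored rather than abort the scan).
                continue
            if pval < threshold:
                hits.append(row)
    return hits


# Example call, using the file name from the preview above:
# hits = significant_variants("output/phenotypes_BMI.boltlmm.snp_stats.gz")
```

A misspelled column name raises a `KeyError` instead of silently returning an empty list, which makes a wrong `pval_col` easy to spot.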


In [13]:
%preview output/phenotypes_BMI.boltlmm.ref_stats.gz

SNP	CHR	BP	GENPOS	ALLELE1	ALLELE0	A1FREQ	F_MISS	CHISQ_LINREG	P_LINREG
rs3131962	1	756604	0.00490722	A	G	0.165	0	0.0284453	8.7E-01
rs12562034	1	768448	0.00495714	A	G	0.07	0	1.03484	3.1E-01
rs4040617	1	779322	0.00500708	G	A	0.155	0	0.133342	7.1E-01
rs79373928	1	801536	0.0058722	G	T	0.02	0	0.0409388	8.4E-01


In [14]:
%preview output/phenotypes_BMI.boltlmm.snp_counts.txt

output/imputed_genotypes_chr21.phenotypes_BMI.boltlmm.snp_stats.gz: 25
output/imputed_genotypes_chr22.phenotypes_BMI.boltlmm.snp_stats.gz: 22

In [15]:
%preview output/phenotypes_BMI.boltlmm.log

output/imputed_genotypes_chr21.phenotypes_BMI.boltlmm.snp_stats.gz stderr:
NOTE: Using all-1s vector (constant term) in addition to specified covariates
NOTE: Using all-1s vector (constant term) in addition to specified covariates
NOTE: Using all-1s vector (constant term) in addition to specified covariates
NOTE: Using all-1s vector (constant term) in addition to specified covariates

The analysis results are also summarized in a `PPTX` file:

In [4]:
ls output/*.boltlmm.analysis_summary.pptx

output/phenotypes_BMI.boltlmm.analysis_summary.pptx
