# GWAS data QC workflow

This workflow implements some preliminary data QC steps for PLINK input files.

## Overview

This notebook includes workflows to:

- Compute the kinship matrix in the sample and identify related individuals
- Genotype and sample QC: by MAF, missing data and HWE
- LD pruning

## Run this workflow

Depending on the context of your problem, the workflow can be executed in two ways:

1. Run `merge_plink` if necessary, to merge all samples first; then run `king` to estimate kinship, and finally `qc` to perform additional QC.
2. If you have a separate data set for kinship estimation, different from your genotype data of interest, run `king` followed by `qc`.

In both cases, pass the `*.related_id` output from `king` to the `qc` step as the `--remove_samples` parameter, so that the related individuals it lists are excluded.

## Minimal working example

FIXME: first specify which of the 2 scenarios this example is for, then show how to run it.

In [None]:
[global]
# the output directory for generated files
parameter: cwd = path
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Number of threads
parameter: numThreads = 20
# Software container option
parameter: container_lmm = 'statisticalgenetics/lmm:1.8'

## Estimate kinship in the sample

The output is a list of related individuals, as well as the kinship matrix
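
As a minimal sketch of how the list of related individuals can be derived from the kinship table, the Python snippet below filters pairs by a kinship threshold and collects the unique IDs. The toy table and the function name `related_ids` are hypothetical and only for illustration; the column names follow those assigned in the `king` step of this notebook.

```python
import csv
import io

# Hypothetical toy kinship table; column names mirror those used in the `king` step.
TOY_KIN0 = """FID1 ID1 FID2 ID2 NSNP HETHET IBS0 KINSHIP
s1 s1 s2 s2 1000 0.10 0.01 0.2500
s1 s1 s3 s3 1000 0.09 0.02 0.0100
s4 s4 s5 s5 1000 0.11 0.01 0.0700
"""


def related_ids(kin0_text, threshold=0.0625):
    """Return sorted unique IDs that appear in any pair with KINSHIP >= threshold."""
    reader = csv.DictReader(io.StringIO(kin0_text), delimiter=" ")
    ids = set()
    for row in reader:
        if float(row["KINSHIP"]) >= threshold:
            # Both members of a related pair are recorded.
            ids.update((row["ID1"], row["ID2"]))
    return sorted(ids)


print(related_ids(TOY_KIN0))         # ['s1', 's2', 's4', 's5']
print(related_ids(TOY_KIN0, 0.125))  # ['s1', 's2']
```

Note that both members of every pair at or above the threshold end up in the list, so a stricter threshold such as 0.125 yields a shorter list.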

In [None]:
# Inference of relationships in the sample to remove closely related individuals
[king]
# Kinship coefficient threshold; pairs at or above it are treated as related (e.g., first degree 0.25, second degree 0.125, third degree 0.0625)
parameter: kinship = 0.0625
# PLINK binary file
parameter: genoFile = path
input: genoFile
output: f'{cwd}/{_input:bn}.kin0', related_samples = f'{cwd}/{_input:bn}.related_id'
task: trunk_workers = 1, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    plink2 \
      --bfile ${_input:n} \
      --make-king-table \
      --threads ${numThreads} \
      --out ${_output[0]:n} 
      
R:  container=container_lmm, expand= "${ }", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout'
    library(dplyr)
    # The header line written by plink2 starts with '#', so read.table skips it as a comment
    kin0 <- read.table(${_output[0]:r}, header=F)
    colnames(kin0) <- c("FID1","ID1","FID2","ID2","NSNP","HETHET","IBS0","KINSHIP")
    rel <- kin0 %>%
        filter(KINSHIP >= ${kinship})
    cat("There are", nrow(rel), "related pairs of individuals using a kinship threshold of ${kinship}\n")
    IID <- sort(unique(unlist(rel[, c("ID1", "ID2")])))
    df <- data.frame(IID)
    df <- df %>%
        mutate(FID = IID) %>%
        select(FID, IID)
    write.table(df,${_output[1]:r}, quote=FALSE, row.names=FALSE, col.names=FALSE)

## Genotype and sample QC

Filter the genetic data by MAF, per-sample and per-variant missingness, and Hardy-Weinberg equilibrium (HWE).
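
To make the variant-level filters concrete, the following sketch computes the missing call rate and the minor allele frequency for a single diploid variant. It then applies thresholds the way `--geno` and `--maf` are commonly described: a variant is dropped when its missingness exceeds the first threshold or its MAF falls below the second. The function names and toy genotypes are hypothetical, and the HWE test is left out for brevity.

```python
def variant_stats(genotypes):
    """Return (missing_rate, maf) for one variant.

    `genotypes` holds alternate-allele counts (0, 1 or 2) per sample,
    with None marking a missing call. Assumes diploid genotypes.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return missing_rate, 0.0
    alt_freq = sum(called) / (2 * len(called))
    return missing_rate, min(alt_freq, 1 - alt_freq)


def passes(genotypes, maf_filter=0.01, geno_filter=0.01):
    """True if the variant survives both the missingness and the MAF filter."""
    missing_rate, maf = variant_stats(genotypes)
    return missing_rate <= geno_filter and maf >= maf_filter


toy = [0, 1, 2, None, 0, 0, 1, 0, 0, 0]  # one missing call out of ten
print(variant_stats(toy))
print(passes(toy))           # fails: missingness is about 0.1, above 0.01
print(passes([0, 1, 0, 0]))  # passes: no missing calls, MAF = 0.125
```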

In this step you may also provide a list of samples to keep, for example to subset the samples by ancestry and analyze each group independently.
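
The `qc_1` step only passes a filter to `plink2` when its threshold is greater than zero, and only passes `--keep` or `--remove` when the corresponding file exists. A rough Python sketch of that assembly logic follows, with the hypothetical helper `qc_flags` standing in for the inline expressions of the workflow:

```python
from pathlib import Path


def qc_flags(maf_filter=0.01, geno_filter=0.01, mind_filter=0.02,
             hwe_filter=5e-08, keep_samples=Path("."), remove_samples=Path(".")):
    """Build the optional plink2 arguments, skipping disabled filters."""
    flags = []
    thresholds = (("--maf", maf_filter), ("--geno", geno_filter),
                  ("--hwe", hwe_filter), ("--mind", mind_filter))
    for name, value in thresholds:
        if value > 0:  # a threshold of 0 disables the filter
            flags += [name, str(value)]
    for name, path in (("--keep", keep_samples), ("--remove", remove_samples)):
        if Path(path).is_file():  # the default path('.') is a directory, so it is skipped
            flags += [name, str(path)]
    return flags


print(qc_flags(hwe_filter=0))
# ['--maf', '0.01', '--geno', '0.01', '--mind', '0.02']
```

Setting a threshold to 0 on the command line is therefore a simple way to switch that filter off.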

In [2]:
# Filter SNPs and select individuals 
[qc_1 (basic QC filters)]
# The path to the file that contains the list of samples to remove (format FID, IID)
parameter: remove_samples = path('.')
# The path to the file that contains the list of samples to keep (format FID, IID)
parameter: keep_samples = path('.')
# MAF filter to use
parameter: maf_filter = 0.01
# Maximum missingness per-variant
parameter: geno_filter = 0.01
# Maximum missingness per-sample
parameter: mind_filter = 0.02
# HWE filter: p-value threshold
parameter: hwe_filter = 5e-08
# PLINK binary files
parameter: genoFile = paths
input: genoFile, group_by=1
output: f'{cwd}/cache/{_input:bn}.filtered.bed'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    plink2 \
      --bfile ${_input:n} \
      ${('--maf %s' % maf_filter) if maf_filter > 0 else ''} ${('--geno %s' % geno_filter) if geno_filter > 0 else ''} ${('--hwe %s' % hwe_filter) if hwe_filter > 0 else ''} ${('--mind %s' % mind_filter) if mind_filter > 0 else ''} ${('--keep %s' % keep_samples) if keep_samples.is_file() else ""} ${('--remove %s' % remove_samples) if remove_samples.is_file() else ""} \
      --make-bed \
      --threads ${numThreads} \
      --out ${_output:n}

In [1]:
# LD pruning
[qc_2]
# Window size, in number of SNPs
parameter: window = 50
# Number of SNPs to shift the window by at each step
parameter: shift = 10
# Pairwise r^2 threshold above which SNPs are pruned
parameter: r2 = 0.1
output: f'{cwd}/cache/{_input:bn}.filtered.prune.in', pruned_bed = f'{cwd}/{_input:bn}.filtered.prune.bed'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    plink2 \
    --bfile ${_input:n} \
    --indep-pairwise ${window} ${shift} ${r2}  \
    --out ${_output[0]:nn} \
    --threads ${numThreads} \
    --memory 48000
   
    plink2 \
    --bfile ${_input:n} \
    --extract ${_output[0]} \
    --make-bed \
    --memory 48000 \
    --out ${_output[1]:n} 
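
The idea behind window-based LD pruning can be sketched in a few lines of Python. This is a simplified greedy illustration, not PLINK's implementation: it always drops the later variant of a correlated pair, which is an assumption made here for brevity. The function names and toy genotypes are hypothetical; `window`, `shift` and `r2` mirror the parameters of the `qc_2` step.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic variant carries no correlation signal
    return sxy * sxy / (sxx * syy)


def prune(variants, window=50, shift=10, r2=0.1):
    """Return the indices of variants kept after greedy pairwise pruning.

    `variants` is a list of per-variant genotype vectors in genomic order.
    """
    removed = set()
    start = 0
    while start < len(variants):
        stop = min(start + window, len(variants))
        idx = [i for i in range(start, stop) if i not in removed]
        for pos, i in enumerate(idx):
            if i in removed:
                continue
            for j in idx[pos + 1:]:
                if j not in removed and r_squared(variants[i], variants[j]) > r2:
                    removed.add(j)  # assumption of this sketch: drop the later variant
        start += shift
    return [i for i in range(len(variants)) if i not in removed]


toy = [
    [0, 1, 2, 0, 1, 2],  # variant 0
    [0, 1, 2, 0, 1, 1],  # variant 1, strongly correlated with variant 0
    [1, 0, 1, 1, 2, 1],  # variant 2, uncorrelated with variant 0
]
print(prune(toy))  # [0, 2]
```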
    

In [None]:
# Merge all the .bed files into one bed file 
[qc_3 (merge all files)]
parameter: output_prefix = ""
output_prefix = f'{_input[0]:bn}.merged' if output_prefix == '' else output_prefix
sos_run("merge_plink", output_prefix=output_prefix, genoFile=_input)

In [None]:
[merge_plink]
parameter: output_prefix = str
parameter: genoFile = paths
skip_if(len(genoFile) == 1)
input: genoFile, group_by = 'all'
output: f"{_input[0]:d}/{output_prefix}.bed"
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    echo -e ${' '.join([str(x)[:-4] for x in _input[1:]])} | sed 's/ /\n/g' > ${_output:n}.merge_list
    plink \
    --bfile ${_input[0]:n} \
    --merge-list ${_output:n}.merge_list \
    --make-bed \
    --out ${_output:n} \
    --threads ${numThreads} \
    --memory 48000
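
The merge list written by the `merge_plink` step contains the file prefixes (paths without the `.bed` extension) of every input except the first, one per line; the first prefix is passed via `--bfile`. A small Python sketch of the same bookkeeping, where `write_merge_list` and the file names are hypothetical:

```python
import tempfile
from pathlib import Path


def write_merge_list(bed_files, list_path):
    """Write the prefixes of all but the first file, one per line.

    Returns the prefix of the first file, which would go to --bfile.
    """
    prefixes = [str(Path(f).with_suffix("")) for f in bed_files]  # strip '.bed'
    Path(list_path).write_text("\n".join(prefixes[1:]) + "\n")
    return prefixes[0]


with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "all.merge_list"
    first = write_merge_list(
        ["data/chr1.bed", "data/chr2.bed", "data/chr3.bed"], list_file
    )
    print(first)                  # prefix of the first input
    print(list_file.read_text())  # remaining prefixes, one per line
```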

```
sos run ~/GWAS_QC.ipynb qc --cwd ~/test --genoFile /mnt/mfs/statgen/Haoyue/AD/filtered_plink_norm/unrelated_AD.bed --container_lmm /mnt/mfs/statgen/containers/lmm.sif -s build
```