# GWAS data QC workflow

This workflow implements some preliminary data QC steps for PLINK input files.

## Overview

This notebook includes workflows for

- Computing the kinship matrix in the sample and identifying related individuals
- Genotype and sample QC: by MAF, missing data and HWE
- LD pruning

## Run this workflow

Depending on the context of your problem, the workflow can be executed in two ways:

1. Run `merge_plink` if necessary to merge all samples first; then run `king` to estimate kinship, and finally `qc` to perform additional QC.
2. When the data set used for kinship estimation is separate from your genotype data of interest, run `king` on the former, followed by `qc` on the latter.

In both cases, use the `*.related_id` output from `king` as the `--remove_samples` input for the `qc` step.

## Minimal working example

The example below follows the first scenario with a single PLINK file, so `merge_plink` is not needed.

### First scenario: estimate kinship

```
sos run ~/bioworkflows/GWAS/GWAS_QC.ipynb king \
    --cwd ~/output \
    --genoFile ~/MWE_AD/rename_chr22.bed \
    --name first \
    --kinship 0.05
```

### First scenario: do qc

```
sos run ~/bioworkflows/GWAS/GWAS_QC.ipynb qc \
    --cwd ~/output \
    --genoFile ~/MWE_AD/rename_chr22.bed \
    --remove_samples ~/output/rename_chr22.first.related_id \
    --maf_filter 0.5 \
    --geno_filter 0.2 \
    --mind_filter 0.1 \
    --hwe_filter 0.0 \
    --name first \
    --window 50 \
    --shift 10 \
    --r2 0.5
```

In [1]:
sos run GWAS_QC.ipynb -h

usage: sos run GWAS_QC.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  king
  qc
  merge_plink

Global Workflow Options:
  --cwd VAL (as path, required)
                        the output directory for generated files
  --name VAL (as str, required)
                        A string to identify your analysis run
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 20 (as int)
                        Number of threads
  --container-lmm 'statisticalgenetics/lmm:1.8'
                        Software container 

In [None]:
[global]
# the output directory for generated files
parameter: cwd = path
# A string to identify your analysis run
parameter: name = f"{cwd:b}"
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
# Software container option
parameter: container_lmm = 'statisticalgenetics/lmm:1.8'
# expand_size converts a size string such as "16G" to bytes; it is used to set the PLINK --memory option
from sos.utils import expand_size
cwd = f"{cwd:a}"
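
The `mem` string is later turned into a number for PLINK's `--memory` option, which is expressed in MiB. The sketch below illustrates that conversion in plain Python. `size_to_bytes` and `plink_memory_mib` are hypothetical helpers written for this illustration (they are not `sos.utils.expand_size`), and decimal units (1G = 10^9 bytes) are an assumption.

```python
import re

# Hypothetical unit table; assumes decimal units (1G = 10**9 bytes).
UNITS = {"": 1, "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def size_to_bytes(size):
    """Convert a size string such as '16G' or '500M' to a number of bytes."""
    match = re.fullmatch(r"(\d+)\s*([KMGT]?)B?", size.strip().upper())
    if match is None:
        raise ValueError(f"Invalid size: {size!r}")
    number, unit = match.groups()
    return int(number) * UNITS[unit]


def plink_memory_mib(size, fraction=0.9):
    """Return a fraction of the requested memory, expressed in MiB."""
    usable_bytes = int(size_to_bytes(size) * fraction)
    return usable_bytes // 2**20


print(plink_memory_mib("16G"))
```

Keeping a margin below the requested memory (90% here, as in the workflow steps) leaves headroom for everything outside PLINK's main workspace.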

## Estimate kinship in the sample

The outputs are the pairwise kinship table (`*.kin0`) and a list of related individuals (`*.related_id`).
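
To locate these files after a run, the naming can be reproduced outside SoS. The sketch below mirrors the `output:` statements of `king_1` and `king_2` in the cell that follows; `king_outputs` is a hypothetical helper and the paths in the usage line are made up for illustration.

```python
from pathlib import Path


def king_outputs(cwd, geno_file, name):
    """Return the (kinship table, related-sample list) paths for one run."""
    stem = Path(geno_file).stem                # 'rename_chr22' for 'rename_chr22.bed'
    kin0 = Path(cwd) / f"{stem}.{name}.kin0"   # mirrors the output of king_1
    related = kin0.with_suffix(".related_id")  # mirrors the output of king_2
    return kin0, related


kin0, related = king_outputs("/tmp/output", "/data/rename_chr22.bed", "first")
print(kin0)
print(related)
```

The second path is the file to pass to `qc` through `--remove_samples`.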

In [None]:
# Inference of relationships in the sample to remove closely related individuals
[king_1]
# Plink binary file
parameter: genoFile = path
input: genoFile
output: f'{cwd}/{_input:bn}.{name}.kin0'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    plink2 \
      --bfile ${_input:n} \
      --make-king-table \
      --out ${_output:n} \
      --threads ${numThreads} \
      --memory ${int(expand_size(mem) * 0.9 / 1024**2)}
      
[king_2]
# Individuals with a kinship coefficient above this threshold are treated as related (e.g., first degree 0.25, second degree 0.125, third degree 0.0625)
parameter: kinship = 0.0625
# If set to true, the unrelated individuals in a family will be kept without being reported. 
# Otherwise (use `--no-maximize-unrelated`) the entire family will be removed
parameter: maximize_unrelated = True
output: f'{_input:n}.related_id'
R:  container=container_lmm, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    library(dplyr)
    library(igraph)
    # Remove related individuals while keeping maximum number of individuals
    # this function is simplified from: 
    # https://rdrr.io/cran/plinkQC/src/R/utils.R
    #' @param relatedness [data.frame] containing pair-wise relatedness estimates
    #' (in column [relatednessRelatedness]) for individual 1 (in column
    #' [relatednessIID1]) and individual 2 (in column [relatednessIID2]). Columns
    #' relatednessIID1, relatednessIID2 and relatednessRelatedness have to be present,
    #' while additional columns such as family IDs can be present. Default column
    #' names correspond to column names in output of plink --genome
    #' (\url{https://www.cog-genomics.org/plink/1.9/ibd}). All original
    #' columns for pair-wise highIBDTh fails will be returned in fail_IBD.
    #' @param relatednessTh [double] Threshold for filtering related individuals.
    #' Individuals, whose pair-wise relatedness estimates are greater than this
    #' threshold are considered related.
    relatednessFilter <- function(relatedness, 
                                  relatednessTh,
                                  relatednessIID1="IID1", 
                                  relatednessIID2="IID2",
                                  relatednessRelatedness="KINSHIP") {
        # format data
        if (!(relatednessIID1 %in% names(relatedness))) {
            stop(paste("Column", relatednessIID1, "for relatedness not found!"))
        }
        if (!(relatednessIID2 %in% names(relatedness))) {
            stop(paste("Column", relatednessIID2, "for relatedness not found!"))
        }
        if (!(relatednessRelatedness %in% names(relatedness))) {
            stop(paste("Column", relatednessRelatedness,
                       "for relatedness not found!"))
        }

        iid1_index <- which(colnames(relatedness) == relatednessIID1)
        iid2_index <- which(colnames(relatedness) == relatednessIID2)

        relatedness[,iid1_index] <- as.character(relatedness[,iid1_index])
        relatedness[,iid2_index] <- as.character(relatedness[,iid2_index])

        relatedness_names <- names(relatedness)
        names(relatedness)[iid1_index] <- "IID1"
        names(relatedness)[iid2_index] <- "IID2"
        names(relatedness)[names(relatedness) == relatednessRelatedness] <- "M"

        # Remove symmetric IID rows
        relatedness_original <- relatedness
        relatedness <- dplyr::select_(relatedness, ~IID1, ~IID2, ~M)

        sortedIDs <- data.frame(t(apply(relatedness, 1, function(pair) {
            c(sort(c(pair[1], pair[2])))
            })), stringsAsFactors=FALSE)
        keepIndex <- which(!duplicated(sortedIDs))

        relatedness_original <- relatedness_original[keepIndex,]
        relatedness <- relatedness[keepIndex,]

        # individuals with at least one pair-wise comparison > relatednessTh
        # return NULL to failIDs if no one fails the relatedness check
        highRelated <- dplyr::filter_(relatedness, ~M > relatednessTh)
        if (nrow(highRelated) == 0) {
            return(list(relatednessFails=NULL, failIDs=NULL))
        }

        # all samples with related individuals
        allRelated <- c(highRelated$IID1, highRelated$IID2)
        uniqueIIDs <- unique(allRelated)

        # Further selection of samples with relatives in cohort
        multipleRelative <- unique(allRelated[duplicated(allRelated)])
        singleRelative <- uniqueIIDs[!uniqueIIDs %in% multipleRelative]

        highRelatedMultiple <- highRelated[highRelated$IID1 %in% multipleRelative |
                                            highRelated$IID2 %in% multipleRelative,]
        highRelatedSingle <- highRelated[highRelated$IID1 %in% singleRelative &
                                           highRelated$IID2 %in% singleRelative,]

        # Individuals with only one relative in the cohort
        if(length(singleRelative) != 0) {
          # exclude the first individual (column IID1) of each such pair
          failIDs_single <- highRelatedSingle[,1]
            
        } else {
          failIDs_single <- NULL
        }
  
        # An individual has multiple relatives
        if(length(multipleRelative) != 0) {
            relatedPerID <- lapply(multipleRelative, function(x) {
                tmp <- highRelatedMultiple[rowSums(
                    cbind(highRelatedMultiple$IID1 %in% x,
                          highRelatedMultiple$IID2 %in% x)) != 0,1:2]
                rel <- unique(unlist(tmp))
                return(rel)
            })
            names(relatedPerID) <- multipleRelative

            keepIDs_multiple <- lapply(relatedPerID, function(x) {
                pairwise <- t(combn(x, 2))
                index <- (highRelatedMultiple$IID1 %in% pairwise[,1] &
                              highRelatedMultiple$IID2 %in% pairwise[,2]) |
                    (highRelatedMultiple$IID1 %in% pairwise[,2] &
                         highRelatedMultiple$IID2 %in% pairwise[,1])
                combination <- highRelatedMultiple[index,]
                combination_graph <- igraph::graph_from_data_frame(combination,
                                                                   directed=FALSE)
                all_iv_set <- igraph::ivs(combination_graph)
                length_iv_set <- sapply(all_iv_set, function(x) length(x))

                if (all(length_iv_set == 1)) {
                    # check how often they occur elsewhere
                    occurrence <- sapply(x, function(id) {
                        sum(sapply(relatedPerID, function(idlist) id %in% idlist))
                    })
                    # if occurrence the same everywhere, pick the first, else keep
                    # the one with minimum occurrence elsewhere
                    if (length(unique(occurrence)) == 1) {
                        nonRelated <- sort(x)[1]
                    } else {
                        nonRelated <- names(occurrence)[which.min(occurrence)]
                    }
                } else {
                    nonRelated <- all_iv_set[which.max(length_iv_set)]
                }
                return(nonRelated)
            })
            keepIDs_multiple <- unique(unlist(keepIDs_multiple))
            failIDs_multiple <- c(multipleRelative[!multipleRelative %in%
                                                       keepIDs_multiple])
        } else {
            failIDs_multiple <- NULL
        }
        allFailIIDs <- c(failIDs_single, failIDs_multiple)
        relatednessFails <- lapply(allFailIIDs, function(id) {
            fail_inorder <- relatedness_original$IID1 == id &
                relatedness_original$M > relatednessTh
            fail_inreverse <- relatedness_original$IID2 == id &
                relatedness_original$M > relatednessTh
            if (any(fail_inreverse)) {
                inreverse <- relatedness_original[fail_inreverse, ]
                id1 <- iid1_index
                id2 <- iid2_index
                inreverse[,c(id1, id2)] <- inreverse[,c(id2, id1)]
                names(inreverse) <- relatedness_names
            } else {
                inreverse <- NULL
            }
            inorder <- relatedness_original[fail_inorder, ]
            names(inorder) <- relatedness_names
            return(rbind(inorder, inreverse))
        })
        relatednessFails <- do.call(rbind, relatednessFails)
        if (nrow(relatednessFails) == 0) {
            relatednessFails <- NULL
            failIDs <- NULL
        } else {
            names(relatednessFails) <- relatedness_names
            rownames(relatednessFails) <- 1:nrow(relatednessFails)
            uniqueFails <- relatednessFails[!duplicated(relatednessFails[,iid1_index]),]
            failIDs <- uniqueFails[,iid1_index]
        }
        return(list(relatednessFails=relatednessFails, failIDs=failIDs))
    }
    
  
    # main code
    kin0 <- read.table(${_input:r}, header=FALSE, stringsAsFactors=FALSE)
    colnames(kin0) <- c("FID1","ID1","FID2","ID2","NSNP","HETHET","IBS0","KINSHIP")
    if (${"TRUE" if maximize_unrelated else "FALSE"}) {
        rel <- relatednessFilter(kin0, ${kinship}, "ID1", "ID2", "KINSHIP")$failIDs
        tmp1 <- kin0[,1:2]
        tmp2 <- kin0[,3:4]
        colnames(tmp1) = colnames(tmp2) = c("FID", "ID")
        # Get the family ID for these rels so there are two columns FID and IID in the output
        lookup <- dplyr::distinct(rbind(tmp1,tmp2))
        dat <- lookup[which(lookup[,2] %in% rel),]
    } else {
        rel <- kin0 %>% filter(KINSHIP >= ${kinship})
        IID <- sort(unique(unlist(rel[, c("ID1", "ID2")])))
        dat <- data.frame(IID)
        dat <- dat %>%
            mutate(FID = IID) %>%
            select(FID, IID)
    }
    cat("There are", nrow(dat),"related individuals using a kinship threshold of ${kinship}\n")
    write.table(dat,${_output:r}, quote=FALSE, row.names=FALSE, col.names=FALSE)
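
The idea behind `maximize_unrelated` can be illustrated on a toy example: treat every pair above the kinship threshold as an edge in a graph, then drop individuals until no edge remains. The Python sketch below uses a simple greedy heuristic (repeatedly drop the individual with the most remaining relatives). This is a stand-in chosen for brevity, not the independent-set search that the R code performs with `igraph`, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def related_to_remove(pairs, threshold):
    """Greedily pick individuals to drop so that no related pair remains.

    pairs: iterable of (id1, id2, kinship) tuples.
    """
    # Build an undirected graph of pairs whose kinship exceeds the threshold.
    neighbours = defaultdict(set)
    for id1, id2, kinship in pairs:
        if kinship > threshold:
            neighbours[id1].add(id2)
            neighbours[id2].add(id1)

    removed = []
    while True:
        # Number of remaining relatives for each individual that still has any.
        degrees = {i: len(rel) for i, rel in neighbours.items() if rel}
        if not degrees:
            break
        # Drop the individual with the most relatives; ties go to the smallest ID.
        worst = min(degrees, key=lambda i: (-degrees[i], i))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.append(worst)
    return sorted(removed)


toy_pairs = [
    ("A", "B", 0.25),   # A is related to both B and C
    ("A", "C", 0.26),
    ("D", "E", 0.30),   # an isolated related pair
    ("F", "G", 0.01),   # below the threshold, both are kept
]
print(related_to_remove(toy_pairs, threshold=0.0625))
```

Dropping `A` alone resolves both of its pairs, so `B` and `C` are retained; removing whole families, as `--no-maximize-unrelated` does, would instead discard `A` to `E`.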

## Genotype and sample QC

QC the genetic data based on MAF, sample and variant missingness, and Hardy-Weinberg equilibrium (HWE).

In this step you may also provide a list of samples to keep, for example to subset samples by ancestry and analyze each group independently.
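
To make the variant-level filters concrete, the sketch below computes the MAF and missing rate of a single variant from per-sample allele counts and applies thresholds in the same direction as the `maf_filter` and `geno_filter` parameters. It is a plain Python illustration with hypothetical function names and toy data, not what PLINK runs internally; the HWE test and per-sample missingness are left out.

```python
def variant_stats(genotypes):
    """Return (maf, missing_rate) for one variant.

    genotypes: per-sample alternate allele counts (0, 1 or 2), None if missing.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = (len(genotypes) - len(called)) / len(genotypes)
    if not called:
        return 0.0, missing_rate
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq), missing_rate


def passes_variant_qc(genotypes, maf_filter=0.01, geno_filter=0.01):
    """Keep a variant if its MAF is high enough and its missing rate low enough."""
    maf, missing_rate = variant_stats(genotypes)
    return maf >= maf_filter and missing_rate <= geno_filter


toy_variant = [0, 1, 2, None, 0]           # five samples, one missing call
print(variant_stats(toy_variant))          # MAF and missing rate
print(passes_variant_qc(toy_variant))      # fails: 20% missing exceeds 1%
print(passes_variant_qc(toy_variant, geno_filter=0.2))
```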

In [2]:
# Filter SNPs and select individuals 
[qc_1 (basic QC filters)]
# The path to the file that contains the list of samples to remove (format FID, IID)
parameter: remove_samples = path('.')
# The path to the file that contains the list of samples to keep (format FID, IID)
parameter: keep_samples = path('.')
# The path to the file that contains the list of variants to keep
parameter: keep_variants = path('.')
# minimum MAF filter to use. Notice that PLINK default is 0.01
parameter: maf_filter = 0.01
# maximum MAF filter to use; values <= 0 disable it
parameter: maf_max_filter = 0.0
# Maximum missingness per-variant
parameter: geno_filter = 0.01
# Maximum missingness per-sample
parameter: mind_filter = 0.02
# HWE filter 
parameter: hwe_filter = 5e-08
# Plink binary files
parameter: genoFile = paths

fail_if(not (keep_samples.is_file() or keep_samples == path('.')), msg = f'Cannot find ``{keep_samples}``')
fail_if(not (keep_variants.is_file() or keep_variants == path('.')), msg = f'Cannot find ``{keep_variants}``')
fail_if(not (remove_samples.is_file() or remove_samples == path('.')), msg = f'Cannot find ``{remove_samples}``')

input: genoFile, group_by=1
output: f'{cwd}/cache/{_input:bn}.{name}.filtered{".extracted" if keep_variants.is_file() else ""}.bed'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, volumes=[f'{cwd}:{cwd}'], expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    plink2 \
      --bfile ${_input:n} \
      ${('--maf %s' % maf_filter) if maf_filter >= 0 else ''} \
      ${('--max-maf %s' % maf_max_filter) if maf_max_filter > 0 else ''} \
      ${('--geno %s' % geno_filter) if geno_filter >= 0 else ''} \
      ${('--hwe %s' % hwe_filter) if hwe_filter >= 0 else ''} \
      ${('--mind %s' % mind_filter) if mind_filter >= 0 else ''} \
      ${('--keep %s' % keep_samples) if keep_samples.is_file() else ""} \
      ${('--remove %s' % remove_samples) if remove_samples.is_file() else ""} \
      ${('--extract %s' % keep_variants) if keep_variants.is_file() else ""} \
      --make-bed \
      --out ${_output:n} \
      --threads ${numThreads} \
      --memory ${int(expand_size(mem) * 0.9 / 1024**2)}
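
The command in `qc_1` is assembled from optional pieces: each numeric filter is emitted only when its value is non-negative (strictly positive for the maximum MAF), and each file option only when the file exists. The sketch below restates that logic as a standalone Python function; `build_qc_args` is a hypothetical name and `None` is used here in place of the `path('.')` placeholder.

```python
from pathlib import Path


def build_qc_args(maf_filter=0.01, maf_max_filter=0.0, geno_filter=0.01,
                  mind_filter=0.02, hwe_filter=5e-08,
                  keep_samples=None, remove_samples=None, keep_variants=None):
    """Return the list of optional PLINK arguments for the basic QC step."""
    args = []
    if maf_filter >= 0:
        args += ["--maf", str(maf_filter)]
    if maf_max_filter > 0:
        args += ["--max-maf", str(maf_max_filter)]
    if geno_filter >= 0:
        args += ["--geno", str(geno_filter)]
    if hwe_filter >= 0:
        args += ["--hwe", str(hwe_filter)]
    if mind_filter >= 0:
        args += ["--mind", str(mind_filter)]
    # File-based options are added only when the file actually exists.
    for flag, file in (("--keep", keep_samples),
                       ("--remove", remove_samples),
                       ("--extract", keep_variants)):
        if file is not None and Path(file).is_file():
            args += [flag, str(file)]
    return args


print(" ".join(build_qc_args()))
print(" ".join(build_qc_args(maf_filter=-1, hwe_filter=-1)))
```

Passing a negative value is therefore the way to switch an individual filter off.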

In [1]:
# LD pruning
[qc_2 (LD pruning)]
# Window size
parameter: window = 50
# Shift window every 10 snps
parameter: shift = 10
# Pairwise r^2 threshold above which one variant of a pair is pruned
parameter: r2 = 0.1
output: bed=f'{_input:n}.prune.bed', prune=f'{_input:n}.prune.in'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    plink2 \
    --bfile ${_input:n} \
    --indep-pairwise ${window} ${shift} ${r2}  \
    --out ${_output["prune"]:nn} \
    --threads ${numThreads} \
    --memory ${int(expand_size(mem) * 0.9 / 1024**2)}
   
    plink2 \
    --bfile ${_input:n} \
    --extract ${_output['prune']} \
    --make-bed \
    --out ${_output['bed']:n} \
    --threads ${numThreads} \
    --memory ${int(expand_size(mem) * 0.9 / 1024**2)}
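
The roles of `window`, `shift` and `r2` can be seen in a toy version of window-based pruning. The Python sketch below slides a window of `window` variants forward by `shift` variants at a time and, for every pair inside the window whose squared correlation exceeds `r2`, drops the later variant. This is a simplified illustration with hypothetical function names; PLINK's own rule for choosing which variant of a correlated pair to drop is not reproduced here.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic variant carries no LD information
    return cov * cov / (var_x * var_y)


def ld_prune(genotypes, window=50, shift=10, r2=0.1):
    """Return the indices of variants kept after pruning.

    genotypes: list of variants, each a list of per-sample allele counts.
    """
    keep = [True] * len(genotypes)
    start = 0
    while start < len(genotypes):
        end = min(start + window, len(genotypes))
        in_window = [i for i in range(start, end) if keep[i]]
        for i, j in combinations(in_window, 2):
            if keep[i] and keep[j] and r_squared(genotypes[i], genotypes[j]) > r2:
                keep[j] = False  # simplification: always drop the later variant
        start += shift
    return [i for i, kept in enumerate(keep) if kept]


v0 = [0, 1, 2, 0, 1, 2]
v1 = [0, 1, 2, 0, 1, 2]   # identical to v0, so r^2 = 1
v2 = [2, 0, 1, 1, 2, 0]   # r^2 with v0 is 0.25
print(ld_prune([v0, v1, v2], r2=0.5))
print(ld_prune([v0, v1, v2], r2=0.1))
```

A lower `r2` prunes more aggressively: at 0.5 only the duplicate variant is dropped, while at 0.1 the moderately correlated third variant is dropped as well.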

In [None]:
# Merge all the .bed files into one bed file 
[qc_3 (merge all files)]
parameter: merged_prefix = ""
merged_prefix = f'{_input[0]:bn}.merged' if merged_prefix == '' else merged_prefix
sos_run("merge_plink", merged_prefix=merged_prefix, genoFile=_input['bed'])

In [None]:
[merge_plink]
parameter: merged_prefix = str
parameter: genoFile = paths
skip_if(len(genoFile) == 1)
input: genoFile, group_by = 'all'
output: f"{_input[0]:d}/{merged_prefix}.bed"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    echo -e ${' '.join([str(x)[:-4] for x in _input[1:]])} | sed 's/ /\n/g' > ${_output:n}.merge_list
    plink2 \
    --bfile ${_input[0]:n} \
    --merge-list ${_output:n}.merge_list \
    --make-bed \
    --out ${_output:n} \
    --threads ${numThreads} \
    --memory ${int(expand_size(mem) * 0.9 / 1024**2)}
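
The merge list written by the `echo ... | sed` line contains one file prefix per line for every input except the first, which is passed through `--bfile`. A minimal Python sketch of the same construction is shown below; `merge_list_lines` and the file names are hypothetical.

```python
from pathlib import Path


def merge_list_lines(bed_files):
    """Return the prefixes (paths without '.bed') of all files but the first."""
    return [str(Path(bed).with_suffix("")) for bed in bed_files[1:]]


beds = ["cache/chr1.first.filtered.prune.bed",
        "cache/chr2.first.filtered.prune.bed",
        "cache/chr3.first.filtered.prune.bed"]
print("\n".join(merge_list_lines(beds)))
```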