# Principal Components Analysis

This notebook will conduct the final data preparation steps and perform PCA, generating plots and summary statistics.

## Overview

Steps to generate a PCA include 

- removing related individuals
- pruning variants in linkage disequilibrium (LD)
- performing PCA on the genotypes of unrelated individuals
- excluding outlier samples in the PCA space for individuals of homogeneous self-reported ancestry. These outliers may suggest poor genotyping quality or distant relatedness.

Pitfalls

1. Some PCs may capture LD structure rather than population structure, which reduces power to detect associations in regions of high LD
2. When a new study dataset is projected onto the PCA space computed from a reference dataset, the projected PCs are shrunk toward 0 in the new dataset
3. PC scores may capture outliers that are due to family structure, population structure or other reasons; it can be beneficial to detect and remove these individuals, either to maximize the population structure captured by PCA (when removing a few outliers) or to restrict analyses to genetically homogeneous samples

## Method

Here is a quick recap of PCA for those not immediately familiar with the method. PCA is a mathematical method that reduces the dimensionality of the data while retaining most of the variation in the dataset. 
This is accomplished by identifying directions, or principal components (PCs), that account for the maximum variation in the data. 

One common approach to PCA is based on the singular-value decomposition of the data matrix $X$ (in our case the genotype matrix),

$$X = U D V^T,$$

where the columns of $U$ are the left singular vectors, $D$ is the diagonal matrix of singular values, and the columns of $V$ are the right singular vectors (also called loadings). 

PCA can also be done using the eigen-decomposition of $X X^T$:

$$X X^T = U S U^T,$$ 

where $S=D^2$ is the diagonal matrix of eigenvalues.
$X$ is usually centred (mean-subtracted) or standardised (mean subtracted, then divided by standard deviation) before PCA.

For PCA of SNP genotypes (at least in diploid organisms), the common standardisation is

$$X_{ij}^{\prime} = \frac{X_{ij} - 2p_j}{\sqrt{2 p_j (1 - p_j)}},$$

where $X_{ij}$ is the genotype (minor allele dosage $\{0, 1, 2\}$) for the $i$th individual and the $j$th SNP, and $p_j$ is the minor allele frequency (MAF) for the $j$th SNP. In addition, the eigenvalues are scaled by the number of SNPs $m$ (equivalent to performing the eigen-decomposition of $XX^T/m$).
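The standardisation and the scaled eigen-decomposition above can be illustrated with a small sketch. It is not the pipeline's implementation: the toy genotype matrix and the function names (`standardise`, `scaled_gram`, `leading_eigenpair`) are made up for illustration, and plain power iteration stands in for a full eigen-solver.

```python
import math
import random


def standardise(genotypes):
    """Apply (X_ij - 2 p_j) / sqrt(2 p_j (1 - p_j)) to every column."""
    n, m = len(genotypes), len(genotypes[0])
    out = [[0.0] * m for _ in range(n)]
    for j in range(m):
        p = sum(row[j] for row in genotypes) / (2 * n)  # allele frequency p_j
        sd = math.sqrt(2 * p * (1 - p))
        if sd == 0:
            continue  # monomorphic SNP: leave the column at zero
        for i in range(n):
            out[i][j] = (genotypes[i][j] - 2 * p) / sd
    return out


def scaled_gram(x):
    """Return the n x n matrix X X^T / m, where m is the number of SNPs."""
    n, m = len(x), len(x[0])
    return [[sum(a * b for a, b in zip(x[i], x[k])) / m for k in range(n)]
            for i in range(n)]


def leading_eigenpair(a, iters=500):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    n = len(a)
    rng = random.Random(0)
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        w = [sum(a[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    av = [sum(a[i][k] * v[k] for k in range(n)) for i in range(n)]
    return sum(x * y for x, y in zip(v, av)), v


# Toy data (hypothetical): 6 individuals x 4 SNPs, dosages in {0, 1, 2}
genotypes = [
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [2, 2, 1, 2],
    [2, 1, 2, 2],
    [1, 2, 2, 1],
]
x_std = standardise(genotypes)
gram = scaled_gram(x_std)
eigenvalue, pc1 = leading_eigenpair(gram)
print("leading eigenvalue:", round(eigenvalue, 4))
print("PC1 direction per individual:", [round(c, 3) for c in pc1])
```

In this toy example the first three individuals and the last three receive PC1 coordinates of opposite sign, which is the kind of separation a PC plot is meant to reveal.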

## Workflow

1. Estimate relatedness among the individuals in the sample using PLINK 2, which implements the KING algorithm
2. Select specific SNPs and samples using PLINK, and remove related individuals
3. Thin SNPs by LD pruning

The above steps are implemented in the `GWAS_QC.ipynb` workflow.

4. Run PCA using only unrelated individuals for all populations, and examine the resulting plot
5. Project related individuals back onto the PC space, and generate a list of suggested samples to remove based on the Mahalanobis distance test statistic per population. The default criterion is the 0.997 percentile (two-sided), but we recommend checking the output plots before and after removal and reconsidering the threshold if needed.
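The outlier rule in step 5 can be sketched as follows. This is a simplified illustration, not the pipeline's code: it uses only two PCs, the classical (non-robust) covariance, and assumes that the probability cutoff is applied as a chi-square quantile of the squared distance. The function names and the toy scores are hypothetical.

```python
import math


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (PC1, PC2) point from the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    # d^2 = dev^T S^{-1} dev, written out for the 2 x 2 case
    return [
        ((x - mx) ** 2 * syy - 2 * (x - mx) * (y - my) * sxy + (y - my) ** 2 * sxx) / det
        for x, y in points
    ]


def chi2_quantile_df2(prob):
    """Quantile of the chi-square distribution with 2 degrees of freedom (closed form)."""
    return -2.0 * math.log(1.0 - prob)


def flag_outliers(points, prob=0.997):
    """Indices of points whose squared distance exceeds the chi-square cutoff."""
    cutoff = chi2_quantile_df2(prob)
    return [i for i, d2 in enumerate(mahalanobis_sq_2d(points)) if d2 > cutoff]


# Toy PC scores (hypothetical): a tight cluster of 20 samples plus one distant sample
cluster = [(0.1 * i, 0.1 * j - 0.15) for i in range(-2, 3) for j in range(4)]
scores = cluster + [(3.0, -2.5)]
print("cutoff:", round(chi2_quantile_df2(0.997), 3))
print("flagged sample indices:", flag_outliers(scores))
```

Because a single extreme sample inflates the classical covariance estimate, a robust estimator (the workflow exposes a `--robust` option) is generally preferable on real data; the classical version is used here only to keep the sketch short.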

The analysis above can also be performed with reference data integrated, e.g. 1000 Genomes, to help diagnose population substructure in the data.

If you have subpopulations in the data, apply the following additional steps:

6. Split the data by population; each population should have both related and unrelated individual data sets
7. For each population, perform QC
8. For each population, recalculate per-population PCs for unrelated individuals
9. For each population, project related samples back onto the PC space
10. Remove outliers based on the previously generated list

## Input

1. `protocol_example.genotype.chr21_22.fam`
2. `protocol_example.protein.csv`
3. `protocol_example.genotype.chr21_22.bed`

## Output

1. PCA models and scores (`.rds` file)
2. Mahalanobis Distance Summary Statistics (`.md` file)
3. Mahalanobis Distance Histogram & QQ-plots (`.png` file)
4. Scree Plot and Cumulative PVE Plot (`.png` file)


## Minimal Working Example

The proteomics data used in this MWE can be found on [Synapse](https://www.synapse.org/#!Synapse:syn52369482).

### Step 1: Sample match with genotype
Timing: < 1 minute

In [None]:
sos run pipeline/GWAS_QC.ipynb genotype_phenotype_sample_overlap \
        --cwd output/sample_meta \
        --genoFile input/protocol_example.genotype.chr21_22.fam  \
        --phenoFile input/protocol_example.protein.csv \
        --container singularity/bioinfo.sif \
        --mem 5G

```
INFO: Running genotype_phenotype_sample_overlap: This workflow extracts overlapping samples for genotype data with phenotype data, and output the filtered sample genotype list as well as sample phenotype list
INFO: genotype_phenotype_sample_overlap is completed.
INFO: genotype_phenotype_sample_overlap output:   /Users/alexmccreight/xqtl-pipeline-new/output/sample_meta/protocol_example.protein.sample_overlap.txt /Users/alexmccreight/xqtl-pipeline-new/output/sample_meta/protocol_example.protein.sample_genotypes.txt
INFO: Workflow genotype_phenotype_sample_overlap (ID=waecd9cbee7d661b4) is executed successfully with 1 completed step.
```
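As a rough illustration of what this step computes when no sample lookup file is given, the sketch below keeps the genotype samples whose IDs (second column of the `.fam` file) also appear as column names in the phenotype file. The function name, the toy file contents and the comma delimiter are assumptions made for the example.

```python
import csv
import io


def overlapping_samples(fam_text, pheno_text, delimiter=","):
    """Return (FID, IID) pairs from the fam file whose IID is a phenotype column name."""
    fam_rows = [line.split() for line in fam_text.splitlines() if line.strip()]
    # Only the header row of the phenotype file is needed to get sample names
    header = next(csv.reader(io.StringIO(pheno_text), delimiter=delimiter))
    pheno_ids = set(header)
    return [(row[0], row[1]) for row in fam_rows if row[1] in pheno_ids]


# Hypothetical toy inputs: three genotyped samples, two of them with phenotype data
fam_text = "FAM1 S1 0 0 1 -9\nFAM2 S2 0 0 2 -9\nFAM3 S3 0 0 1 -9\n"
pheno_text = "#chr,start,end,ID,S1,S3,S9\nchr21,100,101,protA,0.1,0.2,0.3\n"

for fid, iid in overlapping_samples(fam_text, pheno_text):
    print(fid, iid, sep="\t")
```

The tab-separated FID/IID pairs mirror the layout of the `sample_genotypes.txt` file that later steps pass to `--keep-samples`.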

### Step 2: Kinship quality control
Timing: < 1 minute

In [None]:
sos run pipeline/GWAS_QC.ipynb king \
    --cwd output/kinship \
    --genoFile input/protocol_example.genotype.chr21_22.bed \
    --name pQTL \
    --keep-samples output/sample_meta/protocol_example.protein.sample_genotypes.txt \
    --container singularity/bioinfo.sif \
    --no-maximize-unrelated \
    --mem 40G

```
INFO: Running king_1: Inference of relationships in the sample to identify closely related individuals
INFO: king_1 is completed.
INFO: king_1 output:   /Users/alexmccreight/xqtl-pipeline-new/output/kinship/protocol_example.genotype.chr21_22.pQTL.kin0
INFO: Running king_2: Select a list of unrelated individual with an attempt to maximize the unrelated individuals selected from the data
INFO: king_2 is completed.
INFO: king_2 output:   /Users/alexmccreight/xqtl-pipeline-new/output/kinship/protocol_example.genotype.chr21_22.pQTL.related_id
INFO: Running king_3: Split genotype data into related and unrelated samples, if related individuals are detected
INFO: king_3 is completed.
INFO: king_3 output:   /Users/alexmccreight/xqtl-pipeline-new/output/kinship/protocol_example.genotype.chr21_22.pQTL.unrelated.bed /Users/alexmccreight/xqtl-pipeline-new/output/kinship/protocol_example.genotype.chr21_22.pQTL.related.bed
INFO: Workflow king (ID=w2be67e3f13a70573) is executed successfully with 3 completed steps.
```
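With `--no-maximize-unrelated`, every individual that appears in a pair at or above the kinship threshold is reported as related. A minimal sketch of that selection is shown below, assuming the eight-column `.kin0` layout used by this workflow (`FID1`, `ID1`, `FID2`, `ID2`, `NSNP`, `HETHET`, `IBS0`, `KINSHIP`). The function name and toy table are hypothetical, and deduplicating the output is an assumption of the sketch.

```python
def related_ids(kin0_text, kinship=0.0625):
    """Collect unique (FID, IID) pairs involved in any pair with KINSHIP >= threshold."""
    related, seen = [], set()
    for line in kin0_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        fid1, id1, fid2, id2, _nsnp, _hethet, _ibs0, coef = line.split()
        if float(coef) >= kinship:
            for sample in ((fid1, id1), (fid2, id2)):
                if sample not in seen:
                    seen.add(sample)
                    related.append(sample)
    return related


# Hypothetical toy table: one first-degree-like pair and one unrelated pair
kin0_text = (
    "#FID1\tID1\tFID2\tID2\tNSNP\tHETHET\tIBS0\tKINSHIP\n"
    "F1\tA\tF1\tB\t1000\t0.15\t0.001\t0.26\n"
    "F2\tC\tF3\tD\t1000\t0.10\t0.050\t0.01\n"
)
print(related_ids(kin0_text))
```

Removing both members of each related pair is simple but discards more samples than necessary, which is the trade-off the `--maximize-unrelated` option is meant to address.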

### Step 3: Prepare unrelated individuals data for PCA
Timing: < 1 minute

In [None]:
sos run pipeline/GWAS_QC.ipynb qc \
   --cwd output/cache \
   --genoFile output/kinship/protocol_example.genotype.chr21_22.pQTL.unrelated.bed \
   --mac-filter 5 \
   --container singularity/bioinfo.sif \
   --mem 16G

```
INFO: Running basic QC filters: Filter SNPs and select individuals
INFO: basic QC filters is completed.
INFO: basic QC filters output:   /Users/alexmccreight/xqtl-pipeline-new/output/cache/protocol_example.genotype.chr21_22.pQTL.unrelated.plink_qc.bed
INFO: Running LD pruning: LD prunning and remove related individuals (both ind of a pair) Plink2 has multi-threaded calculation for LD prunning
INFO: LD pruning is completed.
INFO: LD pruning output:   /Users/alexmccreight/xqtl-pipeline-new/output/cache/protocol_example.genotype.chr21_22.pQTL.unrelated.plink_qc.prune.bed /Users/alexmccreight/xqtl-pipeline-new/output/cache/protocol_example.genotype.chr21_22.pQTL.unrelated.plink_qc.prune.in
INFO: Workflow qc (ID=w0f1d12c1e1c1d52a) is executed successfully with 2 completed steps.
```

In [None]:
sos run pipeline/GWAS_QC.ipynb qc \
   --cwd output/cache \
   --genoFile input/protocol_example.genotype.chr21_22.bed \
   --keep-samples output/sample_meta/protocol_example.protein.sample_genotypes.txt \
   --name pQTL \
   --mac-filter 5 \
   --container singularity/bioinfo.sif \
   --mem 40G

```
INFO: Running basic QC filters: Filter SNPs and select individuals
INFO: basic QC filters is completed.
INFO: basic QC filters output:   /Users/alexmccreight/xqtl-pipeline-new/output/cache/protocol_example.genotype.chr21_22.pQTL.plink_qc.bed
INFO: Running LD pruning: LD prunning and remove related individuals (both ind of a pair) Plink2 has multi-threaded calculation for LD prunning
INFO: LD pruning is completed.
INFO: LD pruning output:   /Users/alexmccreight/xqtl-pipeline-new/output/cache/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.bed /Users/alexmccreight/xqtl-pipeline-new/output/cache/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.in
INFO: Workflow qc (ID=wf360e1c84c1a8ec4) is executed successfully with 2 completed steps.
```

### Step 4: Run Principal Components Analysis on genotype
Timing: < 1 minute

In [None]:
sos run pipeline/PCA.ipynb flashpca \
   --cwd output/genotype_pca \
   --genoFile output/cache/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.bed \
   --container singularity/flashpcaR.sif \
   --mem 16G

```
INFO: Running flashpca_1: Run PCA analysis using flashpca
INFO: flashpca_1 is completed.
INFO: flashpca_1 output:   /Users/alexmccreight/xqtl-pipeline-new/output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.rds
INFO: Running flashpca_2:
INFO: flashpca_2 is completed (pending nested workflow).
INFO: Running detect_outliers: Calculate Mahalanobis distance per population and report outliers
INFO: detect_outliers is completed.
INFO: detect_outliers output:   /Users/alexmccreight/xqtl-pipeline-new/output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.mahalanobis.rds /Users/alexmccreight/xqtl-pipeline-new/output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.outliers... (5 items)
INFO: flashpca_2 output:   /Users/alexmccreight/xqtl-pipeline-new/output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.mahalanobis.rds /Users/alexmccreight/xqtl-pipeline-new/output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.outliers... (5 items)
INFO: Running flashpca_3:
INFO: flashpca_3 is completed (pending nested workflow).
INFO: Running plot_pca: Plot PCA results. Can be used independently as "plot_pca" or combined with other workflow as eg "flashpca+plot_pca"
INFO: plot_pca is completed.
INFO: plot_pca output:   /Users/alexmccreight/xqtl-pipeline-new/output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.pc.png /Users/alexmccreight/xqtl-pipeline-new/output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.scree.png... (3 items)
INFO: flashpca_3 output:   /Users/alexmccreight/xqtl-pipeline-new/output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.pc.png /Users/alexmccreight/xqtl-pipeline-new/output/genotype_pca/protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.scree.png
INFO: Workflow flashpca (ID=wb64a3efe8d5f81b8) is executed successfully with 5 completed steps.
```
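The scree and cumulative PVE plots are derived from the eigenvalues of the PCA. A minimal sketch of the proportion of variance explained (PVE) is shown below. The eigenvalues are made up, and normalising by the sum of the supplied eigenvalues is an assumption: with a partial PCA that returns only the top `k` components, the true total variance may be larger.

```python
def proportion_variance_explained(eigenvalues):
    """Return per-component PVE and cumulative PVE for a list of eigenvalues."""
    total = sum(eigenvalues)
    pve = [value / total for value in eigenvalues]
    cumulative, running = [], 0.0
    for share in pve:
        running += share
        cumulative.append(running)
    return pve, cumulative


# Hypothetical eigenvalues, sorted from largest to smallest
pve, cumulative = proportion_variance_explained([4.0, 2.0, 1.0, 1.0])
for rank, (share, total_share) in enumerate(zip(pve, cumulative), start=1):
    print(f"PC{rank}: PVE={share:.3f} cumulative={total_share:.3f}")
```

A scree plot shows the per-component values against the PC rank, and the point where the curve flattens is one informal guide for choosing how many PCs to use for outlier detection (`--maha-k`).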

## Command interface

In [None]:
sos run PCA.ipynb -h
sos run GWAS_QC.ipynb -h

In [None]:
usage: sos run PCA.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  pca_plink
  flashpca
  project_samples
  plot_pca
  detect_outliers

Global Workflow Options:
  --cwd output (as path)
                        the output directory for generated files
  --name ''
                        A string to identify your analysis run
  --pop-col ''
                        Name of the population column in the phenoFile
  --pops  (as list)
                        Name of the populations (from the population column) you
                        would like to plot and show on the PCA plot
  --label-col ''
                        Name of the color label column in the phenoFile; can be
                        the same as population column. Can also be a separate
                        column eg a "super population" column as a way to enable
                        you to combine selected populations based on another
                        column.
  --k 20 (as int)
                        Number of Principal Components to output,must be
                        consistant between flashpca run and project samples run
                        (flashpca partial PCA method).
  --maha-k 5 (as int)
                        Number of Principal Components based on which outliers
                        should be evaluated. Default is 5 but this should be
                        based on examine the scree plot
  --[no-]homogeneous (default to False)
                        Homogeneity of populations. Set to --homogeneous when
                        true and --no-homogeneous when false
  --container ''
                        Software container option
  --entrypoint ('micromamba run -a "" -n' + ' ' + container.split('/')[-1][:-4]) if container.endswith('.sif') else ""

  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 10 (as int)
                        Number of threads

Sections
  pca_plink:            PCA command with PLINK, as a sanity check
    Workflow Options:
      --genoFile VAL (as path, required)
                        PLINK binary file
  flashpca_1:           Run PCA analysis using flashpca
    Workflow Options:
      --genoFile VAL (as path, required)
                        Plink binary file
      --phenoFile  path(f'{genoFile}'.replace(".bed",".fam"))

                        The phenotypic file
      --min-pop-size 2 (as int)
                        minimum population size to consider in the analysis
      --stand binom2
                        How to standardize X before PCA
  project_samples_1:    Project back to PCA model additional samples
    Workflow Options:
      --genoFile VAL (as path, required)
                        Plink binary file
      --phenoFile  path(f'{genoFile}'.replace(".bed",".fam"))

                        The phenotypic file
      --pca-model  f'{cwd}/{phenoFile:bn}{("."+name) if name else ""}.{(suffix+".") if suffix != "" else ""}pca.rds'

  plot_pca:             Plot PCA results. Can be used independently as
                        "plot_pca" or combined with other workflow as eg
                        "flashpca+plot_pca"
    Workflow Options:
      --outlier-file . (as path)
      --plot-data VAL (as path, required)
      --min-axis ''
      --max-axis ''
  detect_outliers:      Calculate Mahalanobis distance per population and report
                        outliers
    Workflow Options:
      --prob 0.997 (as float)
                        Set the probability to remove outliers eg 0.95 or 0.997
      --pval 0.05 (as float)
                        Mahalanobis distance p-value cutoff
      --[no-]robust (default to True)
                        Robust Mahalanobis to outliers
      --pca-result VAL (as path, required)
  flashpca_2, project_samples_2:
    Workflow Options:
      --prob 0.997 (as float)
                        Set the probability to remove outliers eg 0.95 or 0.997
      --[no-]robust (default to True)
                        Robust Mahalanobis to outliers
  flashpca_3, project_samples_3:

In [None]:
usage: sos run GWAS_QC.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  king
  qc_no_prune
  qc
  genotype_phenotype_sample_overlap

Global Workflow Options:
  --cwd output (as path)
                        the output directory for generated files
  --name ''
                        A string to identify your analysis run
  --genoFile  paths

                        PLINK binary files
  --remove-samples . (as path)
                        The path to the file that contains the list of samples
                        to remove (format FID, IID)
  --keep-samples . (as path)
                        The path to the file that contains the list of samples
                        to keep (format FID, IID)
  --keep-variants . (as path)
                        The path to the file that contains the list of variants
                        to keep
  --exclude-variants . (as path)
                        The path to the file that contains the list of variants
                        to exclude
  --kinship 0.0625 (as float)
                        Kinship coefficient threshold for related individuals
                        (e.g first degree above 0.25, second degree above 0.125,
                        third degree above 0.0625)
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 20 (as int)
                        Number of threads
  --container ''
                        Software container option
  --entrypoint ('micromamba run -a "" -n' + ' ' + container.split('/')[-1][:-4]) if container.endswith('.sif') else ""


Sections
  king_1:               Inference of relationships in the sample to identify
                        closely related individuals
    Workflow Options:
      --kin-maf 0.01 (as float)
                        PLINK binary file
  king_2:               Select a list of unrelated individual with an attempt to
                        maximize the unrelated individuals selected from the
                        data
    Workflow Options:
      --[no-]maximize-unrelated (default to False)
                        If set to true, the unrelated individuals in a family
                        will be kept without being reported. Otherwise (use
                        `--no-maximize-unrelated`) the entire family will be
                        removed Note that attempting to maximize unrelated
                        individuals is computationally intensive on large data.
  king_3:               Split genotype data into related and unrelated samples,
                        if related individuals are detected
  qc_no_prune, qc_1:    Filter SNPs and select individuals
    Workflow Options:
      --maf-filter 0.0 (as float)
                        minimum MAF filter to use. 0 means do not apply this
                        filter.
      --maf-max-filter 0.0 (as float)
                        maximum MAF filter to use. 0 means do not apply this
                        filter.
      --mac-filter 0.0 (as float)
                        minimum MAC filter to use. 0 means do not apply this
                        filter.
      --mac-max-filter 0.0 (as float)
                        maximum MAC filter to use. 0 means do not apply this
                        filter.
      --geno-filter 0.1 (as float)
                        Maximum missingess per-variant
      --mind-filter 0.1 (as float)
                        Maximum missingness per-sample
      --hwe-filter 1e-15 (as float)
                        HWE filter -- a very lenient one
      --other-args  (as list)
                        Other PLINK arguments e.g snps_only, write-samples, etc
      --[no-]meta-only (default to False)
                        Only output SNP and sample list, rather than the PLINK
                        binary format of subset data
      --[no-]rm-dups (default to False)
                        Remove duplicate variants
  qc_2:                 LD prunning and remove related individuals (both ind of
                        a pair) Plink2 has multi-threaded calculation for LD
                        prunning
    Workflow Options:
      --window 50 (as int)
                        Window size
      --shift 10 (as int)
                        Shift window every 10 snps
      --r2 0.1 (as float)
  genotype_phenotype_sample_overlap: This workflow extracts overlapping samples
                        for genotype data with phenotype data, and output the
                        filtered sample genotype list as well as sample
                        phenotype list
    Workflow Options:
      --phenoFile VAL (as path, required)
                        A phenotype file, can be bed.gz or tsv
      --sample-participant-lookup . (as path)
                        If this file is provided, a genotype/phenotype sample
                        name match will be performed It must contain two column
                        names: genotype_id, sample_id

### Step 1: Sample match with genotype

In [None]:
# This workflow extracts overlapping samples for genotype data with phenotype data, and output the filtered sample genotype list as well as sample phenotype list
[genotype_phenotype_sample_overlap]
# A genotype fam file
parameter: genoFile = path
# A phenotype file, can be bed.gz or tsv
parameter: phenoFile = path
# If this file is provided, a genotype/phenotype sample name match will be performed
# It must contain two column names: genotype_id, sample_id
parameter: sample_participant_lookup = path(".")
input: genoFile, phenoFile
output: f'{cwd:a}/{path(_input[1]):bn}.sample_overlap.txt', f'{cwd:a}/{path(_input[1]):bn}.sample_genotypes.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    # Load required libraries
    library(dplyr)
    library(readr)
    library(data.table)

    # Read data files; let read_delim auto-determine the delimiter
    genoFam <- fread(${_input[0]:ar}, header=FALSE)
    phenoFile <- read_delim(${_input[1]:ar}, col_names=TRUE)
    if (${"TRUE" if sample_participant_lookup.is_file() else "FALSE"}) {
        sample_lookup <- fread(${sample_participant_lookup:ar}, header=TRUE)
    } else {
        sample_lookup <- cbind(genoFam[,2], genoFam[,2])
        colnames(sample_lookup) <- c("genotype_id", "sample_id")
    }
    sample_lookup <- sample_lookup %>%
    filter(
        genotype_id %in% genoFam$V2,
        sample_id %in% colnames(phenoFile)
    )
    
    genoFam %>%
    filter(
        V2 %in% sample_lookup$genotype_id,
    ) %>%
    select(V1, V2) %>%
    fwrite(${_output[1]:r}, col.names=FALSE, sep="\t")

    sample_lookup %>%
    fwrite(${_output[0]:r}, sep="\t")

### Step 2: Kinship quality control

In [None]:
# Inference of relationships in the sample to identify closely related individuals
[king_1]
# Minimum minor allele frequency of the variants used for kinship estimation
parameter: kin_maf = 0.01
input: genoFile
output: f'{cwd}/{_input:bn}{("."+name) if name else ""}.kin0'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', entrypoint=entrypoint
    plink2 \
      --bfile ${_input:n} \
      --make-king-table \
      --king-table-filter ${kinship} \
      ${('--keep %s' % keep_samples) if keep_samples.is_file() else ""} \
      ${('--remove %s' % remove_samples) if remove_samples.is_file() else ""} \
      --min-af ${kin_maf} \
      --max-af ${1-kin_maf} \
      --out ${_output:n} \
      --threads ${numThreads} \
      --memory ${int(expand_size(mem) * 0.9)/1e6} 
    
bash: expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container, entrypoint=entrypoint
    i="${_output}"
    output_size=$(ls -lh $i | cut -f 5 -d ' ')
    output_rows=$(cat $i | wc -l | cut -f 1 -d ' ')
    output_column=$(cat $i | head -1 | wc -w)
    output_preview=$(cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6)
    
    printf "output_info: %s\noutput_size: %s\noutput_rows: %s\noutput_column: %s\noutput_preview:\n%s\n" \
        "$i" "$output_size" "$output_rows" "$output_column" "$output_preview"

In [None]:
# Select a list of unrelated individual with an attempt to maximize the unrelated individuals selected from the data 
[king_2]
# If set to true, the unrelated individuals in a family will be kept without being reported. 
# Otherwise (use `--no-maximize-unrelated`) the entire family will be removed
# Note that attempting to maximize unrelated individuals is computationally intensive on large data.
parameter: maximize_unrelated = False
related_id = [x.strip() for x in open(_input).readlines() if not x.startswith("#")]
done_if(len(related_id) == 0, msg = f"No related individuals detected from {_input}.")
output: f'{_input:n}.related_id'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R:  container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', entrypoint=entrypoint
    library(dplyr)
    library(igraph)
    # Remove related individuals while keeping maximum number of individuals
    # this function is simplified from: 
    # https://rdrr.io/cran/plinkQC/src/R/utils.R
    #' @param relatedness [data.frame] containing pair-wise relatedness estimates
    #' (in column [relatednessRelatedness]) for individual 1 (in column
    #' [relatednessIID1]) and individual 2 (in column [relatednessIID2]). Columns
    #' relatednessIID1, relatednessIID2 and relatednessRelatedness have to present,
    #' while additional columns such as family IDs can be present. Default column
    #' names correspond to column names in output of plink --genome
    #' (\url{https://www.cog-genomics.org/plink/1.9/ibd}). All original
    #' columns for pair-wise highIBDTh fails will be returned in fail_IBD.
    #' @param relatednessTh [double] Threshold for filtering related individuals.
    #' Individuals, whose pair-wise relatedness estimates are greater than this
    #' threshold are considered related.
    relatednessFilter <- function(relatedness, 
                                  relatednessTh,
                                  relatednessIID1="IID1", 
                                  relatednessIID2="IID2",
                                  relatednessRelatedness="KINSHIP") {
        # format data
        if (!(relatednessIID1 %in% names(relatedness))) {
            stop(paste("Column", relatednessIID1, "for relatedness not found!"))
        }
        if (!(relatednessIID2 %in% names(relatedness))) {
            stop(paste("Column", relatednessIID2, "for relatedness not found!"))
        }
        if (!(relatednessRelatedness %in% names(relatedness))) {
            stop(paste("Column", relatednessRelatedness,
                       "for relatedness not found!"))
        }

        iid1_index <- which(colnames(relatedness) == relatednessIID1)
        iid2_index <- which(colnames(relatedness) == relatednessIID2)

        relatedness[,iid1_index] <- as.character(relatedness[,iid1_index])
        relatedness[,iid2_index] <- as.character(relatedness[,iid2_index])

        relatedness_names <- names(relatedness)
        names(relatedness)[iid1_index] <- "IID1"
        names(relatedness)[iid2_index] <- "IID2"
        names(relatedness)[names(relatedness) == relatednessRelatedness] <- "M"

        # Remove symmetric IID rows
        relatedness_original <- relatedness
        relatedness <- dplyr::select(relatedness, IID1, IID2, M)

        sortedIDs <- data.frame(t(apply(relatedness, 1, function(pair) {
            c(sort(c(pair[1], pair[2])))
            })), stringsAsFactors=FALSE)
        keepIndex <- which(!duplicated(sortedIDs))

        relatedness_original <- relatedness_original[keepIndex,]
        relatedness <- relatedness[keepIndex,]

        # individuals with at least one pair-wise comparison > relatednessTh
        # return NULL to failIDs if no one fails the relatedness check
        highRelated <- dplyr::filter(relatedness, M > relatednessTh)
        if (nrow(highRelated) == 0) {
            return(list(relatednessFails=NULL, failIDs=NULL))
        }

        # all samples with related individuals
        allRelated <- c(highRelated$IID1, highRelated$IID2)
        uniqueIIDs <- unique(allRelated)

        # Further selection of samples with relatives in cohort
        multipleRelative <- unique(allRelated[duplicated(allRelated)])
        singleRelative <- uniqueIIDs[!uniqueIIDs %in% multipleRelative]

        highRelatedMultiple <- highRelated[highRelated$IID1 %in% multipleRelative |
                                            highRelated$IID2 %in% multipleRelative,]
        highRelatedSingle <- highRelated[highRelated$IID1 %in% singleRelative &
                                           highRelated$IID2 %in% singleRelative,]

        # Individuals with only one relative in the cohort
        if(length(singleRelative) != 0) {
          # exclude the first individual (IID1) of each such pair
          failIDs_single <- highRelatedSingle[,1]
            
        } else {
          failIDs_single <- NULL
        }
  
        # An individual has multiple relatives
        if(length(multipleRelative) != 0) {
            relatedPerID <- lapply(multipleRelative, function(x) {
                tmp <- highRelatedMultiple[rowSums(
                    cbind(highRelatedMultiple$IID1 %in% x,
                          highRelatedMultiple$IID2 %in% x)) != 0,1:2]
                rel <- unique(unlist(tmp))
                return(rel)
            })
            names(relatedPerID) <- multipleRelative

            keepIDs_multiple <- lapply(relatedPerID, function(x) {
                pairwise <- t(combn(x, 2))
                index <- (highRelatedMultiple$IID1 %in% pairwise[,1] &
                              highRelatedMultiple$IID2 %in% pairwise[,2]) |
                    (highRelatedMultiple$IID1 %in% pairwise[,2] &
                         highRelatedMultiple$IID2 %in% pairwise[,1])
                combination <- highRelatedMultiple[index,]
                combination_graph <- igraph::graph_from_data_frame(combination,
                                                                   directed=FALSE)
                all_iv_set <- igraph::ivs(combination_graph)
                length_iv_set <- sapply(all_iv_set, function(x) length(x))

                if (all(length_iv_set == 1)) {
                    # check how often they occurr elsewhere
                    occurrence <- sapply(x, function(id) {
                        sum(sapply(relatedPerID, function(idlist) id %in% idlist))
                    })
                    # if occurrence the same everywhere, pick the first, else keep
                    # the one with minimum occurrence elsewhere
                    if (length(unique(occurrence)) == 1) {
                        nonRelated <- sort(x)[1]
                    } else {
                        nonRelated <- names(occurrence)[which.min(occurrence)]
                    }
                } else {
                    nonRelated <- all_iv_set[which.max(length_iv_set)]
                }
                return(nonRelated)
            })
            keepIDs_multiple <- unique(unlist(keepIDs_multiple))
            failIDs_multiple <- c(multipleRelative[!multipleRelative %in%
                                                       keepIDs_multiple])
        } else {
            failIDs_multiple <- NULL
        }
        allFailIIDs <- c(failIDs_single, failIDs_multiple)
        relatednessFails <- lapply(allFailIIDs, function(id) {
            fail_inorder <- relatedness_original$IID1 == id &
                relatedness_original$M > relatednessTh
            fail_inreverse <- relatedness_original$IID2 == id &
                relatedness_original$M > relatednessTh
            if (any(fail_inreverse)) {
                inreverse <- relatedness_original[fail_inreverse, ]
                id1 <- iid1_index
                id2 <- iid2_index
                inreverse[,c(id1, id2)] <- inreverse[,c(id2, id1)]
                names(inreverse) <- relatedness_names
            } else {
                inreverse <- NULL
            }
            inorder <- relatedness_original[fail_inorder, ]
            names(inorder) <- relatedness_names
            return(rbind(inorder, inreverse))
        })
        relatednessFails <- do.call(rbind, relatednessFails)
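        # do.call(rbind, list()) returns NULL, so guard against that before calling nrow()
        if (is.null(relatednessFails) || nrow(relatednessFails) == 0) {
            relatednessFails <- NULL
            failIDs <- NULL
        } else {
            names(relatednessFails) <- relatedness_names
            rownames(relatednessFails) <- seq_len(nrow(relatednessFails))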
            uniqueFails <- relatednessFails[!duplicated(relatednessFails[,iid1_index]),]
            failIDs <- uniqueFails[,iid1_index]
        }
        return(list(relatednessFails=relatednessFails, failIDs=failIDs))
    }
    
  
    # main code
    kin0 <- read.table(${_input:r}, header=FALSE, stringsAsFactors=FALSE)
    colnames(kin0) <- c("FID1","ID1","FID2","ID2","NSNP","HETHET","IBS0","KINSHIP")
    if (${"TRUE" if maximize_unrelated else "FALSE"}) {
        rel <- relatednessFilter(kin0, ${kinship}, "ID1", "ID2", "KINSHIP")$failIDs
        tmp1 <- kin0[,1:2]
        tmp2 <- kin0[,3:4]
        colnames(tmp1) = colnames(tmp2) = c("FID", "ID")
        # Get the family ID for these rels so there are two columns FID and IID in the output
        lookup <- dplyr::distinct(rbind(tmp1,tmp2))
        dat <- lookup[which(lookup[,2] %in% rel),]
    } else {
        rel <- kin0 %>% filter(KINSHIP >= ${kinship})
        dat = rbind(rel[,c("FID1","ID1")],setNames(rel[,c("FID2","ID2")],c("FID1","ID1")))
        dat = dat[!duplicated(dat),] ## This is to remove duplicated FID and IID caused by one sample being related to multiple samples
    }

    cat("There are", nrow(dat),"related individuals using a kinship threshold of ${kinship}\n")
    write.table(dat,${_output:r}, quote=FALSE, row.names=FALSE, col.names=FALSE)
    
bash: expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container, entrypoint=entrypoint
    i="${_output}"
    output_size=$(ls -lh $i | cut -f 5 -d ' ')
    # the ID list is written as plain text by write.table, so read it directly rather than with zcat
    output_rows=$(wc -l < $i)
    output_column=$(head -1 $i | wc -w)
    output_preview=$(cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6)
    
    printf "output_info: %s\noutput_size: %s\noutput_rows: %s\noutput_column: %s\noutput_preview:\n%s\n" \
        "$i" "$output_size" "$output_rows" "$output_column" "$output_preview"

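The `maximize_unrelated` branch above tries to keep as many samples as possible: within each cluster of relatives it looks for a largest set of individuals with no related pair among them, that is, an independent set of the relatedness graph, which `igraph::ivs` enumerates. As a rough illustration only, the Python sketch below does the same by brute force for one small cluster. The function name and sample IDs are hypothetical, and this is not the workflow's implementation.

```python
from itertools import combinations

def largest_unrelated_set(ids, related_pairs):
    """Return a largest subset of ids in which no two members form a related pair.

    Brute force over subsets, largest first; only sensible for small clusters.
    """
    related = {frozenset(p) for p in related_pairs}
    ids = sorted(ids)
    for size in range(len(ids), 0, -1):
        for subset in combinations(ids, size):
            if all(frozenset(pair) not in related
                   for pair in combinations(subset, 2)):
                return list(subset)
    return []

# Hypothetical cluster: A is related to B and C, and B is related to D.
cluster = {"A", "B", "C", "D"}
pairs = [("A", "B"), ("A", "C"), ("B", "D")]
keep = largest_unrelated_set(cluster, pairs)
drop = sorted(cluster - set(keep))
print(keep, drop)  # ['A', 'D'] ['B', 'C']
```

Several subsets of the same size can qualify (here `B, C` and `C, D` as well); the sketch simply returns the first in sorted order.
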
In [None]:
# Split genotype data into related and unrelated samples, if related individuals are detected
[king_3]
input: output_from(2), genoFile
output: unrelated_bed = f'{cwd}/{_input[0]:bn}.unrelated.bed',
        related_bed = f'{cwd}/{_input[0]:bn}.related.bed'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash:  expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    plink2 \
      --bfile ${_input[1]:n} \
      --remove ${_input[0]} \
      ${('--keep %s' % keep_samples) if keep_samples.is_file() else ""} \
      --make-bed \
      --out ${_output[0]:n} \
      --threads ${numThreads} \
      --memory ${int(expand_size(mem) * 0.9)/1e6} --new-id-max-allele-len 1000 --set-all-var-ids chr@:#_\$r_\$a 

    plink2 \
      --bfile ${_input[1]:n} \
      --keep ${_input[0]} \
      --make-bed \
      --out ${_output[1]:n} \
      --threads ${numThreads} \
      --memory ${int(expand_size(mem) * 0.9)/1e6} --new-id-max-allele-len 1000 --set-all-var-ids chr@:#_\$r_\$a 
        
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    for i in ${_output}; do     
        output_size=$(ls -lh $i | cut -f 5 -d ' ')
        printf "output_info: %s\noutput_size: %s\n" "$i" "$output_size" >> ${_output[0]:n}.stdout
    done

### Step 3: Prepare unrelated individuals data for PCA

In [None]:
# Filter SNPs and select individuals 
[qc_no_prune, qc_1 (basic QC filters)]
# minimum MAF filter to use. 0 means do not apply this filter.
parameter: maf_filter = 0.0
# maximum MAF filter to use. 0 means do not apply this filter.
parameter: maf_max_filter = 0.0
# minimum MAC filter to use. 0 means do not apply this filter.
parameter: mac_filter = 0.0
# maximum MAC filter to use. 0 means do not apply this filter.
parameter: mac_max_filter = 0.0 
# Maximum missingness per-variant
parameter: geno_filter = 0.1
# Maximum missingness per-sample
parameter: mind_filter = 0.1
# HWE filter -- a very lenient one
parameter: hwe_filter = 1e-15
# Other PLINK arguments, e.g. snps-only, write-samples, etc. (given without the leading "--")
parameter: other_args = []
# Only output SNP and sample list, rather than the PLINK binary format of subset data
parameter: meta_only = False
# Remove duplicate variants
parameter: rm_dups = False

fail_if(not (keep_samples.is_file() or keep_samples == path('.')), msg = f'Cannot find ``{keep_samples}``')
fail_if(not (keep_variants.is_file() or keep_variants == path('.')), msg = f'Cannot find ``{keep_variants}``')
fail_if(not (remove_samples.is_file() or remove_samples == path('.')), msg = f'Cannot find ``{remove_samples}``')

input: genoFile, group_by=1
output: f'{cwd}/{_input:bn}{("."+name) if name else ""}.plink_qc{".extracted" if keep_variants.is_file() else ""}{".bed" if not meta_only else ".snplist"}'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint
    plink2 \
      --bfile ${_input:n} \
      ${('--maf %s' % maf_filter) if maf_filter > 0 else ''} \
      ${('--max-maf %s' % maf_max_filter) if maf_max_filter > 0 else ''} \
      ${('--mac %s' % mac_filter) if mac_filter > 0 else ''} \
      ${('--max-mac %s' % mac_max_filter) if mac_max_filter > 0 else ''} \
      ${('--geno %s' % geno_filter) if geno_filter > 0 else ''} \
      ${('--hwe %s' % hwe_filter) if hwe_filter > 0 else ''} \
      ${('--mind %s' % mind_filter) if mind_filter > 0 else ''} \
      ${('--keep %s' % keep_samples) if keep_samples.is_file() else ""} \
      ${('--remove %s' % remove_samples) if remove_samples.is_file() else ""} \
      ${('--exclude %s' % exclude_variants) if exclude_variants.is_file() else ""} \
      ${('--extract %s' % keep_variants) if keep_variants.is_file() else ""} \
      ${('--make-bed') if not meta_only else "--write-snplist --write-samples"} \
      ${("") if not rm_dups else "--rm-dup force-first 'list'"} \
      ${paths(["--%s" % x for x in other_args]) if other_args else ""} \
      --out ${_output:n} \
      --threads ${numThreads} \
      --memory ${int(expand_size(mem) * 0.9)/1e6} --new-id-max-allele-len 1000 --set-all-var-ids chr@:#_\$r_\$a 
        
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
    i="${_output}"
    output_size=$(ls -lh $i | cut -f 5 -d ' ')
    printf "output_info: %s\noutput_size: %s\n" "$i" "$output_size" >> ${_output:n}.stdout
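
The `plink2` call in this step is assembled from the parameters above, and any threshold left at 0 drops its flag from the command line. The Python sketch below mimics that assembly for the numeric filters only, as an illustration of what the template expands to. The function name is hypothetical, and the sample, variant and duplicate-removal options are left out.

```python
def plink_filter_args(maf=0.0, max_maf=0.0, mac=0.0, max_mac=0.0,
                      geno=0.1, mind=0.1, hwe=1e-15, meta_only=False):
    """Build the optional plink2 filter flags; a value of 0 skips that filter."""
    thresholds = [("--maf", maf), ("--max-maf", max_maf), ("--mac", mac),
                  ("--max-mac", max_mac), ("--geno", geno), ("--hwe", hwe),
                  ("--mind", mind)]
    args = []
    for flag, value in thresholds:
        if value > 0:
            args += [flag, str(value)]
    # meta_only writes only the variant and sample lists instead of a new fileset
    args += ["--write-snplist", "--write-samples"] if meta_only else ["--make-bed"]
    return args

print(" ".join(plink_filter_args(maf=0.01)))
# --maf 0.01 --geno 0.1 --hwe 1e-15 --mind 0.1 --make-bed
```
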

In [None]:
# LD pruning of variants
# Plink2 has multi-threaded calculation for LD pruning
[qc_2 (LD pruning)]
# Window size
parameter: window = 50
# Shift window every 10 snps
parameter: shift = 10
# Pairwise r^2 threshold: within a window, variant pairs with r^2 above this value are pruned
parameter: r2 = 0.1
stop_if(r2==0)
output: bed=f'{cwd}/{_input:bn}.prune.bed', prune=f'{cwd}/{_input:bn}.prune.in'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    plink2 \
    --bfile ${_input:n} \
    --indep-pairwise ${window} ${shift} ${r2}  \
    --out ${_output["prune"]:nn} \
    --threads ${numThreads} \
    --memory ${int(expand_size(mem) * 0.9)/1e6}
   
    plink2 \
    --bfile ${_input:n} \
    --extract ${_output['prune']} \
    --make-bed \
    --out ${_output['bed']:n} \
    --threads ${numThreads} \
    --memory ${int(expand_size(mem) * 0.9)/1e6}
    
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    i="${_output[0]}"
    output_size=$(ls -lh $i | cut -f 5 -d ' ')
    printf "output_info: %s\noutput_size: %s\n" "$i" "$output_size" >> ${_output[0]:n}.stdout
    i="${_output[1]}"
    output_size=$(ls -lh $i | cut -f 5 -d ' ')
    # prune.in is a plain-text variant list, so read it directly rather than with zcat
    output_rows=$(wc -l < $i)
    output_column=$(head -1 $i | wc -w)
    output_preview=$(cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6) 
    printf "output_info: %s\noutput_size: %s\noutput_rows: %s\noutput_column: %s\noutput_preview:\n%s\n" \
        "$i" "$output_size" "$output_rows" "$output_column" "$output_preview" >> ${_output[1]}.stdout

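To make the roles of `window`, `shift` and `r2` concrete, here is a simplified Python sketch of sliding-window pairwise pruning. It is a toy version under stated assumptions: genotypes are dosage vectors (0/1/2) with no missing values, and of a correlated pair it always drops the later variant. PLINK's actual procedure differs in details such as which variant of a pair is removed, so treat this only as an illustration of the idea.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic variant: correlation undefined, treated as 0
    return sxy * sxy / (sxx * syy)

def prune(genotypes, window=50, shift=10, r2=0.1):
    """Return indices of variants kept after greedy sliding-window pruning."""
    removed = set()
    m = len(genotypes)
    start = 0
    while start < m:
        idx = [i for i in range(start, min(start + window, m))
               if i not in removed]
        for pos, i in enumerate(idx):
            if i in removed:
                continue
            for j in idx[pos + 1:]:
                if j not in removed and r_squared(genotypes[i], genotypes[j]) > r2:
                    removed.add(j)  # simplification: always drop the later variant
        start += shift
    return [i for i in range(m) if i not in removed]

# Three hypothetical variants over six samples: v1 duplicates v0, v2 is uncorrelated with v0.
v0 = [0, 1, 2, 0, 1, 2]
v1 = [0, 1, 2, 0, 1, 2]
v2 = [0, 0, 0, 2, 2, 2]
kept = prune([v0, v1, v2], window=3, shift=1, r2=0.1)
print(kept)  # [0, 2]
```
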
### Step 4: Run Principal Components Analysis on genotype

In [None]:
# Run PCA analysis using flashpca 
[flashpca_1]
# Plink binary file
parameter: genoFile = path
# The phenotypic file
parameter: phenoFile = path(f'{genoFile}'.replace(".bed",".fam"))
# minimum population size to consider in the analysis
parameter: min_pop_size = 2
# How to standardize X before PCA
parameter: stand = "binom2"
## Input genoFile here is for unrelated samples
input: genoFile, phenoFile
output: f'{cwd}/{phenoFile:bn}{("."+name) if name else ""}.{(suffix+".") if suffix != "" else ""}pca.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: container = container, expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint
    # Load required libraries
    library(flashpcaR)
    library(dplyr)
    pops = c(${paths(pops):r,})
    f <- flashpca(${_input[0]:nr}, ndim=${k}, stand="${stand}", do_loadings=TRUE, check_geno=TRUE)
    rownames(f$loadings) <- read.table('${_input[0]:n}.bim',stringsAsFactors =F)[,2]
    # Use the projection file to generate pca plot
    pca <- as.data.frame(f$projection)
    pca <- tibble::rownames_to_column(pca, "ID")
    colnames(pca) <- c("ID",paste0("PC", 1:${k}))
  
    # Read fam file with phenotypes
    if(endsWith(${_input[1]:r}, ".fam")){
        pheno <- read.table(${_input[1]:r}, header=F,stringsAsFactors =F)
        colnames(pheno) = c("FID", "IID", "MID", "PID", "SEX", "STATUS")
    } else {
        pheno <- read.table(${_input[1]:r}, header=T,stringsAsFactors =F)
        if("IID" %in% colnames(pheno) == FALSE) stop("No IID column in the phenoFile. Please rename the header of the phenoFile")
        if("FID" %in% colnames(pheno) == FALSE) pheno$FID = pheno$IID
    }
    # Build a unique ID by joining FID and IID
    pheno$ID = paste(pheno$FID,pheno$IID,sep = ":")
    # Check for duplicated IDs
    if(length(unique(pheno$ID))!=length(pheno$ID)) stop("There are duplicated FID:IID combinations in phenoFile")

    if (length(pops)>0) pheno <- pheno %>%filter(${pop_col if pop_col else  "pop"} %in% pops | ${label_col if label_col else  "pop"} %in% pops)
    pca <-merge(pheno, pca,by ="ID", all=FALSE) 
    #
    if (${"TRUE" if pop_col else "FALSE"}) {
        # remove populations that have fewer than ${min_pop_size} samples
        pop<-names(table(pca$${pop_col if pop_col else "pop"}))
        # count samples in the user-specified population column (not a hard-coded "pop" column)
        pop_filter<-pop[table(pca$${pop_col if pop_col else "pop"})<${min_pop_size}] # pop to be removed
        if (length(pop_filter)>0) {
            warning(paste(pop_filter, collapse="; "), ": these ", length(pop_filter), " population(s) will be removed due to having fewer than ${min_pop_size} samples in data.")
            # remove
            pca<-pca%>% filter(${f'!{pop_col}%in%pop_filter' if pop_col else pop_col})
        }
    } else {
      pca$pop <- 1 
    }

    # Write the PC scores to a file
    write.table(pca,"${_output:n}.txt", sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
    dat = list(pca_model = f, pc_scores = pca, meta = "${_input[1]:bn} ${suffix}")
    # compute centroids before projecting back the samples
    # (calculate mean/median/cov per pop)
    if(${"FALSE" if homogeneous else "TRUE"}){
        pop_group <- split(dat$pc_scores[ ,c(paste0("PC", 1:${maha_k}))], list(Group = dat$pc_scores$${pop_col if pop_col else "pop"}))
        dat$pc_cov <- lapply(pop_group, function(x) cov(x))
        dat$pc_mean <- lapply(pop_group, function(x) sapply(x, mean))
        dat$pc_median <- lapply(pop_group, function(x) sapply(x, median))
    } else {
        dat$pc_cov <- cov(f$projection[,1:${maha_k}])
        dat$pc_mean <- apply(f$projection[,1:${maha_k}], 2, mean)
        dat$pc_median <- apply(f$projection[,1:${maha_k}], 2, median)
    }
  
    # save results
    saveRDS(dat, ${_output:r})
  
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "This rds file is a list containing the PCA results for unrelated samples" >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        done
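
For non-homogeneous data, the R code above stores a mean, median and covariance of the first `maha_k` PCs for each population, which later steps can use as per-population centroids. The Python sketch below reproduces that bookkeeping on made-up scores. The function name, population labels and values are hypothetical. It uses the sample covariance (denominator n - 1), as R's `cov()` does.

```python
from collections import defaultdict
from statistics import mean, median

def group_summaries(rows, k=2):
    """rows: list of (pop, [PC1, PC2, ...]); summarise the first k PCs per pop."""
    groups = defaultdict(list)
    for pop, pcs in rows:
        groups[pop].append(pcs[:k])
    out = {}
    for pop, scores in groups.items():
        cols = list(zip(*scores))          # one tuple of values per PC
        means = [mean(c) for c in cols]
        n = len(scores)
        cov = None                         # undefined for a single sample
        if n >= 2:
            cov = [[sum((a - means[i]) * (b - means[j])
                        for a, b in zip(cols[i], cols[j])) / (n - 1)
                    for j in range(k)] for i in range(k)]
        out[pop] = {"mean": means,
                    "median": [median(c) for c in cols],
                    "cov": cov}
    return out

# Hypothetical PC scores for two populations.
rows = [("EUR", [0.0, 1.0]), ("EUR", [2.0, 3.0]),
        ("AFR", [-1.0, 0.0]), ("AFR", [-3.0, 4.0])]
summary = group_summaries(rows)
print(summary["EUR"]["mean"])  # [1.0, 2.0]
print(summary["AFR"]["cov"])   # [[2.0, -4.0], [-4.0, 8.0]]
```

The `min_pop_size` filter matters here: a population with a single sample has no defined covariance.
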

In [None]:
[flashpca_2, project_samples_2]
# Probability threshold used to flag outliers, e.g. 0.95 or 0.997
parameter: prob = 0.997
# Use robust estimates for the Mahalanobis distance, so that outliers do not distort it
parameter: robust = True
output: distance=f'{_input:n}.mahalanobis.rds',
        identified_outliers=f'{_input:n}.outliers',
        analysis_summary=f'{_input:n}.analysis_summary.md',
        qqplot_mahalanobis=f'{_input:n}.mahalanobis_qq.png',
        hist_mahalanobis=f'{_input:n}.mahalanobis_hist.png'
sos_run("detect_outliers", pca_result=_input, prob=prob, robust=robust)
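
The `detect_outliers` workflow is not shown in this section. One common way to turn `prob` into a cutoff is to compare each sample's squared Mahalanobis distance with a chi-square quantile whose degrees of freedom equal the number of PCs used. The Python sketch below assumes that approach and restricts it to two PCs, where the quantile has the closed form $-2\ln(1-\text{prob})$. The function names and example points are hypothetical, and it uses a plain mean and covariance rather than the robust estimates that `robust = True` requests.

```python
import math

def mahalanobis_sq_2d(point, center, cov):
    """Squared Mahalanobis distance of a 2-D point given a mean and 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = point[0] - center[0], point[1] - center[1]
    # the inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det

def chi2_cutoff_2df(prob):
    """Quantile of the chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - prob)

center = (0.0, 0.0)
cov = [[1.0, 0.0], [0.0, 1.0]]          # identity covariance for readability
cutoff = chi2_cutoff_2df(0.997)         # about 11.6
points = [(1.0, 1.0), (4.0, 0.0)]
flags = [mahalanobis_sq_2d(p, center, cov) > cutoff for p in points]
print(flags)  # [False, True]
```

A larger `prob` raises the cutoff and flags fewer samples, so 0.95 is stricter than 0.997.
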

In [None]:
[flashpca_3, project_samples_3]
input: output_from(1), output_from(2)['identified_outliers']
outliers = [x.strip() for x in open(_input[1]).readlines() if x.strip()]
output: f"{cwd}/{_input[0]:bn}.pc.png",
        f"{cwd}/{_input[0]:bn}.scree.png"
sos_run("plot_pca", plot_data = _input[0], outlier_file = _input[1] if len(outliers) else path())