# Principal component analysis

This notebook performs principal component analysis (PCA) on genotype data and generates plots of the results.

## Overview


Generating a PCA involves the following steps:

- removing related individuals
- pruning variants in linkage disequilibrium (LD)
- performing PCA on the genotypes
- excluding outlier samples in the PCA space for individuals of homogeneous self-reported ancestry. These outliers may suggest poor genotyping quality or distant relatedness.

Pitfalls

1. Some of the PCs may capture LD structure rather than population structure, which decreases the power to detect associations in regions of high LD.
2. When a new study dataset is projected onto the PCA space computed from a reference dataset, the projected PCs are shrunk toward 0 in the new dataset.
3. PC scores may capture outliers that are due to family structure, population structure or other reasons. It can be beneficial to detect and remove these individuals, either to maximize the population structure captured by PCA (when removing a few outliers) or to restrict analyses to genetically homogeneous samples.

## Input

1. Genotypes in PLINK format. This may differ from the bfile, e.g. when working with exome data or imputed genotype data.
2. Phenotype file with population information and possibly disease or other labelling information. It must contain columns named `FID` and `IID`.

The inputs should be split into sets of related and unrelated individuals. Additionally, you may want to prepare data per population. See the "How to run this workflow" section for more details.
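As a quick sanity check before running the workflow, the header of the phenotype file can be inspected for the two required ID columns. The sketch below is illustrative only and not part of the workflow. It assumes a tab-separated file with a header row (which is how the R code in this notebook reads it); the helper name `missing_id_columns` and the example table are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("FID", "IID")


def missing_id_columns(handle, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from a tab-separated header."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in required if name not in header]


# Hypothetical phenotype table; RACE is only an example label column.
example = io.StringIO("FID\tIID\tRACE\ns1\ts1\tA\ns2\ts2\tB\n")
print(missing_id_columns(example))  # an empty list means both ID columns are present
```

For a real file, the same function can be called on `open(path, newline="")` instead of the in-memory example.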

## Output

1. PCA models (inside RDS file)
2. PCA scores (inside RDS file)
3. Mahalanobis distances and outliers to remove

## General workflow

1. Estimate relatedness of the individuals in the sample with PLINK 2, which implements the KING algorithm
2. Select specific SNPs and samples using PLINK (QC: MAF > 1%, variant missing rate `geno` = 0.1, individual missing rate `mind` = 0.1 and `hwe` = 5e-08), and remove related individuals
3. Thin SNPs by LD pruning (defaults: window = 50, shift = 10, r2 = 0.1)
4. Run PCA using only unrelated individuals from all populations, and examine the resulting plot
5. Project the related individuals back, and generate a list of suggested samples to remove based on the Mahalanobis distance test statistic per population. The default cutoff is the 0.997 quantile of the distances (`prob = 0.997`), but we recommend checking the output plots before and after removal and revisiting the cutoff if needed.
6. Remove outliers based on the list generated in the previous step

Now the data should not have outliers. If you have subpopulations in the data, apply these additional steps:

7. Split the data by population; each population should have separate datasets for related and unrelated individuals
8. For each population, perform QC
9. For each population, recalculate the PCs using unrelated individuals
10. For each population, project the related samples back to the PC space

No need to remove outliers at this point.

## Method

Here is a quick recap of PCA for those not immediately familiar with the method. PCA is a mathematical method that reduces the dimensionality of the data while retaining most of the variation in the dataset. 
This is accomplished by identifying directions, or principal components (PCs), that account for the maximum variation in the data. 

One common approach to PCA is based on the singular-value decomposition of the data matrix $X$ (in our case the genotype matrix),
$$X = U D V^T,$$
where the columns of $U$ are the left singular vectors, $D$ is the diagonal matrix of singular values, and the columns of $V$ are the right singular vectors (also called loadings). 

PCA can also be done using the eigen-decomposition of $X X^T$
$$X X^T = U S U^T,$$ where $S=D^2$ is the diagonal matrix of eigenvalues.
$X$ is usually centred (mean-subtracted) or standardised (mean subtracted, then divided by standard deviation) before PCA.
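To see why the eigen-decomposition gives the same components as the singular-value decomposition, note that $V$ has orthonormal columns, so $V^T V = I$. Substituting $X = U D V^T$ then gives
$$X X^T = (U D V^T)(U D V^T)^T = U D V^T V D U^T = U D^2 U^T,$$
so the eigenvectors of $X X^T$ are the left singular vectors of $X$ and its eigenvalues are the squared singular values, $S = D^2$. In the same notation, the PC scores of the individuals can be written as $X V = U D$.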

For PCA of SNP genotypes (at least in diploid organisms), the common standardisation is
$$X_{ij}^{\prime} = \frac{X_{ij} - 2p_j}{\sqrt{2 p_j (1 - p_j)}},$$
where $X_{ij}$ is the genotype (minor allele dosage $\{0, 1, 2\}$) for the $i$th individual and the $j$th SNP, and $p_j$ is the minor allele frequency (MAF) for the $j$th SNP. In addition, the eigenvalues are scaled by the number of SNPs $m$ (equivalent to performing the eigen-decomposition of $XX^T/m$).
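As an illustration of the standardisation formula, the following minimal sketch standardises a single SNP column. It is not the flashpca implementation: the function name `standardise_snp` is hypothetical, the allele frequency is estimated from the dosages themselves, and missing genotypes are not handled.

```python
import math


def standardise_snp(dosages):
    """Standardise one SNP column as (x - 2p) / sqrt(2p(1 - p))."""
    # Allele frequency p estimated from the dosages (two alleles per individual)
    p = sum(dosages) / (2 * len(dosages))
    scale = math.sqrt(2 * p * (1 - p))
    if scale == 0:
        raise ValueError("monomorphic SNP: the scale is zero")
    return [(x - 2 * p) / scale for x in dosages]


# Four hypothetical individuals with dosages 0, 1, 1 and 2 (p = 0.5)
print(standardise_snp([0, 1, 1, 2]))
```

After this transformation each SNP column has mean zero, and monomorphic SNPs have to be removed beforehand because their scale is zero.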


## How to run this workflow



### Step 1: Estimate kinship in the sample
Aim: identify and output closely related individuals prior to PCA analysis.

```
sos run GWAS_QC.ipynb king \
    --genoFile <all_samples.bed>
```
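To illustrate what a list of related individuals represents, the sketch below flags every individual who appears in at least one pair whose kinship coefficient reaches a threshold. The `(id1, id2, kinship)` tuples, the function name `related_samples` and the sample IDs are hypothetical, and the `king` workflow itself may select related individuals differently. The threshold 0.13 is the value passed as `--kinship` in the MWE section of this notebook.

```python
def related_samples(pairs, threshold):
    """Return the sorted IDs that occur in any pair with kinship >= threshold."""
    flagged = set()
    for id1, id2, kinship in pairs:
        if kinship >= threshold:
            flagged.update((id1, id2))
    return sorted(flagged)


# Hypothetical pairwise kinship estimates
pairs = [("s1", "s2", 0.25), ("s3", "s4", 0.02), ("s2", "s5", 0.14)]
print(related_samples(pairs, threshold=0.13))
```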

### Step 2 and 3: Sample selection and QC the genetic data 

1. QC based on MAF, sample and variant missingness, and Hardy-Weinberg equilibrium. You can provide a list of samples to keep or to remove. For example:
    - Only extract data for one population
    - Only extract data for related individuals
    - Only extract data for unrelated individuals

   In the current context we extract data for unrelated individuals and proceed with the rest of the QC steps.
2. LD pruning. Prune SNPs in linkage disequilibrium so that the PCA captures population structure rather than LD structure (which could reduce the power to detect genetic associations in these LD regions).
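To make the variant-level QC thresholds concrete, here is a minimal sketch of a MAF and missingness check for a single variant. It is not the PLINK implementation: the function name `variant_qc`, the use of `None` for missing calls and the example data are assumptions for illustration, and the Hardy-Weinberg and per-sample (`mind`) filters are left out.

```python
def variant_qc(dosages, maf_min=0.01, geno_max=0.1):
    """Summarise one variant; dosages are 0/1/2, with None for a missing call."""
    called = [d for d in dosages if d is not None]
    missing_rate = 1 - len(called) / len(dosages)
    if not called:
        return {"maf": None, "missing_rate": missing_rate, "keep": False}
    # Frequency of the counted allele among the called genotypes
    allele_freq = sum(called) / (2 * len(called))
    maf = min(allele_freq, 1 - allele_freq)
    # Keep variants that are common enough and not missing too often
    keep = maf >= maf_min and missing_rate <= geno_max
    return {"maf": maf, "missing_rate": missing_rate, "keep": keep}


# Hypothetical variant genotyped in ten individuals, none missing
print(variant_qc([0, 0, 0, 0, 0, 0, 0, 0, 0, 2]))
```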

Get unrelated individuals

```
sos run GWAS_QC.ipynb qc \
    --genoFile <all_samples.bed> \
    --remove-samples <king_output.related_id> \
    ...
```
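The LD pruning part of this step is driven by the squared correlation $r^2$ between pairs of SNPs inside a sliding window. As a rough illustration of that quantity only (not of PLINK's pruning algorithm), the sketch below computes $r^2$ between two dosage vectors. The function name `r_squared` and the example vectors are hypothetical, and missing genotypes are not handled.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sxy * sxy / (sxx * syy)


# Two hypothetical SNPs that are uncorrelated across four individuals
print(r_squared([0, 0, 2, 2], [0, 2, 0, 2]))
```

A pair of SNPs whose $r^2$ exceeds the chosen threshold is a candidate for pruning, so a lower threshold leaves fewer, less correlated SNPs.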

Get related individuals but **use the same variants as before**. We therefore only run `qc:1` with a list of variants extracted, and other filtering parameters set to 0.

```
sos run GWAS_QC.ipynb qc:1 \
    --genoFile <all_samples.bed> \
    --keep-samples <king_output.related_id> \
    --keep-variants <previous_command_output.prune.in> \
    --maf-filter 0 --geno-filter 0 --mind-filter 0 --hwe-filter 0 \
    ...
```

### Step 4: PCA analysis for all samples

```
sos run PCA.ipynb flashpca \
    --genoFile <unrelated_samples.bed> \
    --phenoFile <phenotypes.txt>
```

### Step 5: Projection of related individuals and outlier detection 

```
sos run PCA.ipynb project_samples \
    --pca-model <unrelated_samples.pca.rds> \
    --genoFile <related_samples.bed> \
    --phenoFile <phenotypes.txt>
```
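The outlier detection in this step can be pictured with a small two-PC example. The sketch below computes squared Mahalanobis distances with the closed-form inverse of a 2x2 covariance matrix and flags samples above an empirical quantile, mirroring the idea of the `prob` cutoff. It is a simplified stand-in for the R code in this notebook: all function names and the example scores are hypothetical, only two PCs are used, and the centre and covariance are passed in rather than estimated per population.

```python
import math


def mahalanobis_sq_2d(point, center, cov):
    """Squared Mahalanobis distance for two PCs via the closed-form 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (dx * (d * dx - b * dy) + dy * (a * dy - c * dx)) / det


def quantile_type7(values, prob):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    h = (len(ordered) - 1) * prob
    lo = math.floor(h)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (h - lo) * (ordered[hi] - ordered[lo])


def flag_outliers(scores, center, cov, prob=0.997):
    """Return the IDs whose squared distance exceeds the prob-quantile cutoff."""
    d2 = {iid: mahalanobis_sq_2d(pc, center, cov) for iid, pc in scores.items()}
    cutoff = quantile_type7(list(d2.values()), prob)
    return sorted(iid for iid, value in d2.items() if value > cutoff)


# Hypothetical PC1/PC2 scores; s4 lies far from the other samples
scores = {"s1": (0.1, -0.2), "s2": (-0.3, 0.1), "s3": (0.2, 0.2), "s4": (4.0, 3.5)}
print(flag_outliers(scores, center=(0.0, 0.0), cov=((1.0, 0.0), (0.0, 1.0)), prob=0.9))
```

Because the cutoff is an empirical quantile of the observed distances, some samples will usually fall above it even in clean data, which is one reason to inspect the plots before removing anyone.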

### Step 6: Outlier removal

```
sos run GWAS_QC.ipynb qc \
    --remove-samples <mahalanobis_output.outliers> \
    --genoFile <all_samples.bed>
```

If you are analyzing a homogeneous population, the workflow ends here.

### Step 7: Split data by population

For each population, you can still use the GWAS_QC.ipynb and PCA.ipynb workflows, but at this step you have to prepare the `--remove-samples` and `--keep-samples` files manually.

For each population, create a `remove-samples` file (related individuals plus individuals not in the current population) and a `keep-samples` file (related individuals in the current population) for the next steps.
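One way to build these two lists is with plain set operations, following the placeholders used in the Step 8 commands of this notebook. The sketch below is only an illustration: the function name `population_sample_lists` and the sample IDs are hypothetical, and writing the resulting IDs to files in the format expected by the workflow is left out.

```python
def population_sample_lists(all_ids, related_ids, population_ids):
    """Return (remove_samples, keep_samples) for one population."""
    everyone = set(all_ids)
    related = set(related_ids)
    in_population = set(population_ids)
    # Unrelated run: drop related individuals and anyone outside the population
    remove_samples = sorted(related | (everyone - in_population))
    # Related run: keep only related individuals inside the population
    keep_samples = sorted(related & in_population)
    return remove_samples, keep_samples


# Hypothetical cohort of five samples, two of them flagged as related
remove_samples, keep_samples = population_sample_lists(
    all_ids=["s1", "s2", "s3", "s4", "s5"],
    related_ids=["s2", "s5"],
    population_ids=["s1", "s2", "s3"],
)
print(remove_samples, keep_samples)
```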

### Step 8: For each population perform QC

Get unrelated individuals

```
sos run GWAS_QC.ipynb qc \
    --genoFile <all_samples.bed> \
    --remove-samples  <king_output.related_id AND samples not in this population> \
    ...
```

Get related individuals but **use the same variants as before**. We therefore only run `qc:1` with a list of variants extracted, and other filtering parameters set to 0.

```
sos run GWAS_QC.ipynb qc:1 \
    --genoFile <all_samples.bed> \
    --keep-samples <king_output.related_id AND samples within this population> \
    --keep-variants <previous_command_output.prune.in> \
    --maf-filter 0 --geno-filter 0 --mind-filter 0 --hwe-filter 0 \
    ...
```

### Step 9: For each population do PCA

```
sos run PCA.ipynb flashpca \
    --genoFile <unrelated_samples_this_pop.bed> \
    --phenoFile <phenotypes.txt> \
    ...
```

### Step 10: For each population projection of related individuals, no need for outlier detection

Because outlier detection is not needed here, we run only the first step, `project_samples:1`.

```
sos run PCA.ipynb project_samples:1 \
    --pca-model <unrelated_samples_this_pop.pca.rds> \
    --genoFile <related_samples_this_pop.bed> \
    --phenoFile <phenotypes.txt> \
    ...
```
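The projection reuses the loadings, means and scales estimated from the unrelated samples (`orig_mean` and `orig_sd` in the `project` call in this notebook) instead of re-estimating them from the related samples. The sketch below illustrates that idea for one individual. It assumes the binomial standardisation from the Method section, ignores any constant scaling factor a real implementation may apply, and uses a hypothetical function name and toy loadings.

```python
import math


def project_sample(dosages, reference_freqs, loadings):
    """Project one individual onto k PCs using reference-panel statistics.

    dosages: m genotype dosages for the new individual
    reference_freqs: m allele frequencies estimated in the reference (unrelated) set
    loadings: m rows, each with k loadings from the reference PCA
    """
    k = len(loadings[0])
    scores = [0.0] * k
    for x, p, row in zip(dosages, reference_freqs, loadings):
        # Standardise with the reference frequency, not the new sample's own
        z = (x - 2 * p) / math.sqrt(2 * p * (1 - p))
        for j in range(k):
            scores[j] += z * row[j]
    return scores


# Toy example: two SNPs, two PCs, identity loadings
print(project_sample([2, 0], [0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]]))
```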

## How to run MWE

These commands were run on the Columbia cluster, using the MWE_AD data.

Step 1: Estimate kinship in the sample

```
sos run ~/GWAS_QC.ipynb king \
    --container_lmm /mnt/mfs/statgen/containers/lmm.sif \
    --cwd ~/output \
    --genoFile ~/MWE_AD/rename_chr22.bed \
    --name first \
    --kinship 0.13
```

Step 2: Get the genotypes (make bed) for unrelated individuals

```
sos run ~/GWAS_QC.ipynb qc \
    --container_lmm /mnt/mfs/statgen/containers/lmm.sif \
    --cwd ~/output \
    --genoFile ~/MWE_AD/rename_chr22.bed \
    --remove_samples ~/output/rename_chr22.first.related_id\
    --maf_filter 0.5 \
    --geno_filter 0.2 \
    --mind_filter 0.1 \
    --hwe_filter 0.0 \
    --name unrelated \
    --window 50 \
    --shift 10 \
    --r2 0.5 
```
Step 3: Get the genotype file (make bed) for related individuals

```
sos run  ~/GWAS_QC.ipynb qc:1 \
    --container_lmm /mnt/mfs/statgen/containers/lmm.sif \
    --cwd ~/output \
    --genoFile ~/MWE_AD/rename_chr22.bed \
    --keep-samples ~/output/rename_chr22.first.related_id\
    --keep-variants ~/output/cache/rename_chr22.unrelated.filtered.prune.in\
    --maf-filter 0 --geno-filter 0 --mind-filter 0 --hwe-filter 0\
    --name related
```
Step 4: Run PCA for unrelated individuals and plot the results (`flashpca_2` calls `plot_pca`, so there is no need to run `flashpca+plot_pca`)

```
sos run  ~/PCA.ipynb flashpca \
    --container_lmm /mnt/mfs/statgen/containers/lmm.sif \
    --cwd ~/output \
    --genoFile ~/output/cache/rename_chr22.unrelated.filtered.prune.bed\
    --phenoFile ~/MWE_AD/MWE_pheno.txt\
    --k 4 \
    --label_col RACE \
    --pop_col RACE \
    --plot_data ~/output/MWE_pheno.pca.rds
```

Step 5: Project back related individuals, detect outliers and generate the outlier list for the entire dataset, and plot the PCA without marking outliers

```
sos run  ~/PCA.ipynb project_samples+plot_pca \
    --container_lmm /mnt/mfs/statgen/containers/lmm.sif \
    --cwd ~/output \
    --genoFile ~/output/cache/rename_chr22.related.filtered.bed \
    --phenoFile ~/MWE_AD/MWE_pheno.txt\
    --label_col RACE \
    --plot_K 3\
    --pca_model ~/output/MWE_pheno.pca.rds \
    --pop_col RACE \
    --plot_data ~/output/MWE_pheno.pca.rds
```

Step 6: Outlier removal for the entire dataset

```
sos run ~/GWAS_QC.ipynb qc:1 \
    --container_lmm /mnt/mfs/statgen/containers/lmm.sif \
    --cwd ~/Downloads/output \
    --remove-samples ~/MWE_AD/MWE_pheno.projected.outliers \
    --genoFile ~/MWE_AD/rename_chr22.bed \
    --name no_outliers
```

Plot PCA with outliers circled in red
```
sos run ~/PCA.ipynb plot_pca \
    --container_lmm /mnt/mfs/statgen/containers/lmm.sif \
    --cwd ~/output \
    --outlier_file ~/MWE_AD/MWE_pheno.projected.outliers \
    --genoFile ~/MWE_AD/rename_chr22.bed \
    --phenoFile ~/MWE_AD/MWE_pheno.txt\
    --k 4 \
    --plot_K 3\
    --label_col RACE \
    --pop_col RACE \
    --plot_data ~/output/MWE_pheno.pca.rds
```

## Command interface

In [2]:
sos run PCA.ipynb -h

usage: sos run PCA.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  flashpca
  project_samples
  plot_pca
  detect_outliers

Global Workflow Options:
  --cwd VAL (as path, required)
                        the output directory for generated files
  --genoFile VAL (as path, required)
                        Plink binary file
  --phenoFile VAL (as path, required)
                        The phenotypic file
  --pop-col ''
                        Name of the population column in the phenoFile
  --pops  (as list)
  --label-col VAL (as str, required)
                        Name of the color label column in the phenoFile; can be
   

In [1]:
[global]
# the output directory for generated files
parameter: cwd = path
# Plink binary file
parameter: genoFile = path
# The phenotypic file
parameter: phenoFile = path
# Name of the population column in the phenoFile
parameter: pop_col = ""
parameter: pops = []
# Name of the color label column in the phenoFile; can be the same as population column
parameter: label_col = str
# Set to True if the samples form one homogeneous population; otherwise PC statistics are computed per population
parameter: homogeneous = False
# Software container option
parameter: container_lmm = 'statisticalgenetics/lmm:1.8'
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 10
# Name of the current step, used to distinguish the RDS and PCA plot outputs of different steps
parameter: name = str
suffix = '_'.join(pops)

## PCA analysis

In [None]:
# Run PCA analysis using flashpca 
[flashpca_1]
# Number of Principal Components to output. Default is 10
parameter: k = 10
# How to standardize X before PCA
parameter: stand = "binom2"
input: genoFile, phenoFile
output: f'{cwd}/{phenoFile:bn}.{(suffix+".") if suffix != "" else ""}pca.{name}.rds'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: container = container_lmm, expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    # Load required libraries
    library(flashpcaR) 
    f <- flashpca(${_input[0]:nr}, ndim=${k}, stand="${stand}", do_loadings=TRUE, check_geno=TRUE)
    # Use the projection file to generate pca plot
    pca <- as.data.frame(f$projection)
    pca <- tibble::rownames_to_column(pca, "ID")
    colnames(pca) <- c("ID",paste0("PC", 1:${k}))
    pca$IID <- sapply(strsplit(as.character(pca$ID),':'), "[", 1)
    # Read the phenotype file
    pheno <- read.table(${_input[1]:r}, sep="\t", header=TRUE)
    pca <- merge(pheno, pca, by="IID", all=FALSE)
    
    #Calculate mean/median/cov per pop
    if(${"FALSE" if homogeneous else "TRUE"}){
        pop_group <- split(pca[ ,c(paste0("PC", 1:${k}))], list(Group = pca$${pop_col}))
        pc_cov <- lapply(pop_group, function(x) cov(x))
        pc_mean <- lapply(pop_group, function(x) sapply(x, mean))
        pc_median <- lapply(pop_group, function(x) sapply(x, median))
    } else {
        pc_cov <- cov(f$projection)
        pc_mean <- apply(f$projection, 2, mean)
        pc_median <- apply(f$projection, 2, median)
    }
    # Write the PC scores to a file
    write.table(pca,"${_output:n}.txt", sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
    # save results
    saveRDS(list(pca_model = f, pc_scores = pca, pc_mean = pc_mean, pc_median = pc_median, pc_cov = pc_cov, k = ${k}, meta = "${_input[1]:bn} ${suffix}"), ${_output:r})

[flashpca_2]
parameter: outlier_file = path()                            
output: f'{_input:nn}.{name}.pc.png',
        f'{_input:nn}.{name}.scree.png'
sos_run("plot_pca", pca_model = _input, outlier_file = outlier_file)

## Project related individuals back

In [None]:
# Project back to PCA model additional samples
[project_samples_1]
# Here genoFile is samples to be projected back
parameter: pca_model = path
input: genoFile, phenoFile, pca_model
output: f'{_input[1]:n}.{name}.projected.rds'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: container=container_lmm, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    # Load required libraries
    library(dplyr)
    library(flashpcaR)
    # Read the PLINK binary files
    frel <- ${_input[0]:nr}
    # Read loadings, center and scale from previous PCA
    dat <- readRDS(${_input[2]:r})
    f <- dat$pca_model
    # Read the bim file to obtain reference alleles
    bim <- read.table('${_input[0]:n}.bim')
    ref <- as.character(bim[,5])
    names(ref) <- bim[,2]
    # Project related samples
    fpro <- project(frel, loadings=f$loadings, orig_mean=f$center, orig_sd=f$scale, ref_allele=ref)
    k = dat$k
    pca <- as.data.frame(fpro$projection)
    pca <- tibble::rownames_to_column(pca, "ID")
    colnames(pca) <- c("ID",paste0("PC", 1:k))
    pca$IID <- sapply(strsplit(as.character(pca$ID),':'), "[", 1)
    # Read the phenotype file
    pheno <- read.table(${_input[1]:r}, sep="\t", header=TRUE)
    pca <- merge(pheno, pca, by="IID", all=FALSE)
    dat$pc_scores = rbind(dat$pc_scores, pca)
    
    # Write the PC scores to a file
    write.table(dat$pc_scores,"${_output:n}.txt", sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
    # save results
    saveRDS(dat, ${_output:r})
  
[project_samples_2]
# Quantile of the Mahalanobis distances used as the outlier cutoff, e.g. 0.95 or 0.997
parameter: prob = 0.997
# Use the robust Mahalanobis distance (centred on the median rather than the mean)
parameter: robust = True
output: distance=f'{_input:n}.{name}.mahalanobis.rds',
        removed_outliers=f'{_input:n}.{name}.outliers',
        analysis_summary=f'{_input:n}.{name}.analysis_summary.md',
        qqplot_mahalanobis=f'{_input:n}.{name}.mahalanobis_qq.png',
        hist_mahalanobis=f'{_input:n}.{name}.mahalanobis_hist.png'
sos_run("detect_outliers", pca_result=_input, prob=prob, robust=robust)

## Plot PCA results


In [None]:
# Plot PCA results. Can be used:
# independently as "plot_pca" or combined with other workflows as "project_samples+plot_pca"
[plot_pca]
parameter: outlier_file = path()
parameter: plot_data = path
parameter: plot_K = int
input: plot_data
output: f'{_input:nn}.{name}.pc.png',
        f'{_input:nn}.{name}.scree.png'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = 1, tags = f'{step_name}_{_output[0]:bn}'
R: container = container_lmm,volumes = [f"{outlier_file:ad}:{outlier_file:ad}"], expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'   
    library(dplyr)
    library(ggplot2)
    library(gridExtra)
    library(matrixStats)
    dat = readRDS(${_input:r})
    f = dat$pca_model
    pca_final = dat$pc_scores
    pca_final<-pca_final%>%
        mutate(${label_col}=as.character(${label_col}))
    k = dat$k
    
    ###
    # Make the plots
    ###
    # Get the min and max values for x and y-axes 
    min_axis <- round(colMins(as.matrix(f$projection[sapply(f$projection, is.numeric)])),1)
    max_axis <- round(colMaxs(as.matrix(f$projection[sapply(f$projection, is.numeric)])),1)
    if (${"TRUE" if outlier_file.is_file() else "FALSE"}) {
        outliers <- read.table(${outlier_file:r}, col.names=c("FID", "IID"))
        plot_pcs = function(pca_final, x, y, title="") {
          ggplot(pca_final, aes_string(x=x, y=y)) + geom_point(aes(color=${label_col}), size=1) + geom_point(data=filter(pca_final, IID %in% outliers$IID), shape = 21, size=2, color='red', stroke = 1) + 
              labs(title=title,x=x, y=y) +
              scale_y_continuous(limits=c(min_axis, max_axis)) +
              scale_x_continuous(limits=c(min_axis, max_axis)) +
              theme_classic()
        }} else {
        plot_pcs = function(pca_final, x, y, title="") {
          ggplot(pca_final, aes_string(x=x, y=y)) + geom_point(aes(color=${label_col}), size=1) + 
              labs(title=title,x=x, y=y) +
              scale_y_continuous(limits=c(min_axis, max_axis)) +
              scale_x_continuous(limits=c(min_axis, max_axis)) +
              theme_classic()
        }}
    unit = 4
    n_col = 4
    n_row = ceiling(k / n_col)
    plots = lapply(1:(${plot_K}-1), function(i) plot_pcs(pca_final, paste0("PC",i), paste0("PC",i+1), dat$meta))
    png('${_output[0]}', width = unit * n_col, height = unit * n_row, unit='in', res=300)
    do.call(gridExtra::grid.arrange, c(plots, list(ncol = n_col, nrow = n_row)))
    dev.off()
    # Create scree plot
    PVE <- f$values
    PVE <- round(PVE/sum(PVE), 2)
    PVEplot <- qplot(c(1:k), PVE) + geom_line() + xlab("Principal Component") + ylab("PVE") + ggtitle("Scree Plot") + ylim(0, 1) +scale_x_discrete(limits=factor(1:k))
    PVE_cum <- cumsum(PVE)/sum(PVE)
    cumPVEplot <- qplot(c(1:k), PVE_cum) + geom_line() + xlab("Principal Component") + ylab("Cumulative PVE") + ggtitle("Cumulative PVE Plot") + ylim(0, 1) + scale_x_discrete(limits=factor(1:k))
    png('${_output[1]}', width = 8, height = 4, unit='in', res=300)
    grid.arrange(PVEplot, cumPVEplot, nrow = 1)
    dev.off()
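The scree plot in the code above is based on the proportion of variance explained (PVE) by each component. As a minimal sketch of that calculation, the function below turns a list of eigenvalues into PVE and cumulative PVE values. The function name and the example eigenvalues are hypothetical. Note that when only the top `k` eigenvalues are available, the proportions are relative to the variance captured by those `k` components rather than to the total variance.

```python
from itertools import accumulate


def proportion_variance_explained(eigenvalues):
    """Return (pve, cumulative_pve) for a list of non-negative eigenvalues."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    pve = [value / total for value in eigenvalues]
    return pve, list(accumulate(pve))


# Hypothetical eigenvalues of the first four components
pve, cumulative = proportion_variance_explained([4.0, 3.0, 2.0, 1.0])
print(pve)
print(cumulative)
```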

## Detect outliers

In [None]:
##### Calculate Mahalanobis distance per population and report outliers
[detect_outliers]
# Quantile of the Mahalanobis distances used as the outlier cutoff, e.g. 0.95 or 0.997
parameter: prob = 0.997
# Use the robust Mahalanobis distance (centred on the median rather than the mean)
parameter: robust = True
parameter: pca_result = path
input: pca_result
output: distance=f'{_input:n}.{name}.mahalanobis.rds',
        removed_outliers=f'{_input:n}.{name}.outliers',
        analysis_summary=f'{_input:n}.{name}.analysis_summary.md',
        qqplot_mahalanobis=f'{_input:n}.{name}.mahalanobis_qq.png',
        hist_mahalanobis=f'{_input:n}.{name}.mahalanobis_hist.png'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = 1, tags = f'{step_name}_{_output[0]:bn}'
bash: container = container_lmm, expand = "${ }"
    echo '''---
    theme: base-theme
    style: |
      img {
        height: 80%;
        display: block;
        margin-left: auto;
        margin-right: auto;
      }
    ---    
    ''' > ${_output[2]}
    
R: container = container_lmm, expand= "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    # Load required libraries
    library(dplyr)
    library(ggplot2)
    library(gridExtra)
  
    # Invert a covariance matrix; if it is numerically singular, fall back to the nearest positive-definite matrix (still assuming full rank)
    robust_inv = function(s) {
        tryCatch(solve(s), error=function(cond) solve(Matrix::nearPD(s)$mat))
    }

    # Calculate mahalanobis distance  
    calc_mahalanobis_dist = function(x, m, s, name = '', prob=${prob}) {
        pc <- x %>%
          select("IID","FID", starts_with("PC"))
        mu_pc <- pc %>%
        select(starts_with("PC"))
        mu = t(t(mu_pc) - m)
        pc$mahal = rowSums(mu %*% robust_inv(s) * mu)
        pc$p <- pchisq(pc$mahal, df=nrow(s), lower.tail=FALSE)
        manh_dis_sq_cutoff = quantile(pc$mahal, probs=prob,na.rm=TRUE)
        # Obtain outliers
        outliers = pc[(pc$mahal > manh_dis_sq_cutoff),]
        d_summary = paste0(capture.output(summary(pc$mahal)), collapse = '\n')
        msg = paste('#', name, "result summary\n## Mahalanobis distance summary:\n```\n", d_summary, "\n```\n", 
            paste("The cut-off for outlier removal is set to:", manh_dis_sq_cutoff, "and the number of individuals to remove is:", nrow(outliers),"\n"),
            paste("The new sample size after outlier removal is:",nrow(pc) - nrow(outliers),"\n"))
        #
        outliers <- outliers %>%
        mutate(FID = IID) %>% 
        select(FID, IID)%>% 
        filter(!is.na(IID))
      list(pc=pc, manh_dis_sq_cutoff=manh_dis_sq_cutoff, msg=msg, outliers=outliers)
    }

    dat = readRDS(${_input:r})
    if (is.list(dat$pc_mean)) {
      pops = names(dat$pc_mean)
      pop_group = split(dat$pc_scores, f = dat$pc_scores$${pop_col})
      res = lapply(pops, function(p) calc_mahalanobis_dist(pop_group[[p]], dat$${"pc_mean" if not robust else "pc_median"}[[p]], dat$pc_cov[[p]], name = paste(dat$meta, p)))
      names(res) = pops
      res = list(
          msg = do.call(paste, c(lapply(pops, function(p) res[[p]]$msg), sep = "\n")),
          manh_dis_sq_cutoff = cbind(pops, sapply(pops, function(p) res[[p]]$manh_dis_sq_cutoff)),
          outliers = do.call(rbind, c(lapply(pops, function(p) res[[p]]$outliers))),
          pc = do.call(rbind, c(lapply(pops, function(p) res[[p]]$pc)))
          )
    } else {
      res = calc_mahalanobis_dist(dat$pc_scores, dat$${"pc_mean" if not robust else "pc_median"}, dat$pc_cov, name = dat$meta)
    }
      
    write(res$msg, ${_output[2]:r})   
    # Plot mahalanobis
    k = dat$k
    png('${_output[3]}', width = 4, height = 4, unit='in', res=300)
    qqplot(qchisq(ppoints(100), df=k), res$pc$mahal, main = expression("Mahalanobis" * ~D^2 * " vs. quantiles of" * ~ chi[k]^2), xlab = expression(chi[k]^2 * ", probability points = 100"), ylab = expression(D^2), pch=16)
    dev.off() 
    png('${_output[4]}', width = 4, height = 4, unit='in', res=300)
    ggplot(res$pc, aes(x=mahal)) + geom_histogram(aes(y = ..count..), binwidth = 0.5, colour = "#1F3552", fill = "#4271AE") + scale_x_continuous(name = "Mahalanobis distance") + theme_classic()
    dev.off()
  
    # Save results and outliers
    saveRDS(res,${_output[0]:r})
    write.table(res$outliers, ${_output[1]:r}, sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)