# Principal component analysis

This notebook performs principal component analysis (PCA) on genotype data and generates plots.

## Overview


Steps to generate a PCA include 

- removing related individuals
- pruning variants in linkage disequilibrium (LD)
- performing PCA on the genotypes
- excluding outlier samples in the PCA space for individuals of homogeneous self-reported ancestry. These outliers may suggest poor genotyping quality or distant relatedness.

Pitfalls

1. Some PCs may capture LD structure rather than population structure, which reduces the power to detect associations in regions of high LD.
2. When a new study dataset is projected onto the PCA space computed from a reference dataset, the projected PCs are shrunk toward 0 in the new dataset.
3. PC scores may capture outliers that are due to family structure, population structure or other reasons. It can be beneficial to detect and remove these individuals, either to maximize the population structure captured by PCA (when removing a few outliers) or to restrict analyses to genetically homogeneous samples.

## Input

1. `bfile` in PLINK format (`.bed`, `.bim` and `.fam`): the genotype array data used to estimate relatedness between individuals and, when `genoFile` is not provided, to perform PCA
2. `genoFile` in PLINK format: can differ from `bfile`, for example when using exome data or imputed genotype data
3. Relatedness file (relatedness can be calculated beforehand using PLINK 2, which implements the KING algorithm)

## Output

From the kinship analysis
1. Kinship table

From the PCA analysis
1. Eigenvalues / singular values
2. Eigenvectors
3. Projection (PCs)
4. Loadings
5. Scale
6. Mahalanobis distances for outlier removal

## General workflow

1. Estimate relatedness of the individuals in the sample using PLINK 2, which implements the KING algorithm
2. Select SNPs and samples using PLINK (QC: MAF > 1%, variant missing rate (`geno`) = 0.1, individual missing rate (`mind`) = 0.1 and `hwe` = 5e-08)
3. Thin SNPs by LD pruning (window=50, shift=10, r2=0.1 are the defaults) and remove related individuals prior to PCA calculation
4. Run the first PCA using only unrelated individuals
5. Calculate the Mahalanobis distance per population and create the outlier removal file. The default criterion described here is 6 SD from the mean (note that the outlier step in this notebook takes its cut-off from the `prob` quantile of the distances, default 0.95); we recommend checking the output plots before and after removal and reconsidering the threshold if needed.
6. Re-calculate the PCs without outliers
7. Project related samples onto the PC space
8. Second round of outlier removal
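
The outlier detection in step 5 can be illustrated with a small, self-contained Python sketch (it is not part of the workflow). It assumes only two PCs, uses hypothetical PC scores, and introduces the helper name `mahalanobis_sq_2d` for illustration. Unlike the notebook code, which cuts at an empirical quantile of the distances, this sketch uses the 0.95 quantile of a chi-square distribution with 2 degrees of freedom, which has the closed form $-2\ln(1 - 0.95)$.

```python
import math

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance entries (denominator n - 1, as in R's cov())
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    dists = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^T S^{-1} d with the inverse of the 2x2 covariance written out
        dists.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return dists

# Hypothetical PC1/PC2 scores: a tight cluster plus one distant sample
scores = [(0.0, 0.1), (0.1, -0.1), (-0.1, 0.0), (0.05, 0.05), (-0.05, -0.05),
          (0.1, 0.1), (-0.1, -0.1), (0.0, -0.05), (3.0, 3.0)]
d2 = mahalanobis_sq_2d(scores)

# Assumed cut-off: 0.95 quantile of chi-square with 2 degrees of freedom
cutoff = -2 * math.log(1 - 0.95)
flagged = [i for i, d in enumerate(d2) if d > cutoff]
print(flagged)
```

With more than two PCs the same quantity is $d^T S^{-1} d$ with a $k \times k$ covariance matrix, which is what the R code in this notebook computes with `solve()`.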

## Method

Here is a quick recap of PCA for those not immediately familiar with the method. PCA is a mathematical method that reduces the dimensionality of the data while retaining most of the variation in the dataset. 
This is accomplished by identifying directions, or principal components (PCs), that account for the maximum variation in the data. 

One common approach to PCA is based on the singular-value decomposition of the data matrix $X$ (in our case the genotype matrix),
$$X = U D V^T,$$
where the columns of $U$ are the left singular vectors (the eigenvectors of $X X^T$), $D$ is the diagonal matrix of singular values, and the columns of $V$ are the right singular vectors (also called loadings). 

PCA can also be done using the eigen-decomposition of $X X^T$
$$X X^T = U S U^T,$$ where $S=D^2$ is the diagonal matrix of eigenvalues.
$X$ is usually centred (mean-subtracted) or standardised (mean subtracted, then divided by standard deviation) before PCA.
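
The link between the two decompositions follows by substitution, using only the orthonormality of the columns of $V$ ($V^T V = I$) and the fact that $D$ is diagonal:
$$X X^T = (U D V^T)(U D V^T)^T = U D V^T V D U^T = U D^2 U^T.$$
Multiplying the SVD on the right by $V$ gives $X V = U D$, so projecting the data onto the loadings yields the same PC scores as scaling the columns of $U$ by the singular values.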

For PCA of SNP genotypes (at least in diploid organisms), the common standardisation is
$$X_{ij}^{\prime} = \frac{X_{ij} - 2p_j}{\sqrt{2 p_j (1 - p_j)}},$$
where $X_{ij}$ is the genotype (minor allele dosage $\{0, 1, 2\}$) for the $i$th individual and the $j$th SNP, and $p_j$ is the minor allele frequency (MAF) for the $j$th SNP. In addition, the eigenvalues are scaled by the number of SNPs $m$ (equivalent to performing the eigen-decomposition of $XX^T/m$).
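
As an illustration of the standardisation above, here is a minimal sketch in plain Python (it is not part of the workflow, which relies on flashpca). The function name `standardise_genotypes` and the toy genotype matrix are hypothetical, and the allele frequency $p_j$ is assumed to be estimated from the sample itself as the mean dosage divided by 2.

```python
import math

def standardise_genotypes(genotypes):
    """Standardise an individuals-by-SNPs matrix of allele dosages (0, 1 or 2)."""
    n = len(genotypes)
    m = len(genotypes[0])
    # Frequency of the counted allele per SNP: mean dosage divided by 2
    freqs = [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]
    standardised = []
    for row in genotypes:
        new_row = []
        for j, p in enumerate(freqs):
            sd = math.sqrt(2 * p * (1 - p))
            # Monomorphic SNPs (p = 0 or 1) have zero variance; set them to 0 here
            new_row.append((row[j] - 2 * p) / sd if sd > 0 else 0.0)
        standardised.append(new_row)
    return standardised, freqs

# Toy data: 4 individuals (rows) by 3 SNPs (columns)
genotypes = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 0],
    [1, 2, 0],
]
X_std, p = standardise_genotypes(genotypes)
print(p)
```

Each standardised column is centred at 0 by construction. Because the divisor is the binomial standard deviation $\sqrt{2 p_j (1 - p_j)}$ rather than the sample standard deviation, the columns are not forced to have exactly unit sample variance.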


## How to run this workflow

A minimal working example can be [obtained here](https://drive.google.com/drive/folders/15gpOTi7RKFnDYuiHbY5-F-n2GeLx2ZZt?usp=sharing).

FIXME: have to show people how to run the GWAS QC step, then the actual PCA workflow. This may also need multiple runs to reflect the iterative procedure in steps 1 - 8 listed above. You can also fix the narratives below, which I recycled from the QC pipeline, to fit the PCA context.

**Notice: there is no notion of steps in the pipeline definition; rather the steps below will be modular and you call them here with different steps. That is, your MWE demo will be running a few `sos` commands as different steps**


FIXME: Diana I cannot get the MWE work out of the box so I'll leave all my edits untested, and I'll rely on you to iron out edges

```
ERROR: Function named_output can only be used in input or depends statements
```

is the error message I get when I download the MWE and run the command.


### Step 1: Estimate kinship in the sample
Aim: identify and remove closely related individuals prior to PCA analysis.


### Step 2 and 3: QC the genetic data 

- Step 2: QC based on MAF, sample and variant missingness and Hardy-Weinberg equilibrium. You can provide a list of samples to keep, for example when you want to subset the sample by ancestry and analyze each group independently.
- Step 3: Prune SNPs in linkage disequilibrium so that the PCA captures population structure rather than LD structure (which could reduce the power to detect genetic associations in these LD regions). Related individuals are also removed to generate the cleaned `bfile` used in the PCA calculations.
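
To make the idea behind LD pruning in Step 3 concrete, here is a deliberately simplified Python sketch. It is not PLINK's procedure: it ignores the sliding window and shift, compares every SNP against all SNPs kept so far, and uses hypothetical names (`r_squared`, `greedy_prune`) and toy dosages. It only shows how an $r^2$ threshold removes redundant SNPs.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two dosage vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as 0 here
    return cov * cov / (var_a * var_b)

def greedy_prune(snps, r2_threshold):
    """Keep a SNP only if its r^2 with every SNP kept so far is below the threshold."""
    kept = []
    for name, dosages in snps:
        if all(r_squared(dosages, kept_dosages) < r2_threshold
               for _, kept_dosages in kept):
            kept.append((name, dosages))
    return [name for name, _ in kept]

# Toy dosages for 6 individuals at 3 SNPs
snps = [
    ("snp1", [0, 1, 2, 1, 0, 2]),
    ("snp2", [0, 1, 2, 1, 0, 2]),  # identical to snp1, so r^2 = 1
    ("snp3", [1, 0, 1, 2, 1, 1]),  # uncorrelated with snp1 in this sample
]
pruned = greedy_prune(snps, r2_threshold=0.1)
print(pruned)
```

In this toy example `snp2` is dropped because it carries the same information as `snp1`, while `snp3` is retained.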

### Step 4: PCA analysis

### Step 5: Generate bed file with outliers removed for 2nd round of PCA

### Step 6: Do second round of PCA

### Step 7: Project related individuals

### Step 8: Remove outliers

```
sos run ~/project/bioworkflows/GWAS/PCA.ipynb flashpca \
    --cwd ~/output \
    --bfile burden/genotypes_21_22_plink.exome.bed \
    --genoFile burden/ukb23155_c2*_b0_v1.plink.exome.filtered.bed \
    --phenoFile burden/phenotype_burden_pca.txt \
    --keep_samples burden/unrelated_ind_burden.txt \
    --k 10 \
    --window 50 \
    --shift 10 \
    --r2 0.5 \
    --maf_filter 0.5 \
    --geno_filter 0.2 \
    --mind_filter 0.2 \
    --hwe_filter 0.0 \
    --trait_name ethnicity \
    --numThreads 1 \
    --job_size 1 \
    --container_lmm /gpfs/gibbs/pi/dewan/data/UKBiobank/lmm.sif
```

## Command interface

In [2]:
sos run PCA.ipynb -h

usage: sos run PCA.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  king
  filter
  flashpca

Global Workflow Options:
  --cwd VAL (as path, required)
                        the output directory for generated files
  --bfile VAL (as path, required)
                        Path to genotype array file
  --genoFile  paths

                        Plink binary files
  --phenoFile VAL (as path, required)
                        The phenotypic file
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --numThreads 1 (as int)
                        Number of threads
  --k VAL (as int, required)
                        Number of Principal Co

## PCA analysis pipeline

In [1]:
[global]
# the output directory for generated files
parameter: cwd = path
# Plink binary file
parameter: genoFile = path
# The phenotypic file
parameter: phenoFile = path
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Number of threads
parameter: numThreads = 20

# Name of the population column in the phenoFile (format FID, IID, father, mother, pop)
parameter: pop_col = str
parameter: pop = []
# Software container option
parameter: container_lmm = 'statisticalgenetics/lmm:1.8'
suffix = '_'.join(pop)

## PCA analysis

In [None]:
# Run PCA analysis using flashpca 
[flashpca_1]
# Number of Principal Components to output. Default is 10
parameter: k = 10
# How to standardize X before PCA
parameter: stand = "binom2"
input: genoFile, phenoFile
output: f'{cwd}/{phenoFile:bn}.{(suffix+".") if suffix != "" else ""}pca.rds'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: container = container_lmm, expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    # Load required libraries
    library(flashpcaR)
  
    ###
    # FIXME: first extract data based on pop -- can we do that with flashpca?
    ###
  
    f <- flashpca(${_input[0]:nr}, ndim=${k}, stand="${stand}", do_loadings=TRUE, check_geno=TRUE)
    # Use the projection file to generate pca plot
    pca <- f$projection
    pc_cov <- cov(f$projection[,-1])
    pc_mean <- apply(f$projection[,-1], 2, mean)
    colnames(pca) <- c("ID",paste0("PC", 1:${k}))
    pca$IID <- sapply(strsplit(as.character(pca$ID),':'), "[", 1)
    # Read fam file with phenotypes
    pheno <- read.table(${_input[1]:r}, sep="\t", header=T )
    pca <-merge(pheno, pca, by="IID", all=FALSE) 
  
    # save results
    saveRDS(list(pca_model = f, pc_scores = pca, pc_mean=pc_mean, pc_cov = pc_cov, meta = "${_input[1]:bn} ${suffix}"), ${_output:r})

In [None]:
# Plot the PCA results (PC scatter plots and scree plots)
[plot_pca]
parameter: pca_result = path
input: pca_result
output: f'{cwd}/{_input:bn}.pc.png',
        f'{cwd}/{_input:bn}.scree.png'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: container = container_lmm, expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'   
    library(dplyr)
    library(ggplot2)
    library(gridExtra)
  
    dat = readRDS(${_input:r})
    f = dat$pca_model
    pca_final = dat$pc_scores
    k = length(dat$pc_mean)
    ###
    # Make the plots
    ###
    # Get the overall min and max PC values so that all panels share the same axis limits
    pc_values <- as.matrix(f$projection[sapply(f$projection, is.numeric)])
    min_axis <- floor(min(pc_values) * 10) / 10
    max_axis <- ceiling(max(pc_values) * 10) / 10
    # Each "+" must end a line; a line starting with "+" is parsed as a separate expression
    plot_pcs = function(pca_final, x, y, title="") {
      ggplot(pca_final, aes_string(x=x, y=y)) + geom_point(aes(color=${pop_col}, shape=${pop_col}), size=2) +
          labs(title=title,x=x, y=y) +
          scale_y_continuous(limits=c(min_axis, max_axis)) +
          scale_x_continuous(limits=c(min_axis, max_axis)) +
          theme_classic()
    }
    # One panel per pair of consecutive PCs (PC1 vs PC2, PC2 vs PC3, ...)
    plots = lapply(1:(k-1), function(i) plot_pcs(pca_final, paste0("PC",i), paste0("PC",i+1), dat$meta))
    n_col = 3
    # Use enough rows to hold all panels
    n_row = ceiling(length(plots) / n_col)
    png('${_output[0]}', width = 8, height = 4 * n_row, unit='in', res=300)
    do.call(gridExtra::grid.arrange, c(plots, list(ncol = n_col, nrow = n_row)))
    dev.off()
    # Create scree plot
    PVE <- f$values
    PVE <- round(PVE/sum(PVE), 2)
    PVEplot <- qplot(c(1:k), PVE) + geom_line() + xlab("Principal Component") + ylab("PVE") + ggtitle("Scree Plot") + ylim(0, 1) + scale_x_discrete(limits=factor(1:k))
    # Cumulative proportion of variance explained
    PVE_cum <- cumsum(PVE)/sum(PVE)
    cumPVEplot <- qplot(c(1:k), PVE_cum) + geom_line() + xlab("Principal Component") + ylab("Cumulative PVE") + ggtitle("Cumulative PVE Plot") + ylim(0, 1) + scale_x_discrete(limits=factor(1:k))
    png('${_output[1]}', width = 8, height = 4, unit='in', res=300)
    grid.arrange(PVEplot, cumPVEplot, ncol = 2)
    dev.off()

**FIXME: need an inner loop to do this per population.** I leave it as a parameter `per_pop`; in this case both the global mean and covariance and the per-population mean and covariance have to be saved from the `flashpca` step.

In [None]:
# Calculate Mahalanobis distance per population and report outliers
[flashpca_2, project_samples_2]
# Set the probability to remove outliers eg 0.95 or 0.997
parameter: prob = 0.95
parameter: per_pop = False
parameter: pca_result = path
output: distance=f'{_input:n}.mahalanobis',
        removed_outliers=f'{_input:n}.no_outliers',
        analysis_summary=f'{_input:n}.analysis_summary.md',
        qqplot_mahalanobis=f'{_input:n}.mahalanobis_qq.png',
        hist_mahalanobis=f'{_input:n}.mahalanobis_hist.png'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container = container_lmm, expand = "${ }"
    echo '''---
    theme: base-theme
    style: |
      img {
        height: 80%;
        display: block;
        margin-left: auto;
        margin-right: auto;
      }
    ---    
    ''' > ${_output[2]}
    
R: container = container_lmm, expand= "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    # Load required libraries
    library(dplyr)
    library(ggplot2)
    library(gridExtra)
  
    dat = readRDS(${_input:r})
    pc <- dat$pc_scores %>%
          select("IID", "${pop_col}",starts_with("PC"))
    k = length(dat$pc_mean)
    # Calculate mahalanobis distance
    mu = t(t(pc[,c(-1,-2)]) - dat$pc_mean)
    pc$mahal = rowSums(mu %*% solve(dat$pc_cov) * mu)
    pc$p <- pchisq(pc$mahal, df=k, lower.tail=FALSE)
    
    # Set the cut-off to calculate outliers
    out = capture.output(summary(pc$mahal))
    write('# ${phenoFile:b} ${suffix} result summary\n## Mahalanobis distance summary:\n```', ${_output[2]:r}, append = T)
    write.table(out, '${_output[2]}', sep="\n", append=TRUE, quote=FALSE, row.names=FALSE, col.names=FALSE)
    write("```", ${_output[2]:r}, append = T)
    
    manh_dis_sq_cutoff = quantile(pc$mahal,probs = ${prob}) # eg can use something like 6 sd from the mean chi-square double sided
    write(paste("The cut-off for outlier removal is set to:",manh_dis_sq_cutoff,"and the number of individuals removed is:", length(which(pc$mahal > manh_dis_sq_cutoff)),"\n"), ${_output[2]:r}, append = T)
    
    # Plot mahalanobis
    png('${_output[3]}', width = 4, height = 4, unit='in', res=300)
    qqplot(qchisq(ppoints(100), df=k), pc$mahal, main = expression("Q-Q plot of Mahalanobis" * ~D^2 * " vs. quantiles of" * ~ chi[k]^2), xlab = expression(chi[k]^2 * ", probability points = 100"), ylab = expression(D^2))
    dev.off()
    
    png('${_output[4]}', width = 4, height = 4, unit='in', res=300)
    ggplot(pc, aes(x=mahal)) + geom_histogram(aes(y = ..count..), binwidth = 0.5, colour = "#1F3552", fill = "#4271AE") + scale_x_continuous(name = "Mahalanobis distance") + theme_classic()
    dev.off()
    # Obtain the new sample
    new_sample = pc[(pc$mahal <= manh_dis_sq_cutoff),1]
    cat("The new sample size after outlier removal is:",length(new_sample),"\n", file = ${_output[2]:r}, append = T)
    
    sample_df <- pc %>%
    mutate(FID = IID) %>%
    select(FID, IID, ${pop_col},starts_with("PC"), mahal, p)
  
    new_sample_no_out <- pc %>%
    filter(IID %in% new_sample) %>%
    mutate(FID=IID) %>%
    select(FID,IID)
  
    # Save file with mahalanobis distance and IID for PCA recalculation
    write.table(sample_df,${_output[0]:r}, sep="\t", quote=FALSE, row.names=FALSE, col.names=TRUE)
    write.table(new_sample_no_out,${_output[1]:r}, sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)

## Project related individuals back

In [None]:
# Project samples onto the PC space computed in a previous PCA run
[project_samples_1]
# Here genoFile is samples to be projected back
parameter: pca_result = path
input: genoFile, phenoFile, pca_result
output: f'{_input[1]:n}.projected.rds'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: container=container_lmm, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    # Load required libraries
    library(dplyr)
    library(flashpcaR)
    # Read the PLINK binary files
    frel <- ${_input[0]:nr}
    # Read loadings, center and scale from previous PCA
    dat <- readRDS(${_input[2]:r})
    f <- dat$pca_model
    # Read the bim file to obtain reference alleles
    bim <- read.table('${_input[0]:n}.bim')
    ref <- as.character(bim[,5])
    names(ref) <- bim[,2]
    # Project related samples
    fpro <- project(frel, loadings=f$loadings, orig_mean=f$center, orig_sd=f$scale, ref_allele=ref)
    pca <- fpro$projection
    k = length(dat$pc_mean)
    colnames(pca) <- c("ID",paste0("PC", 1:k))
    pca$IID <- sapply(strsplit(as.character(pca$ID),':'), "[", 1)
    # Read fam file with phenotypes
    pheno <- read.table(${_input[1]:r}, sep="\t", header=T )
    pca <-merge(pheno, pca, by="IID", all=FALSE) 
    dat$pc_scores = rbind(dat$pc_scores, pca)
    
    saveRDS(dat, ${_output:r})