# PoPS Workflow

Polygenic Priority Score (PoPS) evaluates the associations between traits and genes, aiming to prioritize the causal genes for complex biological traits. It integrates GWAS summary statistics with gene expression, biological pathway, and predicted protein-protein interaction data. This pipeline is modified from the [FinucaneLab's code](https://github.com/FinucaneLab/gene_features/tree/master/code).

> [Weeks, E. M., Ulirsch, J. C., Cheng, N. Y., Trippe, B. L., Fine, R. S., Miao, J., ... & Finucane, H. K. (2020). Leveraging polygenic enrichments of gene features to predict genes underlying complex traits and diseases. medRxiv.](https://doi.org/10.1101/2020.09.08.20190561)

## Method

### Primary GLS model 

$$y=X^f\beta^f+\varepsilon,\ \ \varepsilon\sim MVN(0,R)$$

For the primary step, PoPS takes gene-level z-scores (computed by MAGMA from GWAS summary statistics) as $y$ and gene features extracted from gene expression, biological pathways, and predicted PPI networks as $X^f$, and fits a Generalized Least Squares (GLS) model. The model accounts for the correlation between genes through the variance-covariance matrix $R$.
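
As an illustration of the estimator behind this model, $\hat{\beta}=(X^{\top}R^{-1}X)^{-1}X^{\top}R^{-1}y$, the sketch below fits a single feature using only the Python standard library. The toy values and the helper names `solve` and `gls_single_feature` are assumptions made for this example; they are not part of the PoPS code.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def gls_single_feature(x, y, r):
    """GLS slope for one feature: (x' R^-1 y) / (x' R^-1 x)."""
    rinv_x = solve(r, x)  # R^-1 x, without forming R^-1 explicitly
    # R is symmetric, so x' R^-1 y equals (R^-1 x)' y.
    num = sum(a * b for a, b in zip(rinv_x, y))
    den = sum(a * b for a, b in zip(rinv_x, x))
    return num / den


# Toy data (hypothetical): three genes, the first two are correlated.
R = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # exactly 2 * x, so the fitted slope should be 2

beta_hat = gls_single_feature(x, y, R)
print(round(beta_hat, 6))
```

With several features the same formula applies with matrices in place of the scalar numerator and denominator; a numerical library would normally be used for that.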

### Final GLS model 

$$y=C\alpha+X^f\beta^f+\varepsilon,\ \ \varepsilon\sim MVN(0,R)$$

After fitting the primary GLS model, PoPS filters gene features by marginal selection. For the remaining features, a final GLS model is fit using a Leave One Chromosome Out (LOCO) framework with a ridge penalty. For each gene $g$ on chromosome $i$, the coefficients of the remaining gene features, $\hat{\beta}_{-chr i}$, are estimated from the genes on all other chromosomes. The model also incorporates a term $C\alpha$ for gene-level covariates (i.e. gene length, effective gene size).
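
The LOCO idea can be sketched in a few lines. The example below is a deliberately simplified assumption: one feature, no covariates, no gene-gene correlation matrix, and the closed-form ridge slope $\sum xy / (\sum x^2 + \lambda)$. The function name `loco_ridge` and the toy data are hypothetical and only show how each chromosome gets coefficients fit without its own genes.

```python
from collections import defaultdict


def loco_ridge(genes, lam=1.0):
    """Return {chromosome: beta}, with beta fit on all genes NOT on that chromosome.

    genes: list of (chromosome, x, y) tuples, one feature value x per gene.
    Single-feature ridge without intercept: beta = sum(x*y) / (sum(x*x) + lam).
    """
    sxy_total = sum(x * y for _, x, y in genes)
    sxx_total = sum(x * x for _, x, _ in genes)
    sxy_chr = defaultdict(float)
    sxx_chr = defaultdict(float)
    for chrom, x, y in genes:
        sxy_chr[chrom] += x * y
        sxx_chr[chrom] += x * x
    # Subtracting a chromosome's own sums is the same as refitting without it.
    return {
        chrom: (sxy_total - sxy_chr[chrom]) / (sxx_total - sxx_chr[chrom] + lam)
        for chrom in sxy_chr
    }


# Toy data (hypothetical): y = 2x on chromosome 1 and y = x on chromosome 2.
genes = [(1, 1.0, 2.0), (1, 2.0, 4.0), (2, 1.0, 1.0), (2, 3.0, 3.0)]
betas = loco_ridge(genes, lam=0.0)
print(betas)
```

Because the coefficient used for a chromosome never sees that chromosome's genes, a gene's own association signal cannot leak into its predicted score.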

### PoPS calculation

$$\hat{y}_{g}=X_{g}\hat{\beta}_{-chr i}$$

For gene $g$ on chromosome $i$, the Polygenic Priority Score $\hat{y}_{g}$ equals $X_{g}$ times $\hat{\beta}_{-chr i}$.
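
A minimal sketch of this calculation, assuming the per-chromosome coefficient vectors have already been estimated. The gene names, feature values, and coefficients are made up for illustration.

```python
def pops_score(features, beta):
    """Dot product of one gene's feature row with a coefficient vector."""
    return sum(x * b for x, b in zip(features, beta))


# Hypothetical coefficients, one vector per left-out chromosome.
beta_loco = {1: [0.5, -1.0], 2: [0.25, 2.0]}
# Hypothetical genes: name -> (chromosome, feature row).
genes = {"GENE_A": (1, [2.0, 1.0]), "GENE_B": (2, [4.0, 0.5])}

# Each gene is scored with the coefficients fit without its own chromosome.
scores = {
    name: pops_score(row, beta_loco[chrom])
    for name, (chrom, row) in genes.items()
}
print(scores)
```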

![image.png](https://github.com/UxxUnet/prototype/blob/main/Picture1.png?raw=true)

## Overview

### Step 1 - MAGMA

#### Inputs
        
  - Reference panel

    `--bfile` A binary PLINK format data set, consisting of a .bed, .bim and .fam trio of files, is required for the reference panel. The `1000G.EUR.bed/bim/fam` files contain the necessary reference panel data for Europeans in the 1000 Genomes Project phase 3.

  - MAGMA gene annotation

    `--gene_annot` The MAGMA gene annotation file is created by running MAGMA with the `--annotate` flag. Each row of the MAGMA annotation file corresponds to a gene and contains the gene ID, a specification of the gene's location, and a list of SNP IDs of SNPs mapped to that gene. The `magma_0kb.genes.annot` file is a MAGMA annotation file for the 18,383 protein coding genes using SNPs in the 1000 Genomes phase 3 reference panel and a 0 Kb window around the gene body.

  - GWAS summary statistics

    `--pval` The GWAS summary statistics file. To designate the column that contains the per-SNP sample size, use the `ncol=<colname>` modifier.

#### Outputs

  - `.genes.out` file
  
    The `.genes.out` file contains the MAGMA gene analysis results in human-readable format. This file contains the gene z-scores and relevant data to construct the control covariates in the joint prediction model.

  - `.genes.raw` file
  
    The `.genes.raw` file is the intermediary file that serves as the input for subsequent analyses. This file contains the data required to construct the gene-gene correlation matrix.
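
For downstream inspection, the `.genes.out` table can be read with a few lines of Python. The sketch below assumes a whitespace-delimited file with a single header row and the column names `GENE CHR START STOP NSNPS NPARAM N ZSTAT P`; check the header of your own file before relying on these names. The two data rows are invented.

```python
import io


def read_genes_out(handle):
    """Parse a whitespace-delimited MAGMA .genes.out table into a list of dicts."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        if not line.strip():
            continue  # skip empty lines
        rows.append(dict(zip(header, line.split())))
    return rows


# Made-up rows; column names follow the assumed MAGMA layout.
example = io.StringIO(
    "GENE CHR START STOP NSNPS NPARAM N ZSTAT P\n"
    "ENSG_X 1 1000 2000 12 5 50000 2.5 0.0062\n"
    "ENSG_Y 1 5000 9000 30 9 50000 -0.4 0.6554\n"
)
rows = read_genes_out(example)
zscores = {row["GENE"]: float(row["ZSTAT"]) for row in rows}
print(zscores)
```

To read a real file, pass `open("AFib.genes.out")` instead of the `io.StringIO` object.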

### Step 2 - Gene features

* Read in, QC, filter, scale, and normalize data

* Perform PCA and independent component analysis (ICA) across all cells or meta-cells or tissues

* Perform clustering and UMAP and plot features on projection

* Perform differential expression analysis

#### Inputs

  - Gene expression
  
     `--expr-matrix` The RNA-seq gene expression matrix.
     
  - Metadata
  
     `--tsv` Additional cell-level metadata to add to the Seurat object. Should be a data frame where the rows are cell names and the columns are additional metadata fields.

  - Gene annotations
  
     `--anno_file`  The gene ENSG id, chr, position and TSS annotations
     
  - Gene symbol
  
     `--symbol` The mapping used to convert between gene ENSG IDs and gene symbols

#### Output

  - Unweighted gene loadings from PCA
  
    `/features/human_airway_projected_pcaloadings.txt.gz`
    
  - Unweighted gene loadings from ICA 
  
    `/features/human_airway_projected_icaloadings.txt.gz`
    
  - Unweighted gene loadings from PCA within cluster 
  
    `features/human_immune/projected_pcaloadings_clusters.txt.gz`
    
  - Average expression across clusters (pre-defined and identified) 
  
    `/features/human_airway_average_expression.txt`
  
  - One-vs-all t-test statistic (pre-defined and identified) 
  
    `/features/human_airway_diffexprs_tstat_clusters.txt`
  
  - Differentially expressed genes across clusters (pre-defined and identified) 
  
    `/features/human_airway_diffexprs_genes_clusters.txt`
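
To make the one-vs-all statistic concrete, the sketch below computes a t-statistic for one gene by comparing the cells in a cluster against all remaining cells. It assumes Welch's unequal-variance form; the workflow delegates this step to helper functions in `utils.R`, whose exact test may differ. The expression values and labels are invented.

```python
from math import sqrt
from statistics import mean, variance


def one_vs_all_tstat(values, labels, cluster):
    """Welch t-statistic comparing cells in `cluster` against all other cells."""
    inside = [v for v, lab in zip(values, labels) if lab == cluster]
    outside = [v for v, lab in zip(values, labels) if lab != cluster]
    # Standard error from the two sample variances (no pooling).
    se = sqrt(variance(inside) / len(inside) + variance(outside) / len(outside))
    return (mean(inside) - mean(outside)) / se


# Hypothetical expression of one gene across six cells in two clusters.
expr = [5.0, 6.0, 7.0, 1.0, 2.0, 3.0]
labels = ["a", "a", "a", "b", "b", "b"]

t_a = one_vs_all_tstat(expr, labels, "a")
t_b = one_vs_all_tstat(expr, labels, "b")
print(t_a, t_b)
```

Repeating this for every gene and every cluster gives a gene-by-cluster matrix of statistics, which is the shape of the t-statistic feature file listed above.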

### Step 3 - Feature selection

#### Inputs

  - Gene features
  
    `--features` The gene feature file incorporates the features from gene expression (from `[genefeature]`), biological pathway, and predicted protein-protein interaction data.
     
  - Gene association results
  
    `--gene_results` The gene association results produced from `[magma]`


#### Output

  - Selected features
  
    `.features` file: It contains the names of the marginally selected features. This file has no header and contains the name of one feature per row.
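
Since the `.features` file is a plain list of names, using it to subset a feature table is straightforward. The sketch below is an assumption-laden illustration: the feature names, the in-memory table, and the helper names are hypothetical, and a real feature matrix would be read from the gzipped feature file instead.

```python
import io


def read_selected_features(handle):
    """Return feature names from a headerless file with one name per row."""
    return [line.strip() for line in handle if line.strip()]


def subset_columns(header, rows, selected):
    """Keep only the selected feature columns, in the order they are listed."""
    index = {name: i for i, name in enumerate(header)}
    keep = [index[name] for name in selected]
    return [[row[i] for i in keep] for row in rows]


# Hypothetical file content and feature table.
selected = read_selected_features(io.StringIO("feat_b\nfeat_c\n"))
header = ["feat_a", "feat_b", "feat_c"]
rows = [[0.1, 0.2, 0.3], [1.0, 2.0, 3.0]]

subset = subset_columns(header, rows, selected)
print(subset)
```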
  

### Step 4 - Predict scores

#### Inputs

  - Gene location file
  
     `--gene_loc` The gene location file

  - Gene association results
  
     `--gene_results` The gene association results produced from `[magma]`
     
  - Gene features
  
     `--features` The gene feature file incorporates the features from gene expression (from `[genefeature]`), biological pathway, and predicted protein-protein interaction data.
     
  - Selected features

    `.selected_features` The selected features produced from `[feature_selection]`

  - Control features
  
     `--control_features` The list of control features

  - Chromosome index
  
     `--chromosome` This flag designates the chromosome for which to compute PoP scores.

#### Output

  - `.{chromosome}.results` file
  
    The `.results` file contains the predicted PoP scores for each gene on the designated chromosome.

  - `.{chromosome}.coefs` file
  
    The `.coefs` file contains the estimated $\hat{\beta}$ for each feature from fitting the PoPS model leaving out the designated chromosome.

In [1]:
sos run pops.ipynb -h

usage: sos run pops.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  genefeature
  magma
  feature_selection
  predict_scores

Global Workflow Options:
  --cwd VAL (as path, required)
                        the output directory for generated files
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 20 (as int)
                        Number of threads
  --container-pops 'guangyou/pops:1.0.0'
                        Software container option

Sections
  genefeature:
    Workflow Options:
      --expr-matri

## A working example

In [None]:
sos run pops.ipynb magma\
    --cwd ~/pops/code/ \
    --bfile ~/pops/data/1000G.EUR.bed\
    --gene_annot ~/pops/data/magma_0kb.genes.annot\
    --pval ~/pops/data/AFib-gwas-summary-statistics.tbl\
    --snpcol rs_dbSNP147\
    --pvalcol Pvalue\
    --container_pops ~/pops.sif\
    --out ~/pops/data/AFib

In [6]:
sos run pops.ipynb genefeature\
    --cwd ~/pops/code/ \
    --expr_matrix ~/pops/data/human_airway/Raw_exprMatrix.tsv.gz \
    --tsv ~/pops/data/human_airway/meta.tsv \
    --anno_file ~/pops/resources/gene_annot_jun10.txt \
    --symbol ~/pops/resources/ \
    --container_pops ~/pops.sif \
    --name human_airway 

INFO: Running genefeature: 
INFO: genefeature is completed.
INFO: genefeature output:   /home/gl2776/pops/code/plots/human_airway_variablegenes.pdf /home/gl2776/pops/code/plots/human_airway_pcaelbow.pdf... (10 items)
INFO: Workflow genefeature (ID=wb99e6adc1a4fdba7) is executed successfully with 1 completed step.



In [2]:
sos run pops.ipynb feature_selection\
    --cwd ~/pops/code/ \
    --features ~/pops/data/PoPS.features.txt.gz\
    --gene_results ~/pops/data/AFib.genes.out\
    --container_pops ~/pops.sif \
    --out ~/pops/data/AFib

INFO: Running feature_selection: Feature selection by the GLS model
INFO: feature_selection is completed.
INFO: feature_selection output:   /home/gl2776/pops/data/AFib.features
INFO: Workflow feature_selection (ID=w925cbf535a40c4f2) is executed successfully with 1 completed step.



In [5]:
sos run pops.ipynb predict_scores\
    --cwd ~/pops/code/ \
    --gene_loc ~/pops/data/gene_loc.txt\
    --gene_results ~/pops/data/AFib.genes.out\
    --features ~/pops/data/PoPS.features.txt.gz\
    --selected_features ~/pops/data/AFib.features\
    --control_features ~/pops/data/control.features\
    --chromosome 1\
    --container_pops ~/pops.sif \
    --out ~/pops/data/AFib

INFO: Running predict_scores: Polygenic Priority Score calculation
INFO: predict_scores is completed.
INFO: predict_scores output:   /home/gl2776/pops/data/AFib.1.results /home/gl2776/pops/data/AFib.1.coefs
INFO: Workflow predict_scores (ID=w0a50b74d94ea93e0) is executed successfully with 1 completed step.
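
The run above scores chromosome 1 only, because `--chromosome` takes a single index. One way to cover the remaining chromosomes is to generate one command per chromosome, as in the sketch below. It reuses the paths from the example and assumes autosomes 1 to 22; both the chromosome range and the helper name are assumptions, so adjust them to your data.

```python
def predict_scores_commands(chromosomes=range(1, 23)):
    """Return one `sos run` command line per chromosome (values contain no spaces)."""
    base = [
        "sos run pops.ipynb predict_scores",
        "--cwd ~/pops/code/",
        "--gene_loc ~/pops/data/gene_loc.txt",
        "--gene_results ~/pops/data/AFib.genes.out",
        "--features ~/pops/data/PoPS.features.txt.gz",
        "--selected_features ~/pops/data/AFib.features",
        "--control_features ~/pops/data/control.features",
        "--container_pops ~/pops.sif",
        "--out ~/pops/data/AFib",
    ]
    # Only the chromosome index changes between runs.
    return [" ".join(base + [f"--chromosome {c}"]) for c in chromosomes]


commands = predict_scores_commands()
for command in commands[:2]:
    print(command)
```

The printed lines can be pasted into a shell script or submitted as separate cluster jobs.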



In [None]:
[global]
# the output directory for generated files
parameter: cwd = path
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
# Software container option
parameter: container_pops = 'guangyou/pops:1.0.0'
cwd = f"{cwd:a}"

In [None]:
# Get gene-level z scores
[magma]
# Path of the plink format file set for the reference panel, including .bed/.fam/.bim
parameter: bfile = paths
# Path to MAGMA gene annotation file
parameter: gene_annot = path
# Path to summary statistics file
parameter: pval = path
# Column name of the SNP ID in the summary statistics file
parameter: snpcol = str
# Column name of the p-value in the summary statistics file
parameter: pvalcol = str
# Path prefix for output. MAGMA will append .genes.out and .genes.raw to this prefix.
parameter: out = path
input: bfile, gene_annot, pval
output: f'{out}.genes.out', f'{out}.genes.raw'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_pops, expand= "${ }", stderr = f'{cwd}/{out:b}.stderr', stdout = f'{cwd}/{out:b}.stdout' 
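    # NOTE: the line count of the summary statistics file is passed as N below,
    # whereas MAGMA's N= modifier expects the GWAS sample size.
    n=$(wc -l < ${_input[2]:r})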

    magma \
      --bfile ${_input[0]:n} \
      --gene-annot ${_input[1]:r} \
      --pval  ${_input[2]:r} use=${snpcol},${pvalcol} N=$n \
      --gene-model snp-wise=mean \
      --out ${_output[0]:nn} 
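
The `[magma]` cell above relies on SoS format modifiers to turn file names into the prefixes MAGMA expects: `:n` drops one extension, `:nn` drops two, and `:b` keeps the base name. The sketch below emulates those three with `os.path` to show what the command receives. It is an approximation for illustration only and ignores other modifiers such as the quoting done by `:r`. The relative paths are hypothetical.

```python
import os


def strip_ext(path, times=1):
    """Remove the last `times` extensions, like SoS `:n` (once) or `:nn` (twice)."""
    for _ in range(times):
        path = os.path.splitext(path)[0]
    return path


out_prefix = strip_ext("data/AFib.genes.out", 2)   # value given to --out
bfile_prefix = strip_ext("data/1000G.EUR.bed")     # value given to --bfile
log_name = os.path.basename("data/AFib")           # what `:b` keeps

print(out_prefix, bfile_prefix, log_name)
```

This is why `--bfile` can be given the `.bed` file while MAGMA still receives the shared prefix of the `.bed/.bim/.fam` trio.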

In [13]:
[genefeature]
# Path to expression matrix file
parameter: expr_matrix = path
# Path to the tsv file with cell-level metadata
parameter: tsv = path
# Path to the annotation file 
parameter: anno_file = path
# Path to the symbol file 
parameter: symbol = path
# A string to identify your analysis run
parameter: name = str
# Number of PCs applied
parameter: number_pcs = 35
# Number of features to select as top variable features
parameter: vargenes = 1500
# Value of the resolution parameter, use a value above (below) 1.0 if you want to obtain a larger (smaller) number of communities
parameter: clus_res = 0.6
input: expr_matrix, tsv, anno_file, symbol
output: f'{cwd}/plots/{name}_variablegenes.pdf',
        f'{cwd}/plots/{name}_pcaelbow.pdf',
        f'{cwd}/plots/{name}_umap_clusters.pdf',
        f'{cwd}/plots/{name}_umap_clusters_pre_def.pdf',
        f'{cwd}/plots/{name}_umap_pcs.pdf',
        f'{cwd}/plots/{name}_umap_ics.pdf',
        f'{cwd}/plots/{name}_umap_knownmarkers.pdf',
        f'{cwd}/plots/{name}_umap_degenes.pdf',
        f'{cwd}/plots/{name}_umap_degenes_pre_def.pdf',
        f'{cwd}/features/{name}_so.rds'
task: trunk_workers = 1, walltime = "10h", mem = "24G", cores = 24, tags = f'{step_name}_{name}'
R:  container=container_pops, expand= "${ }", stderr = f'{name}.stderr', stdout = f'{name}.stdout'
    library(tidyverse)
    library(data.table)
    library(BuenColors)
    library(Seurat)
    library(irlba)
    library(Matrix)
    library(future)
    library(reticulate)
    library(ggrastr)
    library(tidytext)
    library(matrixTests)
    # Need a function source file, or the code would be too long
    source("utils.R")
    # Set up parallelization
    # Remember to use htop to delete forgotten forks
    Sys.setenv(R_FUTURE_FORK_ENABLE = T)
    options(future.globals.maxSize = 6 * 2048 * 1024^2)
    # plan(strategy = "multicore", workers = 32)
    name <- "${name}"
    # Set up output directories
    dir.create("./plots/")
    dir.create("./features/")
    # Notes on data:
    # Annotations provided. There is a mild batch effect by method, but correcting it breaks the clustering.
    # So we ignore it. Looking at pre-defined clusters, we are clearly still capturing biology.
    ### Assumes a sparse dgCMatrix as input
    ### Accepts row_id_type = ENSG, ENSMUSG, human_symbol, mouse_symbol
    #------------------------------------------------LOAD AND FORMAT DATA-----------------------------------------------#
    # Read in data and annotations
    mat <- data.frame(fread(${_input[0]:r}), row.names=1)[1:10000,1:10000] %>%
      data.matrix() %>%
      Matrix(sparse = TRUE)
    # header belongs to fread(); passed to data.frame() it would become an extra column
    mat.annot <- data.frame(fread(${_input[1]:r}, header = TRUE), row.names = 1)
    colnames(mat) <-  gsub("[.]", "-", colnames(mat))

    # Convert to ENSG, drop duplicates, and fill in missing genes
    mat <- ConvertToENSGAndProcessMatrix(mat, "human_symbol", ${_input[3]:r}, ${_input[2]:r})

    # Load this in case we need it later
    keep <- read.table(${_input[2]:r}, sep = "\t", header = T, stringsAsFactors = F, col.names = c("ENSG", "symbol", "chr", "start", "end", "TSS"))

    #--------------------------------------------------COMPUTE FEATURES-------------------------------------------------#

    # Create Seurat object
    # min.features determined for each dataset
    so <- CreateSeuratObject(counts = mat, project = name, min.features = 200, meta.data = mat.annot)

    # Clean up
    rm(mat)

    # QC
    so <- subset(so, 
                 subset = nFeature_RNA > quantile(so$nFeature_RNA, 0.05) & 
                   nFeature_RNA < quantile(so$nFeature_RNA, 0.95))
    so <- NormalizeData(so, normalization.method = "LogNormalize", scale.factor = 1000000)
    so <- ScaleData(so, min.cells.to.block = 1, block.size = 500)

    # Identify variable genes
    so <- FindVariableFeatures(so, nfeatures = ${vargenes})

    # Plot variable genes with and without labels
    PlotAndSaveHVG(so, ${_output[0]:r})
    
    # Run PCA
    so <- RunPCA(so, npcs = 100)
    # Project PCA to all genes
    so <- ProjectDim(so, do.center = T)
    # Plot Elbow
    PlotAndSavePCAElbow(so, 100, ${_output[1]:r})

    # Run ICA
    so <- RunICA(so, nics = ${number_pcs})
    # Project ICA to all genes
    so <- ProjectDim(so, reduction = "ica", do.center = T)

    # Cluster cells
    so <- FindNeighbors(so, dims = 1:${number_pcs}, nn.eps = 0)
    so <- FindClusters(so, resolution = ${clus_res}, n.start = 100)

    # UMAP dim reduction
    so <- RunUMAP(so, dims = 1:${number_pcs}, min.dist = 0.4, n.epochs = 500,
                  n.neighbors = 10, learning.rate = 0.1, spread = 2)

    # Plot UMAP clusters
    PlotAndSaveUMAPClusters(so, so@meta.data$seurat_clusters, ${_output[2]:r})
    # Plot known clusters on UMAP (if applicable)
    PlotAndSaveUMAPClusters(so, so@meta.data$CellType, ${_output[3]:r})

    # Plot PCs on UMAP
    PlotAndSavePCsOnUMAP(so, ${_output[4]:r})
    # Plot ICs on UMAP
    PlotAndSaveICsOnUMAP(so, ${_output[5]:r})
    # Plot known marker genes on UMAP 
    marker_genes <- c("CCDC67", "DEUP1", "FOXN4", "CDC20B", "RERGL", "MCAM", "PDGFRB", "ACTA2", "MYL9", "ASCL3", "CFTR", "FOXJ1", "MUC5AC", "SFTPA2", "CA2", "CAV1", "ANXA3", "CAV2")
    PlotAndSaveKnownMarkerGenesOnUMAP(so, keep, marker_genes, ${_output[6]:r})
  
    # Save global features
    featurepathprefix=paste0("./features/","${name}","_")
    SaveGlobalFeatures(so, featurepathprefix)

    # Compute any cluster dependent features (DE genes, within-cluster PCs, etc.) and save them
    # Seurat clusters
    Idents(object=so) <- "seurat_clusters"
    clus <- levels(so@meta.data$seurat_clusters)
    demarkers <- WithinClusterFeatures(so, "seurat_clusters", clus, featurepathprefix)

    # Pre-defined cluster dependent features (if applicable)
    Idents(object=so) <- "CellType"
    clus <- unique(so@meta.data$CellType)
    demarkers_pre_def <- WithinClusterFeatures(so, "CellType", clus, featurepathprefix, suffix = "_pre_def")
    
    # Plot DE genes on UMAP
    PlotAndSaveDEGenesOnUMAP(so, demarkers, ${_output[7]:r}, height = 30, rank_by_tstat = TRUE)
    # Plot DE genes from pre-defined clusters on UMAP (if applicable)
    PlotAndSaveDEGenesOnUMAP(so, demarkers_pre_def, ${_output[8]:r}, height = 30, rank_by_tstat = TRUE)

    # Save Seurat object
    saveRDS(so, ${_output[9]:r})
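
The `NormalizeData` call in the cell above uses the `LogNormalize` method with `scale.factor = 1000000`: each count is divided by the cell's total count, multiplied by the scale factor, and passed through `log1p`. The sketch below reproduces that transformation for a single cell with made-up counts; it is an illustration of the formula, not a replacement for Seurat.

```python
from math import log1p


def log_normalize(counts, scale_factor=1_000_000):
    """Per-cell normalization: log1p(count / cell_total * scale_factor)."""
    total = sum(counts)
    return [log1p(c / total * scale_factor) for c in counts]


# Hypothetical raw counts for three genes in one cell.
cell = [0, 10, 90]
normalized = log_normalize(cell)
print(normalized)
```

Zero counts stay at zero, and cells with different sequencing depths become comparable because every cell is rescaled to the same total before the log transform.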

In [None]:
# Feature selection by the GLS model
[feature_selection]
# Path to the gene feature file
parameter: features = path
# This flag gives the prefix for location of the gene association results from [magma]
parameter: gene_results = path
# Path prefix for output. It will append .features to this prefix.
parameter: out = path
input: features, gene_results
output: f'{out}.features'
task: trunk_workers = 1, walltime = '10h', mem = '20G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_pops, expand= "${ }", stderr = f'{cwd}/{out:b}.stderr', stdout = f'{cwd}/{out:b}.stdout' 
    pops.feature_selection.py\
      --features ${_input[0]:r}\
      --gene_results ${_input[1]:nn}\
      --out ${_output:n}

In [None]:
# Polygenic Priority Score calculation
[predict_scores]
# Path to the gene location file 
parameter: gene_loc = paths
# This flag gives the prefix for location of the gene association results from [magma]
parameter: gene_results = path
# Path to the gene feature file
parameter: features = path
# This flag gives the prefix for the location of the list of selected features from [feature_selection]
parameter: selected_features = path
# Path to the list of control features
parameter: control_features = path
# The index of chromosome for which to compute PoP scores
parameter: chromosome = int
# Path prefix for output. It will append .coefs and .results to this prefix
parameter: out = path
input: gene_loc, gene_results, features, selected_features, control_features
output: f'{out}.{chromosome}.results',f'{out}.{chromosome}.coefs'
task: trunk_workers = 1, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_pops, expand= "${ }", stderr = f'{_output[0]:nn}.stderr', stdout = f'{_output[0]:nn}.stdout'
    pops.predict_scores.py\
      --gene_loc ${_input[0]:r}\
      --gene_results ${_input[1]:nn}\
      --features ${_input[2]:r}\
      --selected_features ${_input[3]:r}\
      --control_features ${_input[4]:r}\
      --chromosome ${chromosome}\
      --out ${_output[0]:nn}