# QC and normalization for Pseudobulk data


## Description

This notebook implements the QC and normalization procedure used in [this paper](https://www.biorxiv.org/content/10.1101/2022.11.07.515446v1). The input is a matrix of raw read counts, and the output is a matrix of log2CPM values. The QC procedure consists of the following steps:




1. Exclude low-expression genes: keep genes with at least 10 counts in >70% of donors

2. Exclude individuals with low cell counts: keep individuals with at least 10 cells in the corresponding cell type

3. Compute log2 counts per million (log2CPM) using TMM normalization factors (edgeR) and limma-voom

4. Exclude low-log2CPM genes: keep genes with log2CPM >= 2.0 in >80% of samples

5. TMM normalization
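
The fraction-based gene filter in step 1 can be illustrated with a minimal Python sketch. This is not the workflow code (which is written in R further down); the function name `keep_gene` and the toy counts are hypothetical, and the defaults mirror the thresholds of 10 counts and 70% of donors. The log2CPM filter in the workflow follows the same pattern with its own threshold and fraction.

```python
def keep_gene(counts, min_count=10, min_fraction=0.7):
    """Return True if strictly more than min_fraction of donors have >= min_count reads."""
    n_pass = sum(1 for c in counts if c >= min_count)
    return n_pass / len(counts) > min_fraction

# Hypothetical toy data: one list of raw counts per gene, one entry per donor.
toy_counts = {
    "geneA": [12, 30, 15, 11, 0, 25, 40, 18, 10, 22],  # 9 of 10 donors pass
    "geneB": [12, 30, 15, 11, 0, 2, 3, 18, 10, 22],    # 7 of 10 donors pass
    "geneC": [0, 1, 0, 3, 2, 0, 0, 1, 0, 0],           # 0 of 10 donors pass
}

kept = [gene for gene, counts in toy_counts.items() if keep_gene(counts)]
print(kept)
```

Because the comparison is strict, a gene that passes in exactly 70% of donors (`geneB` here) is dropped.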

## Input

1. Gene expression matrix of raw counts: a data table with the gene ID as the first column and one column per sample ID.
2. A two-column table recording the number of cells for each sample. It should follow the naming convention `A.nCells` for a count table named `A.count_matrix`, and be placed in the same directory as input 1.
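
As a small illustration of this naming convention, the sketch below derives the expected cell-count file from a count matrix path by swapping the final extension. The helper name `cell_count_path` is hypothetical; the workflow itself builds the same name in R by stripping the extension of the input and appending `.nCells`.

```python
from pathlib import Path

def cell_count_path(count_matrix):
    """Swap the final extension of the count matrix path for `.nCells`."""
    return Path(count_matrix).with_suffix(".nCells")

# The cell-count table is expected next to the count matrix.
expected = cell_count_path("input/gene_exp/Astro.count_matrix")
print(expected)
```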

## Output

1. A quality-controlled and normalized gene expression matrix in log2CPM, to be fed into the gene_annotation module for annotation.
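
For orientation, the sketch below shows the log2CPM transformation that limma's `voom` applies to each count: an offset of 0.5 is added to the count and 1 to the effective library size (library size times the TMM normalization factor). The function name `voom_log2cpm` is hypothetical, and the TMM factor is taken as a given input rather than estimated here.

```python
import math

def voom_log2cpm(count, lib_size, norm_factor=1.0):
    """log2 counts per million with voom's offsets (0.5 on counts, 1 on library size)."""
    effective_lib_size = lib_size * norm_factor
    return math.log2((count + 0.5) / (effective_lib_size + 1.0) * 1e6)

# Hypothetical sample with 999,999 reads: a gene with 3 reads has (3 + 0.5) CPM.
value = voom_log2cpm(3, 999_999)
print(round(value, 3))
```

On this scale, a threshold of log2CPM >= 2.0 corresponds to roughly 4 counts per million.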

## Minimal Working Example Steps

### vi. Multi-sample RNA-seq QC

Timing: <15min

Implement pseudobulk RNA-seq QC and normalization, which identifies and removes low-expression genes and low-cell-count samples from the raw count matrix.
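
The sample-level part of this QC can be sketched as follows. This is a simplified Python illustration, not the R code used by the workflow; `filter_samples` and the donor names are hypothetical, and the default of 10 cells mirrors `low_cell_count_filter_threshold`.

```python
def filter_samples(cell_counts, min_cells=10):
    """Split samples into those kept (>= min_cells cells) and those removed."""
    kept = [s for s, n in cell_counts.items() if n >= min_cells]
    removed = [s for s, n in cell_counts.items() if n < min_cells]
    return kept, removed

# Hypothetical cell counts for one cell type.
toy_cells = {"donor1": 250, "donor2": 9, "donor3": 10}
kept_samples, removed_samples = filter_samples(toy_cells)
print(f"{len(toy_cells)} input samples, {len(kept_samples)} kept, "
      f"removed: {', '.join(removed_samples)}")
```

Recording the removed samples before subsetting the count matrix keeps the report accurate.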

## Troubleshooting

| Step | Substep | Problem | Possible Reason | Solution |
|------|---------|---------|------------------|---------|
|  |  |  |  |  |




## Command interface

In [1]:
sos run pseudobulk_expression_QC_and_normalization.ipynb -h

usage: sos run pseudobulk_expression_QC_and_normalization.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  qc

Global Workflow Options:
  --phenoFile VAL (as path, required)
                        Required input is raw count table and cell table
  --cell-table VAL (as path, required)
  --cwd output (as path)
  --container ''
  --entrypoint  ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""

  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
              

In [1]:
pwd

/sc/arion/work/sunh14/git/xqtl-protocol/code/molecular_phenotypes/QC


In [3]:
sos run pseudobulk_expression_QC_and_normalization.ipynb --phenoFile  ~/snmulti_QTL/input/gene_exp/Astro.count_matrix --cwd ~/snmulti_QTL/input/gene_exp/ -n

INFO: Checking qc_1: 
HINT: Rscript SCRIPT
# Load packages
library(tidyverse)
library(edgeR)
library(limma)

# Read data and parameter
# Read data and parameters
count_table = read_delim('/hpc/users/sunh14/snmulti_QTL/input/gene_exp/Astro.count_matrix')
cellcounts = read_delim("/hpc/users/sunh14/snmulti_QTL/input/gene_exp/Astro.nCells")
low_cell_count_filter_threshold = 10
low_expr_gene_count_filter_threshold = 10
low_expr_gene_count_gene_filter_percent = 0.7 
log2cpm_gene_filter_threshold = 2.0
log2cpm_gene_filter_percent = 0.8



# Filter out samples with fewer than low_cell_count_filter_threshold cells in a cell type
sampnames = cellcounts%>%filter(ncell >= low_cell_count_filter_threshold)%>%pull(sample)
gene_name = count_table$index
count_table = count_table[,c(colnames(count_table)[1],sampnames)]

# Filter low expression genes
y <- DGEList(counts = count_table)
keep <- ((count_table[,sampnames] >= low_expr_gene_count_filter_threshold )%>%rowSums/ncol(count_table) > low_ex

## Workflow implementation

In [1]:
[global]
# Required input is raw count table and cell table
parameter: phenoFile = paths
#parameter: cell_table = paths
parameter: cwd = path("output")
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
cwd = path(f'{cwd:a}')
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
#file_inv = [item for pair in zip(phenoFile, cell_table) for item in pair]
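
To make the `entrypoint` default easier to read, the sketch below reproduces the same expression as a standalone Python function. The function name `derive_entrypoint` and the container paths are hypothetical; the regular expression and the command prefix are copied from the parameter definition above.

```python
import re

def derive_entrypoint(container):
    """Build the micromamba entrypoint from a container path, or "" if none is given."""
    if not container:
        return ""
    image_name = container.split('/')[-1]
    # Strip a trailing `_apptainer:latest`, `_docker:latest` or `.sif` to get the environment name.
    env_name = re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', image_name)
    return 'micromamba run -a "" -n' + ' ' + env_name

print(derive_entrypoint("containers/bioinfo.sif"))
print(derive_entrypoint("registry/bioinfo_docker:latest"))
```

Both hypothetical inputs resolve to the environment name `bioinfo`, so the same workflow can be run with either image type.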


In [2]:
[qc_1]
parameter: low_cell_count_filter_threshold = 10
parameter: low_expr_gene_count_filter_threshold = 10
parameter: low_expr_gene_count_gene_filter_percent = 0.7 # Keep a gene only if more than 70% of donors have at least 10 counts
parameter: log2cpm_gene_filter_threshold = 2.0
parameter: log2cpm_gene_filter_percent = 0.8 # Keep a gene only if more than 80% of donors have a log2CPM of at least 2
input: phenoFile, group_by = 1,for_each = "region"
output: f'{cwd}/{_input[0]:bnnn}.{_region}.low_expression_filtered.normalized.log2cpm.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand = "${ }", stderr = f'{_output:nnn}.stderr', stdout = f'{_output:nnn}.log',container = container, entrypoint = entrypoint
    # Load packages
    library(tidyverse)
    library(edgeR)
    library(limma)

    # Read data and parameters
    count_table = read_delim(${_input:r})
    cellcounts = read_delim("${_input:n}.nCells")
    ## Filter by brainRegionList
    BrainRegion = read_delim(${BrainRegionList:r})
    Rsamples = BrainRegion%>%filter(class == "${_region}")%>%pull(order)
    Rsamples = intersect(Rsamples,colnames(count_table) )
    cellcounts = cellcounts%>%filter(sample %in% Rsamples)
    count_table = count_table[,c(colnames(count_table)[1],Rsamples)]
    low_cell_count_filter_threshold = ${low_cell_count_filter_threshold}
    low_expr_gene_count_filter_threshold = ${low_expr_gene_count_filter_threshold}
    low_expr_gene_count_gene_filter_percent = ${low_expr_gene_count_gene_filter_percent} 
    log2cpm_gene_filter_threshold = ${log2cpm_gene_filter_threshold}
    log2cpm_gene_filter_percent = ${log2cpm_gene_filter_percent}
    
    
    # Filter out samples with fewer than low_cell_count_filter_threshold cells in a cell type
    all_samples = colnames(count_table)[-1]
    sampnames = cellcounts%>%filter(ncell >= low_cell_count_filter_threshold)%>%pull(sample)
    removed_samples = setdiff(all_samples, sampnames)
    # Report before subsetting, so that the input and removed sample counts are correct
    print(paste0("Total input samples: ", length(all_samples), ". ", length(sampnames), " samples remain after filtering out low cell count samples; ", length(removed_samples), " samples are removed. The removed samples are: ", paste(removed_samples, collapse = ", ")))
    count_table = count_table[,c(colnames(count_table)[1],sampnames)]
    # Filter low expression genes
    y <- DGEList(counts = count_table)
    # Fraction of samples (the gene ID column is not a sample) with counts at or above the threshold
    keep <- ((count_table[,sampnames] >= low_expr_gene_count_filter_threshold )%>%rowSums/length(sampnames) > low_expr_gene_count_gene_filter_percent )
    print(paste0("Total input genes: ", nrow(y$counts), ". ", sum(keep), " genes remain after filtering low count genes; ", nrow(y$counts) - sum(keep), " genes are removed."))
    y <- y[keep, , keep.lib.sizes = FALSE]
    
    # Counts per million and log2 transformation
    y <- calcNormFactors(y, method = "TMM")
    v <- voom(y, plot = FALSE)
    logcpm <- v$E
    
    # Keep genes whose log2CPM >= log2cpm_gene_filter_threshold in more than log2cpm_gene_filter_percent of samples
    keep <- ((logcpm[,sampnames] >= log2cpm_gene_filter_threshold )%>%rowSums/ncol(logcpm) > log2cpm_gene_filter_percent )
    print(paste0(sum(keep), " genes remain after filtering low log2CPM genes; ", nrow(logcpm) - sum(keep), " genes are removed."))
    logcpm <- logcpm[keep,]
    
    # Save log2CPM
    logcpm%>%as_tibble(rownames = "ID")%>%write_delim(${_output:r},"\t")    
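
The output name of `qc_1` relies on the SoS path format `:bnnn`. The sketch below emulates it in plain Python under the assumption that `b` takes the basename and each `n` strips one trailing extension; `sos_bnnn`, `qc_output` and the region label `regionX` are hypothetical names used only for illustration.

```python
import os

def sos_bnnn(path):
    """Emulate `{path:bnnn}`: take the basename, then strip up to three extensions."""
    name = os.path.basename(path)
    for _ in range(3):
        name = os.path.splitext(name)[0]
    return name

def qc_output(cwd, count_matrix, region):
    """Assemble the qc_1 output file name for one input file and one region."""
    return f"{cwd}/{sos_bnnn(count_matrix)}.{region}.low_expression_filtered.normalized.log2cpm.gct.gz"

print(qc_output("output", "input/gene_exp/Astro.count_matrix", "regionX"))
```

For an input with a single extension such as `Astro.count_matrix`, the extra `n` modifiers have no further effect and the prefix is simply `Astro`.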

In [1]:
[SE_qc_1]
# Brain Region List
parameter: BrainRegionList = path
import pandas as pd
region = pd.read_csv(BrainRegionList, index_col = False, sep = "\t")
region = list(set(region.iloc[:,0]))
parameter: celltypes = path
celltypes = pd.read_csv(celltypes, index_col = False, sep = "\t")
celltypes = list(set(celltypes.iloc[:,0]))
input: phenoFile, group_by = 1,for_each = ["region", "celltypes"]
output: f'{cwd}/{_input:bnnn}.{_celltypes}.{_region}.low_expression_filtered.normalized.log2cpm.dreamlet.rds',f'{cwd}/{_input:bnnn}.{_celltypes}.{_region}.low_expression_filtered.normalized.log2cpm.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand = "${ }", stderr = f'{_output[0]:nnn}.stderr', stdout = f'{_output[0]:nnn}.log',container = container, entrypoint = entrypoint
    # Load packages
    library(dplyr)
    library(readr)
    library(dreamlet)
    library(SingleCellExperiment)
    ## Load data
    data  = get(load("${_input}"))
    ## Subset to the desired brain region only
    ncell = data@int_colData$n_cells[data@colData%>%as_tibble%>%filter(BrainRegion2 == "${_region}")%>%pull(order)]
    data = data[,data@colData%>%as_tibble%>%filter(BrainRegion2 == "${_region}")%>%pull(order)]
    data@int_colData$n_cells = ncell

    ## Run QC and normalization with dreamlet on the selected cell type only; this is done because dreamlet breaks if one cell type fails.
    data_proc = processAssays(data, ~ 1, assays = "${_celltypes}" )
    data_df = as_tibble(data_proc@`.Data`[[1]]$E)
    data_df = cbind( "#id" =  rownames(data_proc@`.Data`[[1]]$E) ,  data_df)
    data_df%>%write_delim("${_output[1]}","\t")
    data_proc%>%saveRDS("${_output[0]}")
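
The step above loops over regions and cell types with `for_each = ["region", "celltypes"]`. Assuming SoS treats a list of loop variables as nested loops (every combination of the two), the number of substeps per input file can be sketched as below. The region and cell type labels are hypothetical, and `sorted(set(...))` is used here instead of `list(set(...))` only to make the order reproducible.

```python
from itertools import product

# Hypothetical first columns of the region and cell type tables, with duplicates.
region_column = ["regionA", "regionA", "regionB"]
celltype_column = ["Astro", "Mic", "Astro"]

regions = sorted(set(region_column))
celltypes = sorted(set(celltype_column))

# One substep (and one pair of output files) per region and cell type combination.
substeps = list(product(regions, celltypes))
for region, celltype in substeps:
    print(f"{celltype}.{region}")
```

Running one cell type per substep means that a failure in one cell type only affects the outputs for that combination.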
