# Pseudobulk mega phenotype data QC and normalization
This notebook is based on Nick's code and the normalization method described on [this page](https://bigomics.ch/blog/why-how-normalize-rna-seq-data/). It still needs to be optimized for general use.

The main idea is to merge the three snRNA pseudobulk datasets, `phils' 1st version(masasshi)`, `kelli` and `phils' 2nd version`, into a mega dataset with more samples (~800) and therefore more power for downstream QTL analysis. Each dataset is first normalized separately; the `removeBatchEffect()` function from the `limma` package is then used to remove the batch effect among the three datasets.
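
To illustrate what removing a batch effect means for a single gene, here is a simplified sketch. It assumes the simplest case, with batch as the only factor and no covariates, and it is not limma's implementation: `removeBatchEffect()` fits a linear model and its output may differ in detail. The function name `center_batches` and all values are hypothetical.

```python
from statistics import mean


def center_batches(values, batches):
    """Shift each batch so that its mean equals the average of all batch means."""
    groups = {}
    for value, batch in zip(values, batches):
        groups.setdefault(batch, []).append(value)
    batch_means = {b: mean(vs) for b, vs in groups.items()}
    grand = mean(list(batch_means.values()))
    # Subtract each batch's offset from the common mean; within-batch
    # differences between samples are left untouched.
    return [v - (batch_means[b] - grand) for v, b in zip(values, batches)]


# Hypothetical log2CPM values of one gene in six samples from three batches
gene = [5.0, 5.4, 6.1, 6.3, 4.2, 4.6]
batch = ["phil1", "phil1", "kelli", "kelli", "phil2", "phil2"]
adjusted = center_batches(gene, batch)
```

After centering, the three batches share the same mean for this gene, while the difference between two samples of the same batch is unchanged.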

## Input
The input is pseudobulk eQTL phenotype data in the form of raw count matrices. In this notebook, we use the following original phenotype files as input:
1) De Jager batch:  
`/mnt/vast/hpc/homes/al4225/pseudobulk_phil_old/cell_expr_sampleid/snuc_pseudo_bulk.{tissue}.count_matrix`
2) Kellis batch:   
`/mnt/vast/hpc/homes/al4225/project/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/cell_expr_sampleid/snuc_pseudo_bulk.{tissue}.count_matrix`
3) De Jager new batch:   
`/mnt/vast/hpc/homes/al4225/pseudobulk_phil_new/cell_expr/snuc_pseudo_bulk.{tissue}.count_matrix`

For Ast, Inh, Oli, OPC, Mic and Exc, the input is a separate raw count matrix for each cell type (the 1st column is the gene name; the 2nd column onward are sample IDs holding raw count expression values). The inputs are therefore listed in a tab-delimited txt file, with the cell type name in the 1st column, the path to phil's 1st data in the 2nd column, kelli's data in the 3rd column and phil's 2nd data in the 4th column, e.g.  
`Ast	/mnt/vast/hpc/homes/al4225/pseudobulk_phil_old/cell_expr_sampleid/snuc_pseudo_bulk.Ast.count_matrix	/mnt/vast/hpc/homes/al4225/project/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/cell_expr_sampleid/snuc_pseudo_bulk.Ast.count_matrix	/mnt/vast/hpc/homes/al4225/pseudobulk_phil_new/cell_expr/snuc_pseudo_bulk.Ast.count_matrix`
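
As an illustration of this layout, the list file could be parsed as sketched below. This is a minimal standard-library sketch, not the code used by the workflow (the `mergedata` step reads the file with pandas); the function name `parse_file_list` and the shortened paths are placeholders.

```python
import csv
import io

# Placeholder content: one line per cell type, tab-delimited,
# columns = cell type, phil's 1st data, kelli's data, phil's 2nd data.
example = (
    "Ast\tphil_v1/Ast.count_matrix\tkelli/Ast.count_matrix\tphil_v2/Ast.count_matrix\n"
    "Mic\tphil_v1/Mic.count_matrix\tkelli/Mic.count_matrix\tphil_v2/Mic.count_matrix\n"
)


def parse_file_list(handle):
    """Return {cell_type: [phil_v1_path, kelli_path, phil_v2_path]}."""
    table = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        cell_type, *paths = row
        if len(paths) != 3:
            raise ValueError(f"{cell_type}: expected 3 paths, got {len(paths)}")
        table[cell_type] = paths
    return table


file_list = parse_file_list(io.StringIO(example))
```

Checking that every line carries exactly three paths catches a malformed list file before any count matrix is loaded.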


## Steps:
- `Count Cells by Sample`: Calculate the number of cells for each sample using metadata. This helps in filtering samples based on cell count.
- `Filter Samples`: Exclude samples with fewer than 10 cells to ensure sufficient data quality and representativeness.
- `Select Samples`: Include all samples from phil's 1st data, then add the samples from kelli's data that are not already in phil's 1st data to get the merged12 data, and then add the samples from phil's 2nd data that are not already in the merged12 data.
- `Gene Filtering`: Use `filterByExpr()` to retain genes with sufficient expression across samples, improving the reliability of statistical tests.
- `Normalization`: Apply TMM normalization to adjust for composition effects, making counts comparable between samples.
- `Voom Transformation`: Transform count data to log2 counts per million (log2CPM), stabilizing variance across genes.
- `Filter by Expression`: Remove genes with mean log2CPM < 2.0 to focus on genes with appreciable expression levels.
- `Remove Batch Effect`: Use the `removeBatchEffect()` function from the `limma` package to remove the batch effect among the three datasets.
- `Quantile Normalization`: Rank each gene's values across samples and map the ranks to standard normal quantiles, so that every gene follows the same distribution.
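
The final step in the list, as implemented in the `mergedata` code of this notebook, ranks each gene's values across samples (ties receive the average rank) and maps `rank / (n + 1)` to standard normal quantiles. The sketch below reproduces that calculation for one gene using only the Python standard library; the function names and the example values are hypothetical.

```python
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the run while the next sorted value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


def inverse_normal_transform(values):
    """Map rank / (n + 1) to standard normal quantiles."""
    n = len(values)
    std_normal = NormalDist()
    return [std_normal.inv_cdf(r / (n + 1)) for r in average_ranks(values)]


# Hypothetical batch-corrected log2CPM values of one gene in five samples
gene = [5.1, 2.3, 7.8, 2.3, 6.0]
z = inverse_normal_transform(gene)
```

Dividing by `n + 1` rather than `n` keeps every probability strictly between 0 and 1, so the largest value never maps to an infinite quantile.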

## Output

The main output is a `normalized.log2cpm.tsv` file whose 1st column, `id`, holds the gene name and whose remaining columns are the sample IDs.  
**1) Normalized.log2cpm.tsv file**: `/home/al4225/pseudobulk_merge/mega_quantnorm_SM/snuc_pseudo_bulk.{tissue}.mega.normalized.log2cpm.tsv`    
**2) Raw count matrix** (all samples from phil's old version, plus the new samples from kelli, plus the new samples from phil's new version): `/home/al4225/pseudobulk_merge/cellcounts_mega/snuc_pseudo_bulk.{tissue}.mega.count_matrix`    
**3) Cell counts of each sample**: `/home/al4225/pseudobulk_merge/ncells_mega/snuc_pseudo_bulk.{tissue}.mega.nCells`    
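
The sample selection behind the merged raw count matrix can be sketched as follows: datasets are visited in priority order, and a sample is kept only the first time its ID is seen. This is an illustrative sketch, not the workflow code (which performs the same selection on column names in R); the function name `select_samples` and the sample IDs are hypothetical.

```python
def select_samples(batches):
    """batches: list of (dataset_name, sample_ids) in priority order.

    Returns {dataset_name: [sample IDs kept from that dataset]}.
    """
    seen = set()
    kept = {}
    for name, samples in batches:
        new = [s for s in samples if s not in seen]
        kept[name] = new
        seen.update(new)
    return kept


# Hypothetical sample IDs
kept = select_samples([
    ("phil_v1", ["S1", "S2", "S3"]),
    ("kelli", ["S2", "S4"]),
    ("phil_v2", ["S1", "S4", "S5"]),
])
```

Because matching is by exact ID, a sample that appears under differently formatted IDs in two datasets would be kept twice.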

## Global parameter settings

In [None]:
[global]
# It is required to input the name of the analysis
parameter: name = str
parameter: cwd = path("output")
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# For cluster jobs, number of commands to run per job
parameter: job_size = 5
# Wall clock time expected
parameter: walltime = "20h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 2

## Phenotype merging and QC
Before merging, make sure the sample names from the different datasets use the same format.
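
One way to check this before running the workflow is to flag sample IDs that do not match the expected naming convention. The sketch below is only an illustration: the regular expression, the function name `nonconforming_ids` and the sample IDs are all hypothetical and should be replaced with the convention actually used by your datasets.

```python
import re

# Hypothetical ID convention; adjust to the format your datasets use.
SAMPLE_ID_PATTERN = re.compile(r"^SM-[A-Z0-9]+$")


def nonconforming_ids(samples_by_dataset, pattern=SAMPLE_ID_PATTERN):
    """Return {dataset: [IDs not matching pattern]} for datasets with problems."""
    problems = {}
    for dataset, ids in samples_by_dataset.items():
        bad = [s for s in ids if not pattern.match(s)]
        if bad:
            problems[dataset] = bad
    return problems


# Hypothetical sample IDs read from the header of each count matrix
samples = {
    "phil_v1": ["SM-AAA1", "SM-AAA2"],
    "kelli": ["SM-AAA2", "10101327"],
    "phil_v2": ["SM-BBB1"],
}
problems = nonconforming_ids(samples)
```

An empty result means all IDs follow the same convention, so duplicated samples can be recognized by exact name during merging.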

sos run pipeline/pseudobulk_mega_expression_QC_and_normalization.ipynb mergedata \
    --name snuc_pseudo_bulk \
    --file_paths /mnt/vast/hpc/homes/al4225/pseudobulk_merge/celltypes_3.txt \
    --cwd /mnt/vast/hpc/homes/al4225/pseudobulk_merge/mega_quantnorm_SM/ \
    --container /home/al4225/project/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/container/seurat.sif \
    --mem 80G -J 50 -c /mnt/vast/hpc/csg/molecular_phenotype_calling/csg.yml -q csg

In [None]:
[mergedata]
# data: 1st: phil's old, 2nd: kelli, 3rd:phils new
import pandas as pd
# load the input txt file with celltype name and paths
parameter: file_paths = path()

#for each tissue.
input_file_paths = pd.read_csv(file_paths, sep = "\t", header=None)
print(input_file_paths)
input_inv = input_file_paths.values.tolist()
tissue_id_inv = [x[0] for x in input_inv]
input_files = [x[1:] for x in input_inv]

print("\ntissue ID List:")
print(tissue_id_inv)
print("\nFile List:")
print(input_files)
print("Length of tissue_id_inv:", len(tissue_id_inv))
print("Length of group by ts:", len(input_files[0]))
input: input_files, group_by =len(input_files[0]), group_with = "tissue_id_inv" 
output: normalized_log2cpm = f'{cwd:a}/{name}.{_tissue_id_inv}.mega.normalized.log2cpm.tsv'
#output: gene_expression_matrix = f'{cwd:a}/{name}.{_tissue_id_inv}.mega.count_matrix' # raw count matrix output
#output: cell_counts = f'{cwd:a}/{name}.{_tissue_id_inv}.mega.nCells' # cell counts of each sample output
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: expand = '${ }', stdout = f"{_output:0}.stdout", stderr = f"{_output:0}.stderr", container = container, entrypoint = entrypoint
library(Seurat)
library(edgeR)
library(limma)
library(dplyr)

# phils' version1
expr1 <- read.table(${_input[0]:r}, header = TRUE, row.names = 1, check.names=F)
cat("load phils version1:\n")
head(expr1)[,1:10]

# Kelli's data
expr2 <- read.table(${_input[1]:r}, header = TRUE, row.names = 1, check.names=F)
cat("load kelli data:\n")
head(expr2)[,1:10]

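# phil's new data (row.names = 1 so that gene names become row names, as for expr1 and expr2)
expr3 <- read.table(${_input[2]:r}, sep = "\t", header = TRUE, row.names = 1, check.names = FALSE)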
cat("load phils new data version2:\n")
head(expr3)[,1:10]

genes1 = rownames(expr1)
genes2 = rownames(expr2)
genes3 = rownames(expr3)

# Find common genes among all three sets
common_genes = Reduce(intersect, list(genes1, genes2, genes3))

# Filter the expression matrices to keep only common genes
expr1 = expr1[common_genes, ]
expr2 = expr2[common_genes, ]
expr3 = expr3[common_genes, ]

sample1 = colnames(expr1)
sample2 = colnames(expr2)
sample3 = colnames(expr3)
common_samples = intersect(sample1, sample2)
cat("common samples of expr1 and expr2:\n")
print(common_samples)
# remove the common samples in expr2
expr2 <- expr2[, !(colnames(expr2) %in% common_samples)]
expr12 <- cbind(expr1,expr2)
common_samples12_3 = intersect(colnames(expr12), sample3)
# remove the common samples in expr3
expr3 <- expr3[, !(colnames(expr3) %in% common_samples12_3)]
expr_raw <- cbind(expr12,expr3)
#write.table(expr_raw, file = "${_output['gene_expression_matrix']}", sep = "\t", row.names = TRUE, quote = FALSE, col.names = TRUE)

# build the per-sample ncell table; note that `ncell` here is the column sum of the raw count matrix
sample_ids <- colnames(expr_raw)
cell_counts <- colSums(expr_raw)
cellcounts <- data.frame(sampleid = sample_ids, ncell = cell_counts)
head(cellcounts)
#write.table(cellcounts, file = "${_output['cell_counts']}", sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)

# filter out samples with fewer than 10 cells in this cell type (implemented as a column sum > 9)
col_sums <- colSums(expr1, na.rm = TRUE)
cat("Original data frame dimensions of expr1:", paste(dim(expr1), collapse = " x "), "\n")
keep_cols <- names(col_sums)[col_sums > 9]
expr1 <- expr1[keep_cols]

col_sums <- colSums(expr2, na.rm = TRUE)
keep_cols <- names(col_sums)[col_sums > 9]
expr2 <- expr2[keep_cols]

col_sums <- colSums(expr3, na.rm = TRUE)
keep_cols <- names(col_sums)[col_sums > 9]
expr3 <- expr3[keep_cols]

y1 <- DGEList(counts = expr1)
keep <- filterByExpr(y1)
y1 <- y1[keep,,keep.lib.sizes=F]
# TMM normalization factors, then voom to get log2 counts per million
y1 <- calcNormFactors(y1, method = "TMM")
v1 <- voom(y1, plot=F)
logcpm1 <- v1$E

y2 <- DGEList(counts = expr2)
keep <- filterByExpr(y2)
y2 <- y2[keep,,keep.lib.sizes=F]
# TMM normalization factors, then voom to get log2 counts per million
y2 <- calcNormFactors(y2, method = "TMM")
v2 <- voom(y2, plot=F)
logcpm2 <- v2$E

y3 <- DGEList(counts = expr3)
keep <- filterByExpr(y3)
y3 <- y3[keep,,keep.lib.sizes=F]
# TMM normalization factors, then voom to get log2 counts per million
y3 <- calcNormFactors(y3, method = "TMM")
v3 <- voom(y3, plot=F)
logcpm3 <- v3$E

genes1 = rownames(logcpm1)
genes2 = rownames(logcpm2)
genes3 = rownames(logcpm3)

# Find common genes among all three sets
common_genes = Reduce(intersect, list(genes1, genes2, genes3))

# Filter the expression matrices to keep only common genes
logcpm1 = logcpm1[common_genes, ]
logcpm2 = logcpm2[common_genes, ]
logcpm3 = logcpm3[common_genes, ]

batch <- c(rep("Batch_phil1", ncol(logcpm1)), rep("Batch_kelli", ncol(logcpm2)), rep("Batch_phil2", ncol(logcpm3)))

logcpm <- cbind(logcpm1, logcpm2,logcpm3)
# remove genes if mean log2CPM < 2.0
mean_logcpm <- apply(logcpm, 1, mean)
logcpm <- logcpm[mean_logcpm > 2.0,]

# remove batch effect; removeBatchEffect() returns a numeric matrix
# with gene IDs as row names and sample IDs as column names
logcpm <- removeBatchEffect(logcpm, batch = batch)

# quantile normalization: rank each gene across samples, then map ranks to standard normal quantiles
logcpm <- t(apply(logcpm, 1, rank, ties.method = "average"))
logcpm <- qnorm(logcpm / (ncol(logcpm) + 1))

# export
df <- data.frame(id = rownames(logcpm), logcpm, check.names = F)
write.table(df, file="${_output['normalized_log2cpm']}", sep="\t", quote = F, row.names = F)

cat("The mega normalized pseudo_bulk_eqtl tsv file has been saved.\n")