# Bulk RNA-seq counts normalization



## Description





The normalization step follows the GTEx pipeline. Genes are first filtered to keep those that pass both expression thresholds in at least 20% of samples: a TPM of at least 0.1 and a read count of at least 6. The filtered data is then normalized using the Trimmed Mean of M-values (TMM) method.
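The filter described above can be sketched with pandas. This is a minimal illustration, not the pipeline's own code: the function name `expression_filter_mask` and the toy matrices are hypothetical, and the use of `>=` comparisons and of a logical AND between the two criteria is an assumption based on the GTEx convention.

```python
import pandas as pd


def expression_filter_mask(tpm_df, counts_df, tpm_threshold=0.1,
                           count_threshold=6, sample_frac_threshold=0.2):
    """Return a boolean Series marking genes that pass both thresholds.

    Hypothetical helper; assumes genes are rows and samples are columns.
    """
    n_samples = tpm_df.shape[1]
    min_samples = n_samples * sample_frac_threshold
    # Count, per gene, the samples that reach each threshold.
    tpm_ok = (tpm_df >= tpm_threshold).sum(axis=1) >= min_samples
    counts_ok = (counts_df >= count_threshold).sum(axis=1) >= min_samples
    return tpm_ok & counts_ok


# Toy data: 3 genes x 5 samples (made-up values).
samples = [f"S{i}" for i in range(1, 6)]
tpm = pd.DataFrame(
    [[0.5, 0.0, 0.0, 0.0, 0.0],   # geneA: TPM passes in 1/5 samples (20%)
     [0.0, 0.0, 0.0, 0.0, 0.0],   # geneB: never expressed
     [2.0, 3.0, 1.5, 0.0, 0.0]],  # geneC: TPM passes in 3/5 samples
    index=["geneA", "geneB", "geneC"], columns=samples)
counts = pd.DataFrame(
    [[10, 0, 0, 0, 0],            # geneA: count passes in 1/5 samples
     [0, 0, 0, 0, 0],
     [3, 4, 2, 0, 0]],            # geneC: counts never reach 6
    index=tpm.index, columns=samples)

mask = expression_filter_mask(tpm, counts)
kept_genes = list(mask[mask].index)
print(kept_genes)
```

In this toy example only `geneA` survives: `geneC` passes the TPM criterion but fails the read count criterion.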



## Input

1. TPM matrix and read count matrix in RNA-SeQC format
    - The first two rows should be comment lines with a `#` prefix.
    - The matrix should be tab-delimited.
    - The matrix file names should end with the `gct` suffix (or `gct.gz`, as in the example below).
    - These requirements are satisfied if the inputs are outputs from the [`bulk_expression_QC` pipeline](https://cumc.github.io/xqtl-pipeline/code/molecular_phenotypes/QC/bulk_expression_QC.html).
2. GTF for collapsed gene model
    - The gene names must match those in the GCT matrices exactly (e.g. `ENSG00000000003` vs. `ENSG00000000003.1` will not work).
    - Chromosome names must have the `chr` prefix (this could be made an option in the pipeline, but currently the `chr` prefix convention is assumed).
3. Metadata to match sample names between the expression data and genotype files
    - Required input
    - Tab-delimited with a header
    - Exactly 2 columns: the first, `sample_id`, is the sample name in the expression data; the second, `participant_id`, is the sample name in the genotype data
    - **Must contain every sample name in the expression matrices, even if it does not exist in the genotype data**
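
The input requirements above can be checked before launching the workflow. The following is a rough sketch under stated assumptions: the helper names `read_gct` and `check_sample_lookup` are hypothetical, the toy tables are made up, and the presence of a `Description` column after the gene ID column is assumed from the usual GCT layout.

```python
import io
import pandas as pd


def read_gct(handle):
    """Read a GCT-style table: skip the two leading header lines, tab-delimited."""
    df = pd.read_csv(handle, sep="\t", skiprows=2, index_col=0)
    # Assumption: an optional 'Description' column follows the gene ID column.
    return df.drop(columns=["Description"], errors="ignore")


def check_sample_lookup(expr_df, lookup_df):
    """Return (sample IDs missing from the lookup, duplicated participant IDs)."""
    missing = sorted(set(expr_df.columns) - set(lookup_df["sample_id"]))
    dup_mask = lookup_df.duplicated(subset=["participant_id"])
    duplicated = sorted(lookup_df.loc[dup_mask, "participant_id"].unique())
    return missing, duplicated


# Toy inputs (made-up values) held in memory instead of files.
gct_text = (
    "#1.2\n"
    "#2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "ENSG00000000003\tTSPAN6\t10\t0\t7\n"
    "ENSG00000000005\tTNMD\t0\t2\t1\n"
)
lookup_text = "sample_id\tparticipant_id\nS1\tP1\nS2\tP1\n"

counts = read_gct(io.StringIO(gct_text))
lookup = pd.read_csv(io.StringIO(lookup_text), sep="\t")
missing, duplicated = check_sample_lookup(counts, lookup)
print("missing from lookup:", missing)
print("duplicated participants:", duplicated)
```

Here sample `S3` is reported as missing from the lookup table, and participant `P1` is reported as duplicated, which the workflow's own duplicate check would also reject.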

## Output

Normalized expression file in `bed` format.

## Minimal Working Example Steps

### vii. Multi-sample read count normalization

Timing <10min

TMM normalization of read counts.
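
To make the idea behind TMM concrete, here is a simplified NumPy sketch. It is not the implementation used by this pipeline or by edgeR: the function names `tmm_factor` and `tmm_factors` are hypothetical, the first column is simply taken as the reference sample (edgeR selects a reference from upper-quartile statistics), and the trimming fractions of 30% for log ratios and 5% for abundances are assumed defaults.

```python
import numpy as np


def tmm_factor(obs, ref, logratio_trim=0.3, sum_trim=0.05):
    """Scaling factor of one sample relative to a reference sample."""
    obs = np.asarray(obs, dtype=float)
    ref = np.asarray(ref, dtype=float)
    n_obs, n_ref = obs.sum(), ref.sum()               # library sizes

    # Only genes observed in both samples have finite log ratios.
    keep = (obs > 0) & (ref > 0)
    obs, ref = obs[keep], ref[keep]

    m = np.log2((obs / n_obs) / (ref / n_ref))        # log fold change (M)
    a = 0.5 * np.log2((obs / n_obs) * (ref / n_ref))  # mean log abundance (A)
    # Approximate variance of M; its inverse is used as the weight.
    var = (n_obs - obs) / (n_obs * obs) + (n_ref - ref) / (n_ref * ref)

    # Trim the extremes of M and A by rank (0-based ranks).
    n = m.size
    cut_m = int(np.floor(n * logratio_trim))
    cut_a = int(np.floor(n * sum_trim))
    rank_m = np.argsort(np.argsort(m))
    rank_a = np.argsort(np.argsort(a))
    core = ((rank_m >= cut_m) & (rank_m < n - cut_m) &
            (rank_a >= cut_a) & (rank_a < n - cut_a))

    weighted_mean_m = np.sum(m[core] / var[core]) / np.sum(1.0 / var[core])
    return 2.0 ** weighted_mean_m


def tmm_factors(counts, ref_col=0):
    """Factors for a genes x samples matrix, rescaled to geometric mean 1."""
    counts = np.asarray(counts, dtype=float)
    factors = np.array([tmm_factor(counts[:, k], counts[:, ref_col])
                        for k in range(counts.shape[1])])
    return factors / np.exp(np.mean(np.log(factors)))


# Toy data: 100 genes, 2 samples. Sample 2 equals sample 1 except that
# five genes are inflated 50-fold, which distorts its library size.
ref = np.arange(1, 101) * 10.0
obs = ref.copy()
obs[:5] *= 50
counts = np.column_stack([ref, obs])

factors = tmm_factors(counts)
lib_sizes = counts.sum(axis=0)
cpm = counts / (lib_sizes * factors) * 1e6  # counts per million on effective library sizes
print(factors)
```

Because the five inflated genes fall in the trimmed tail, the factors are estimated from the unchanged genes only, and those genes end up with the same normalized CPM in both samples.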

**Note:** We recommend keeping `--count-threshold` at its default value of 6. It is set to 1 below only for the MWE.

In [14]:
!sos run bulk_expression_normalization.ipynb normalize \
    --cwd ../../output_test/normalize \
    --tpm-gct ../../output_test/rnaseqc_qc/PCC_sample_list_subset.rnaseqc.low_expression_filtered.outlier_removed.tpm.gct.gz \
    --counts-gct ../../output_test/rnaseqc_qc/PCC_sample_list_subset.rnaseqc.low_expression_filtered.outlier_removed.geneCount.gct.gz \
    --annotation-gtf ../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --sample-participant-lookup ../../PCC_sample_subset_map_test \
    --count-threshold 1 \
    --tpm_threshold 0.1 \
    --sample_frac_threshold 0.2 \
    --normalization_method tmm  \
    --container oras://ghcr.io/cumc/rna_quantification_apptainer:latest \
    -s force -c ../csg.yml -q neurology

INFO: Running normalize: 
INFO: t903fc22d31109304 submitted to neurology with job id Your job 2487615 ("job_t903fc22d31109304") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: normalize output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/normalize/PCC_sample_list_subset.rnaseqc.low_expression_filtered.outlier_removed.tmm.expression.bed.gz
INFO: Workflow normalize (ID=wc742b6df800ede96) is executed successfully with 1 completed step and 1 completed task.


## Troubleshooting

| Step | Substep | Problem | Possible Reason | Solution |
|------|---------|---------|------------------|---------|
|  |  |  |  |  |




## Command interface

In [2]:
!sos run bulk_expression_normalization.ipynb -h

usage: sos run bulk_expression_normalization.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  normalize

Global Workflow Options:
  --cwd output (as path)
                        Work directory & output directory
  --counts-gct VAL (as path, required)
                        gene count table
  --tpm-gct VAL (as path, required)
                        gene TPM table
  --annotation-gtf VAL (as path, required)
                        gene gtf annotation table
  --sample-participant-lookup VAL (as path, required)
                        A file to map sample ID from expression to genotype,must
                        contain two columns, sample_id and participant_id,
  

In [9]:
[global]
# Work directory & output directory
parameter: cwd = path("output")
#  gene count table
parameter: counts_gct = path
#  gene TPM table
parameter: tpm_gct = path
#  gene gtf annotation table
parameter: annotation_gtf = path
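# A file to map sample IDs from expression to genotype. It must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path
# TPM threshold a gene must reach to count as expressed in a sample
parameter: tpm_threshold = 0.1
# Read count threshold a gene must reach to count as expressed in a sample
parameter: count_threshold = 6
# Fraction of samples in which a gene must pass the TPM and read count thresholds
parameter: sample_frac_threshold = 0.2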
# Normalization method: TMM (tmm) or quantile normalization (qn)
parameter: normalization_method = 'tmm'
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""

In [2]:
[normalize]
# Inputs, in order: TPM GCT matrix, read count GCT matrix, collapsed gene model GTF, and sample-to-participant lookup table.
input: tpm_gct, counts_gct, annotation_gtf, sample_participant_lookup
output: f'{cwd:a}/{_input[0]:bnnn}.{normalization_method}.expression.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
python: expand = "${ }", stderr = f'{_output[0]:nn}.stderr', stdout = f'{_output[0]:nn}.stdout',container = container, entrypoint = entrypoint
    import pandas as pd
    # Read the sample map file (tab-delimited, with sample_id and participant_id columns)
    sample_map = pd.read_table("${_input[3]}")
    # Rows whose participant_id repeats an earlier row
    duplicated = sample_map.loc[sample_map.duplicated(subset=['participant_id'])]

    if duplicated.shape[0] > 0:
        print("Duplicate samples found. Please remove duplicates from ${_input[3]} before normalizing.")
        print("Duplicates:")
        print(duplicated)
        raise ValueError("Duplicate participant_id entries found in ${_input[3]}")
    else:
        print("No duplicates found. Proceeding with normalization...")

bash: expand = "${ }", stderr = f'{_output[0]:nn}.stderr', stdout = f'{_output[0]:nn}.stdout',container = container, entrypoint = entrypoint
    for i in {1..22} X Y MT; do echo chr$i; done > ${_output[0]:bnnn}.vcf_chr_list
    eqtl_prepare_expression.py ${_input[0]} ${_input[1]} ${_input[2]} \
        ${_input[3]} ${_output[0]:nnn} \
        --chrs  ${_output[0]:bnnn}.vcf_chr_list \
        --tpm_threshold ${tpm_threshold} \
        --count_threshold ${count_threshold} \
        --sample_frac_threshold ${sample_frac_threshold} \
        --normalization_method ${normalization_method} && \
    rm -f ${_output[0]:bnnn}.vcf_chr_list