# Molecular phenotype normalization
This notebook implements the normalization step of the data processing pipeline for the xqtl workflow. It generates:
1. The whole molecular phenotype in bed format, normalized.


### Input
The inputs for this workflow are:
1. 1 complete gene count table
2. 1 complete gene TPM table
3. 1 gtf table after the collapse_gene annotation
4. 1 index file cross-referencing the sample names in the expression and genotype data
5. 1 vcf_chrom_list, provided by default; it should be a single column of chromosomes chr1 to chr22, without a header
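
The chromosome list described in item 5 is simple enough to regenerate by hand. A minimal sketch, assuming the genotype data uses `chr`-prefixed autosome names; the function name `write_chrom_list` and its `prefix` argument are illustrative and not part of the pipeline:

```python
from pathlib import Path


def write_chrom_list(out_path, prefix="chr"):
    """Write chromosomes 1 to 22, one per line, with no header row."""
    chroms = [f"{prefix}{i}" for i in range(1, 23)]
    Path(out_path).write_text("\n".join(chroms) + "\n")
    return chroms
```

Calling `write_chrom_list("vcf_chrom_list")` would produce a file with the name used in the MWE command; pass `prefix=""` if the genotype data names chromosomes `1` to `22` instead.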


**Violating any of the following requirements will cause an error and stop the process unless noted otherwise.**

Requirements for inputs 1 and 2:
1. Tab-separated ("\t")
2. Have 2 unneeded rows above the column names
3. File name ends with gct
4. Can only contain samples listed in input 4
5. Requirements 1, 2 and 3 are already satisfied when using the output of bulk_expression_qc
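
The layout requirements above can be checked before launching the workflow. The sketch below is a hypothetical helper, not part of the pipeline. It assumes the first two columns of the header row are gene identifier columns (as in the standard GCT layout) and that every remaining column is a sample:

```python
def read_gct_samples(gct_path, n_id_cols=2):
    """Check the GCT layout and return the sample names from the header row.

    n_id_cols is an assumption: the number of leading non-sample columns.
    """
    if not str(gct_path).endswith("gct"):
        raise ValueError("file name must end with 'gct'")
    with open(gct_path) as fh:
        # Two unneeded rows, then the column-name row.
        first_three = [next(fh, None) for _ in range(3)]
    if first_three[2] is None:
        raise ValueError("expected 2 leading rows followed by a header row")
    header = first_three[2].rstrip("\r\n").split("\t")
    if len(header) <= n_id_cols:
        raise ValueError("header row is not tab-separated or lists no samples")
    return header[n_id_cols:]
```

The returned sample names can then be compared against the first column of input 4.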

Requirements for inputs 3 and 5:
1. Input 3 must use the same gene name format as inputs 1 & 2, i.e. ENSG00000000003 and ENSG00000000003.1 cannot coexist.

    Violation will cause empty output.
    
2. Input 3 must use the same chromosome name format as input 5.
    
    Violation will cause empty output.
    
3. Input 5 should use the same chromosome name format as the genotype data to be used.
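
Because a format mismatch produces empty output rather than an error, it can help to compare naming styles up front. The helpers below are a hypothetical sketch. They assume Ensembl-style gene IDs where a version is a numeric suffix after a dot, and chromosome names that either do or do not start with `chr`:

```python
import re

# Assumption: a versioned ID looks like ENSG00000000003.1
VERSIONED_ID = re.compile(r"^ENSG\d+\.\d+$")


def _style(flags, if_true, if_false):
    """Summarise a collection of booleans as one style label."""
    flags = set(flags)
    if not flags:
        return "empty"
    if flags == {True}:
        return if_true
    if flags == {False}:
        return if_false
    return "mixed"


def id_style(gene_ids):
    """Return 'versioned', 'unversioned', 'mixed' or 'empty'."""
    return _style((bool(VERSIONED_ID.match(g)) for g in gene_ids),
                  "versioned", "unversioned")


def chrom_style(chroms):
    """Return 'chr-prefixed', 'bare', 'mixed' or 'empty'."""
    return _style((c.startswith("chr") for c in chroms),
                  "chr-prefixed", "bare")
```

Inputs are compatible when `id_style` gives the same label for the gene IDs of the gct files and the gtf, and `chrom_style` gives the same label for the gtf chromosomes and the vcf_chrom_list.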

Requirements for input 4:
1. Tab-separated ("\t")
2. Must contain all the sample names in inputs 1 & 2
3. Only 2 columns
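
A quick way to confirm these three points is sketched below. `check_lookup` is a hypothetical helper. It assumes the expression sample names sit in the first column of the index file and that you already have the sample names from the expression tables as a list:

```python
def check_lookup(lookup_path, expression_samples):
    """Return a list of problems found in the index file (empty if none)."""
    problems = []
    with open(lookup_path) as fh:
        rows = [line.split("\t") for line in fh.read().splitlines() if line]
    # Every non-empty line must have exactly two tab-separated fields.
    bad_lines = [i + 1 for i, row in enumerate(rows) if len(row) != 2]
    if bad_lines:
        problems.append(f"lines without exactly 2 columns: {bad_lines}")
    # Assumption: column 1 holds the sample names used in the expression data.
    listed = {row[0] for row in rows}
    missing = sorted(set(expression_samples) - listed)
    if missing:
        problems.append(f"expression samples not listed: {missing}")
    return problems
```

An empty return value means the file passed all checks; a header row such as `sample_id` / `participant_id` is tolerated because it simply adds one extra entry to the set of listed names.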
    

### Output
For each collection, there is 1 output:
1. A normalized molecular phenotype bed file

This step takes care of the purple part of the following diagram.


### MWE:
The MWE of inputs 1 and 2 can be generated by the MWE of bulk_expression_qc; that of input 5 can be generated by reference_data_processing.
The MWE of inputs 3 and 4 can be found here: https://drive.google.com/drive/u/0/folders/1Rv2bWHBbX_tastTh49ToYVDMV6rFP5Wk

In [None]:
sos run normalization.ipynb Normalization \
--tpm_gct "mwe.low_expression_filtered.outlier_removed.processed.tpm.gct"      \
--counts_gct "mwe.low_expression_filtered.outlier_removed.processed.geneCount.gct"      \
--vcf_chr_list "vcf_chrom_list"   \
--sample_participant_lookup "sampleSheetAfterQc.txt" \
--container ./rna_quantification.sif --wd ./      \
--annotation_gtf Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf  &

In [2]:
[global]
import os
# Work directory & output directory
parameter: wd = path
# Name for the analysis output
parameter: name = 'ROSMAP'
# Output of bulk_expression_qc
## A gene count table
parameter: counts_gct = path
## A gene TPM table
parameter: tpm_gct = path
# A gene gtf annotation table
parameter: annotation_gtf = path
# A file listing the chromosomes used in follow-up analysis
parameter: vcf_chr_list = path("./")
# A file mapping sample IDs from expression to genotype. It must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
# Software container option
parameter: container = ""

In [None]:
[Normalization]
# Inputs: the gene count table, the gene TPM table, and the gtf annotation table
input: counts_gct,tpm_gct, annotation_gtf
output: f'{wd}/{_input[0]:bnnnnn}.expression.bed.gz',  # For factor
task: trunk_workers = 1, trunk_size = 1, walltime = '4h',  mem = '20G', tags = f'{step_name}_{_output[0]:bn}'  
bash: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    eqtl_prepare_expression.py ${tpm_gct} ${counts_gct} ${_input[2]} \
    ${sample_participant_lookup} ${ vcf_chr_list if vcf_chr_list != path("./") else ""} ${_output[0]:bnnn} \
    --tpm_threshold 0.1 \
    --count_threshold 1 \
    --sample_frac_threshold 0.2 \
    --normalization_method tmm
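
The three threshold flags above appear to control a low-expression filter applied before normalization. The sketch below shows one plausible reading, which this notebook does not confirm: a gene is kept when at least `sample_frac_threshold` of the samples have TPM at or above `tpm_threshold` and at least the same fraction have read counts at or above `count_threshold`. The function name is hypothetical, and the defaults simply mirror the values passed in the command:

```python
def passes_expression_filter(tpm, counts, tpm_threshold=0.1,
                             count_threshold=1, sample_frac_threshold=0.2):
    """Decide whether one gene is kept, given its per-sample TPM and counts.

    Assumed rule (not confirmed by the pipeline source): both the TPM and
    the count criterion must hold in at least the given fraction of samples.
    """
    min_samples = sample_frac_threshold * len(tpm)
    n_tpm_ok = sum(value >= tpm_threshold for value in tpm)
    n_counts_ok = sum(value >= count_threshold for value in counts)
    return n_tpm_ok >= min_samples and n_counts_ok >= min_samples
```

Under this reading, with 10 samples and the thresholds above, a gene needs at least 2 samples meeting both criteria to survive the filter.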