# Phenotype data formatting

This module implements a collection of workflows used to format molecular phenotype data.

**FIXME: this entire pipeline needs to be improved**

## Input
The input for this workflow is the data for one molecular phenotype, consisting of:

1. a complete, residualized (covariates regressed out) molecular phenotype file
2. a region list

These inputs are outputs of upstream pipelines such as `covariate_preprocessing` and `gene_annotation`.

## Output

1. A list of phenotype files (bed + index), one per chromosome, annotated with genomic coordinates, suitable for TensorQTL analysis.
2. A list of phenotype files (bed + index), one per gene, annotated with genomic coordinates, suitable for fine-mapping.
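
To make the output format concrete, here is a minimal sketch of inspecting one of the bed files with the Python standard library. It relies on bgzip output being readable as ordinary gzip. The toy file, its column names (`#chr`, `start`, `end`, `ID`, then one column per sample) and the helper `preview_bed` are assumptions made for illustration, not part of the pipeline.

```python
import gzip
import os
import tempfile


def preview_bed(path, n_rows=5):
    """Return (header, rows) from a gzip- or bgzip-compressed BED file."""
    with gzip.open(path, "rt") as handle:
        header = handle.readline().rstrip("\n").split("\t")
        rows = []
        for line in handle:
            if len(rows) >= n_rows:
                break
            rows.append(line.rstrip("\n").split("\t"))
    return header, rows


# Toy file standing in for a per-chromosome output (hypothetical values).
toy = (
    "#chr\tstart\tend\tID\tsample1\tsample2\n"
    "chr21\t100\t101\tgeneA\t0.12\t-0.40\n"
    "chr21\t500\t501\tgeneB\t1.05\t0.33\n"
)
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy.chr21.bed.gz")
with gzip.open(toy_path, "wt") as handle:
    handle.write(toy)

header, rows = preview_bed(toy_path)
print(header[:4])  # coordinate and ID columns
print(len(rows))   # number of phenotype rows read
```

The first four columns carry the genomic annotation and the remaining columns hold one value per sample.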

## Minimal working example
An MWE is uploaded to [Google Drive](https://drive.google.com/drive/folders/1yjTwoO0DYGi-J9ouMsh9fHKfDmsXJ_4I?usp=sharing).
The singularity image (sif) for running this MWE is uploaded to [Google Drive](https://drive.google.com/drive/folders/1mLOS3AVQM8yTaWtCbO8Q3xla98Nr5bZQ).

**FIXME: these have to be updated to synapse links. Examples below also need updates**


In [None]:
sos run pipeline/phenotype_formatting.ipynb partition_by_chrom \
    --cwd output  \
    --phenoFile MWE.log2cpm.mol_phe.bed.gz \
    --region-list ROSMAP_PCC.methylation.M.renamed.region_list \
    --container containers/rna_quantification.sif

In [None]:
sos run pipeline/phenotype_formatting.ipynb partition_by_chrom \
    --cwd mQTL_perchrom  \
    --phenoFile ROSMAP_arrayMethylation_covariates.sesame.methyl.beta.sample_matched.bed_BMIQ.bed.filter_na.bed.softImputed.bed.gz \
    --region-list ROSMAP_PCC.methylation.M.renamed.region_list \
    --container containers/rna_quantification.sif

In [2]:
[global]
import os
# Work directory & output directory
parameter: cwd = path("output")
# Container image used to run the commands, e.g. a singularity .sif file
parameter: container = ''
parameter: entrypoint={('micromamba run -n' + ' ' + container.split('/')[-1][:-4]) if container.endswith('.sif') else f''}
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
# Path to the input molecular phenotype data.
parameter: phenoFile = paths
# name for the analysis output
parameter: name= f'{phenoFile:bn}'
# Whether phenotypes in the input data are identified by gene_id or gene_name. The default is gene_id; set it to gene_name otherwise
parameter: phenotype_id_type = 'gene_id'
gene_name_as_phenotype_id = "gene_name" == phenotype_id_type

## Region List generation

Partitioning the data by gene requires a region list file that:

1. has 5 columns: chr, start, end, gene_id, gene_name
2. contains the same genes as the bed file, or a subset of them

Input:

1. The gtf file used to generate the bed file
2. A phenotype bed file, which must have a gene_id column giving the gene identifiers
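
The two requirements on the region list can be checked before running the workflow. The sketch below is one way to do so. The function name `check_region_list` and the toy data are hypothetical. It assumes a whitespace-delimited region list and skips blank lines and lines starting with `#`, as the workflow steps do.

```python
def check_region_list(region_lines, bed_gene_ids):
    """Validate region-list rows against the two requirements above.

    region_lines: iterable of whitespace-delimited text lines
    bed_gene_ids: collection of gene IDs present in the phenotype bed file
    Returns the parsed rows; raises ValueError on the first violation.
    """
    rows = []
    for line in region_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment/header lines
        fields = line.split()
        if len(fields) != 5:
            raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
        rows.append(fields)
    # Column 4 (index 3) is gene_id; every listed gene must exist in the bed file
    missing = sorted({r[3] for r in rows} - set(bed_gene_ids))
    if missing:
        raise ValueError(f"genes absent from the bed file: {missing}")
    return rows


# Toy inputs (hypothetical identifiers).
toy_regions = [
    "#chr start end gene_id gene_name",
    "chr1 1000 2000 ENSG_A GENEA",
    "chr2 3000 4000 ENSG_B GENEB",
]
toy_bed_genes = {"ENSG_A", "ENSG_B", "ENSG_C"}

checked = check_region_list(toy_regions, toy_bed_genes)
print(len(checked), "regions passed the check")
```

A region list that names a gene missing from the bed file, or that has a row without exactly 5 fields, raises `ValueError` instead of silently producing an empty per-gene file.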

## Processing of the molecular phenotype file
This workflow produces a bed + tabix index file for all the molecular phenotype data included in the region list, to feed into downstream analyses.
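
As a rough illustration of how the per-chromosome partition is planned, the sketch below derives the unique chromosomes from a region list and the matching output paths, following the `{cwd}/{name}.{chrom}.bed.gz` pattern used by the `partition_by_chrom` step. The helper names and toy values are hypothetical, and the chromosomes are sorted here only to make the result deterministic.

```python
def unique_chromosomes(region_lines):
    """Return the sorted unique values of the first column of a region list."""
    regions = [
        line.strip().split()
        for line in region_lines
        if line.strip() and not line.strip().startswith("#")
    ]
    return sorted({fields[0] for fields in regions})


def per_chrom_outputs(cwd, name, chroms):
    """Map each chromosome to the path of its partitioned bed file."""
    return {chrom: f"{cwd}/{name}.{chrom}.bed.gz" for chrom in chroms}


# Toy region list (hypothetical coordinates and identifiers).
toy_region_list = [
    "#chr start end gene_id",
    "chr1 1000 2000 ENSG_A",
    "chr2 3000 4000 ENSG_B",
    "chr1 5000 6000 ENSG_C",
]

chroms = unique_chromosomes(toy_region_list)
outputs = per_chrom_outputs("output", "MWE", chroms)
for chrom, path in outputs.items():
    print(chrom, path)
```

Each listed chromosome yields exactly one output file, no matter how many regions fall on it.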

In [None]:
[partition_by_chrom_1]
# An index text file with 4 columns specifying the chr, start, end and names of regions to analyze; it can be made on the fly with <(zcat {phenoFile}.bed.gz | cut -f 1,2,3,4 )
parameter: region_list = path
regions = [x.strip().split() for x in open(region_list).readlines() if x.strip() and not x.strip().startswith('#')]
# Get the unique chromosomes that have regions to be analyzed.
def extract(lst):
    return [item[0] for item in lst]
chrom = list(set(extract(regions)))
# Path to the input molecular phenotype data.
input: phenoFile ,for_each = "chrom"
output: f'{cwd}/{name}.{_chrom}.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
bash: expand = "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout',container = container, entrypoint=entrypoint
    zcat $[_input] | head -1 > $[_output:n]
    tabix $[_input] $[_chrom] >> $[_output:n] 
    bgzip -f $[_output:n]
    tabix -p bed $[_output] -f
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `zcat $i | grep -v "##"   | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

In [None]:
[partition_by_chrom_2]
# Collect all per-chromosome files produced by the previous step
input: group_by = "all"
output: f'{cwd}/{name}.per_chrom.recipe'
import pandas as pd
# File names end in .<chrom>.bed.gz, so the chromosome is the third field from the end; the "chr" prefix is dropped
chrom = [str(x).split(".")[-3].replace("chr","") for x in _input]
chrom_df = pd.DataFrame({"#id" : chrom ,"#dir" : _input})
chrom_df.to_csv(_output,index = False,sep = "\t")

In [None]:
[partition_by_gene_1]
# An index text file with 5 columns specifying the chr, start, end and names of regions to analyze
parameter: region_list = path
regions = [x.strip().split() for x in open(region_list).readlines() if x.strip() and not x.strip().startswith('#')]
# Get the unique chromosomes that have regions to be analyzed.
def extract(lst):
    return [item[0] for item in lst]
chrom = list(set(extract(regions)))
# Path to the input molecular phenotype data.
input: phenoFile ,for_each = "regions"
output: f'{cwd}/{name}.{_regions[3]}.{_regions[4]}.mol_phe.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
bash: expand = "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout',container = container, entrypoint=entrypoint
    zcat $[_input] | head -1 > $[_output:n]
    zcat $[_input] | grep  $[_regions[4] if gene_name_as_phenotype_id else _regions[3]] >> $[_output:n]
    bgzip -f $[_output:n]
    tabix -p bed $[_output] -f
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_column:" `zcat $i | grep -v "##"   | head -1 | wc -w `   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

In [None]:
[partition_by_gene_2]
input: group_by = "all"
output: f'{cwd}/{name}.per_gene.recipe'
import pandas as pd
region_df = pd.DataFrame({"#id" : [x[3] for x in regions] ,"dir" : _input})
region_df.to_csv(_output,index = 0,sep = "\t")

FIXME: isn't it just the partition by gene code?

In [None]:
[bam_subsetting]
parameter: region = "chr21 chr22"
input: phenoFile , group_by = 1
output: f'{cwd}/{_input:bn}.subsetted.bam'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    samtools view -b ${_input} ${region} > ${_output}

FIXME: This part should go to the RNA-seq calling pipeline

In [None]:
[bam_to_fastq]
input: phenoFile, group_by = 1
output: f'{cwd}/{_input:bn}.1.fastq',f'{cwd}/{_input:bn}.2.fastq'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint=entrypoint
    samtools sort -n ${_input} -o ${_output[0]:nn}.sorted.bam
    bedtools bamtofastq -i ${_output[0]:nn}.sorted.bam -fq ${_output[0]} -fq2 ${_output[1]}

In [None]:
# Extract samples from expression data generated by RNASeQC
[gct_extract_samples]
parameter: phenoFile = path
parameter: keep_samples = path
input: phenoFile
output: f'{_input[0]:nn}.sample_matched.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = "$[ ]", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout', container = container
    library("dplyr")
    library("readr")
    phenoFile = read_delim($[_input[0]:ar], "\t", col_names = T, comment = "#")
    sample_lookup = read_delim($[keep_samples:ar], "\t" ,col_names = T, comment = "#")
    ## Make phenoFile consistent with sample_lookup by keeping only the shared samples via select()
    int = intersect(colnames(phenoFile),unlist(sample_lookup[,1]))
    phenoFile_tmp = phenoFile%>%select(c(colnames(phenoFile)[1],all_of(int)))
    ## Add 2 header lines, https://github.com/getzlab/rnaseqc/blob/286f99dfd4164d33014241dd4f3149da0cddf5bf/src/RNASeQC.cpp#L426
    cat(paste("#1.2\n#", nrow(phenoFile_tmp), ncol(phenoFile_tmp) - 2, "\n"), file=$[_output:nr], append=FALSE)
    phenoFile_tmp%>%write_delim($[_output:nr],delim = "\t",col_names = T, append = T)