# Phenotype data preprocessing

This mini-protocol documents the shared post-processing steps and a few utilities for handling molecular phenotype files, including imputation.

## Data Input

- `protocol_example/protocol_example.protein.csv`
- `output/sample_meta/protocol_example.protein.sample_overlap.txt`
- `reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf`

## Data Output
`output/phenotype/protocol_example.protein.bed.gz` and `output/phenotype/protocol_example.protein.bed.gz.tbi`

## Steps in detail

### Phenotype Annotation
This step annotates each gene in the original phenotype matrix with its corresponding `chr`, `start`, `end`, and `gene_id`.

In [8]:
sos run pipeline/gene_annotation.ipynb annotate_coord_protein \
    --cwd output/phenotype \
    --phenoFile protocol_example/protocol_example.protein.csv \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --phenotype-id-type gene_name \
    --sample-participant-lookup output/sample_meta/protocol_example.protein.sample_overlap.txt \
    --container containers/rna_quantification.sif

INFO: Running annotate_coord_protein: 
INFO: annotate_coord_protein is completed.
INFO: annotate_coord_protein output:   /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.region_list
INFO: Workflow annotate_coord_protein (ID=w05c59b6153d792e4) is executed successfully with 1 completed step.
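
Conceptually, this step is a lookup from gene name to genomic coordinates. The following is a minimal Python sketch of that idea, not the code in `pipeline/gene_annotation.ipynb`. The function names, the toy GTF records, and the choice to anchor each gene at a one-base, strand-aware start position are all assumptions made for illustration; the sample-to-participant renaming is left out.

```python
import io
import re

# Two made-up GTF gene records (hypothetical coordinates, not from the reference file).
GTF_TEXT = (
    'chr1\tsrc\tgene\t1000\t5000\t.\t+\t.\tgene_id "ENSG_A"; gene_name "GENEA";\n'
    'chr1\tsrc\tgene\t8000\t9500\t.\t-\t.\tgene_id "ENSG_B"; gene_name "GENEB";\n'
)


def load_gene_coords(gtf_handle):
    """Map gene_name -> (chr, start, end, gene_id) using GTF 'gene' records."""
    coords = {}
    for line in gtf_handle:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9 or fields[2] != "gene":
            continue
        # Parse the attribute column into a dict, e.g. {'gene_id': ..., 'gene_name': ...}.
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        # Assumption: use the strand-aware first base of the gene as a one-base interval.
        anchor = int(fields[3]) if fields[6] == "+" else int(fields[4])
        coords[attrs["gene_name"]] = (fields[0], anchor - 1, anchor, attrs["gene_id"])
    return coords


def annotate(pheno_rows, coords):
    """Prefix each (gene_name, values...) row with coordinates; drop unmapped genes."""
    out = []
    for name, *values in pheno_rows:
        if name in coords:
            chrom, start, end, gene_id = coords[name]
            out.append([chrom, start, end, gene_id, *values])
    # Sort by chromosome and start so the result can be indexed by position.
    return sorted(out, key=lambda r: (r[0], r[1]))


coords = load_gene_coords(io.StringIO(GTF_TEXT))
annotated = annotate(
    [["GENEB", 0.5, -1.2], ["GENEA", 1.1, 0.3], ["UNKNOWN", 0.0, 0.0]],
    coords,
)
for row in annotated:
    print(*row, sep="\t")
```

Genes missing from the annotation (here `UNKNOWN`) are silently dropped in this sketch; whether the real workflow drops, reports, or fails on them should be checked against the pipeline notebook.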


The annotation output looks as follows:

In [2]:
zcat output/phenotype/protocol_example.protein.bed.gz | head | cut -f 1-6

#chr	start	end	ID	sample_1	sample_2
chr12	752578	752579	ENSG00000060237_Q9H4A3	0.238966360190167	-0.611171227886468
chr12	990508	990509	ENSG00000082805_Q8IUD2	-1.7263446480966	-1.86313205860919
chr12	2794969	2794970	ENSG00000004478_Q02790	-1.17242006085983	-0.938018529427372
chr12	4649113	4649114	ENSG00000139180_Q16795	-1.8025806392753	2.33608132863355
chr12	6124769	6124770	ENSG00000110799_P04275	2.28733225877204	0.369455907879097
chr12	6534516	6534517	ENSG00000111640_P04406	0.068385837672252	1.14569060082588
chr12	6852147	6852148	ENSG00000111667_P45974	-0.326438251270511	-1.14820827303759
chr12	6867118	6867119	ENSG00000111669_P60174	-0.859617481888594	1.40831244070821
chr12	6913744	6913745	ENSG00000111674_P09104	-0.162509513103512	0.160160289285627

gzip: stdout: Broken pipe
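
Because the file is later indexed with tabix, which requires position-sorted input, a quick structural check can catch problems early. The sketch below is a hypothetical helper (`check_bed` is not part of the pipeline) that verifies the four leading columns, the field count per line, and the sort order within each chromosome. The sample rows are copied from the output above.

```python
import io


def check_bed(handle):
    """Return a list of problems found in a phenotype BED stream (empty means none)."""
    problems = []
    header = handle.readline().rstrip("\n").split("\t")
    if header[:4] != ["#chr", "start", "end", "ID"]:
        problems.append(f"unexpected header: {header[:4]}")
    prev = None
    # Data lines start at line 2 of the file.
    for n, line in enumerate(handle, start=2):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(header):
            problems.append(f"line {n}: {len(fields)} fields, expected {len(header)}")
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start >= end:
            problems.append(f"line {n}: start >= end")
        if prev and prev[0] == chrom and start < prev[1]:
            problems.append(f"line {n}: not sorted by start within {chrom}")
        prev = (chrom, start)
    return problems


sample = (
    "#chr\tstart\tend\tID\tsample_1\tsample_2\n"
    "chr12\t752578\t752579\tENSG00000060237_Q9H4A3\t0.238966360190167\t-0.611171227886468\n"
    "chr12\t990508\t990509\tENSG00000082805_Q8IUD2\t-1.7263446480966\t-1.86313205860919\n"
)
problems = check_bed(io.StringIO(sample))
print(problems)  # prints []
```

To run the same check on the real file, the stream could be opened with `gzip.open("output/phenotype/protocol_example.protein.bed.gz", "rt")` from the standard library.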


In [9]:
head output/phenotype/protocol_example.protein.region_list

#chr	start	end	ID	path
chr12	752578	752579	ENSG00000060237_Q9H4A3	/home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12	990508	990509	ENSG00000082805_Q8IUD2	/home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12	2794969	2794970	ENSG00000004478_Q02790	/home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12	4649113	4649114	ENSG00000139180_Q16795	/home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12	6124769	6124770	ENSG00000110799_P04275	/home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12	6534516	6534517	ENSG00000111640_P04406	/home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype

### Normalization
The ROSMAP proteomics data is already normalized. Nothing to do here.

### Mean Imputation

FIXME: mention that for eQTL it is fine to skip this.

FIXME: We are still working out the best approach for this step. It will be updated with the proper imputation command once the simulation results are finalized (currently most likely using `flashier`).
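
Until the final command is settled, the idea in this section's heading can be illustrated with plain per-feature mean imputation. This is only a sketch of the generic technique: it is not the `flashier`-based approach mentioned above and not pipeline code, and the function name `mean_impute` is hypothetical. Each feature (row) has its missing entries replaced by the mean of its observed values.

```python
import math


def mean_impute(row):
    """Replace missing values (None or NaN) in one feature's row with the observed mean."""

    def missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = [v for v in row if not missing(v)]
    if not observed:
        # Nothing to average over; return the row unchanged.
        return list(row)
    mean = sum(observed) / len(observed)
    return [mean if missing(v) else v for v in row]


imputed = mean_impute([1.0, None, 3.0, float("nan")])
print(imputed)  # prints [1.0, 2.0, 3.0, 2.0]
```

Mean imputation shrinks the variance of features with many missing values, which is one reason a model-based alternative may be preferable.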

### Partition by chroms

This step is required for cis-QTL analysis with TensorQTL. In this example the output is two sets of files, one for chromosome 21 and one for chromosome 22.

In [5]:
sos run pipeline/phenotype_formatting.ipynb phenotype_by_chrom \
    --cwd output/phenotype_by_chrom \
    --phenoFile output/phenotype/protocol_example.protein.bed.gz \
    --chrom `for i in {21..22}; do echo chr$i; done` \
    --container containers/bioinfo.sif

INFO: Running phenotype_by_chrom_1: 
INFO: phenotype_by_chrom_1 (index=1) is completed.
INFO: phenotype_by_chrom_1 (index=0) is completed.
INFO: phenotype_by_chrom_1 output:   output/phenotype_by_chrom/protocol_example.protein.bed.chr22.bed.gz output/phenotype_by_chrom/protocol_example.protein.bed.chr21.bed.gz in 2 groups
INFO: Running phenotype_by_chrom_2: 
INFO: Note: NumExpr detected 40 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO: NumExpr defaulting to 8 threads.
INFO: phenotype_by_chrom_2 is completed.
INFO: phenotype_by_chrom_2 output:   output/phenotype_by_chrom/protocol_example.protein.bed.phenotype_by_chrom_files.txt
INFO: Workflow phenotype_by_chrom (ID=wf1fa19fa67981064) is executed successfully with 2 completed steps and 3 completed substeps.
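
The partitioning itself is simple to picture: records are grouped by their first column, and each group keeps a copy of the header. The sketch below shows that logic on made-up in-memory lines. `split_by_chrom` is a hypothetical name, and compression, indexing, and writing the per-chromosome file list (all handled by the real workflow) are omitted.

```python
from collections import defaultdict


def split_by_chrom(lines, chroms):
    """Group BED data lines by chromosome, repeating the header in each group."""
    header, *records = lines
    groups = defaultdict(list)
    for rec in records:
        chrom = rec.split("\t", 1)[0]
        if chrom in chroms:
            groups[chrom].append(rec)
    # Keep the requested chromosome order and skip chromosomes with no records.
    return {c: [header, *groups[c]] for c in chroms if groups[c]}


# Toy input with hypothetical IDs and values.
bed = [
    "#chr\tstart\tend\tID\tsample_1",
    "chr21\t100\t101\tgene_a\t0.1",
    "chr22\t200\t201\tgene_b\t0.2",
    "chr21\t300\t301\tgene_c\t0.3",
]
parts = split_by_chrom(bed, ["chr21", "chr22"])
for chrom, chunk in parts.items():
    print(chrom, len(chunk) - 1, "records")
```

Chromosomes that were requested but have no records are left out of the result here; the real workflow may handle that case differently.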
