# GTEx pipeline execution interface

## Preprocessing
See [this page](https://gaow.github.io/mvarbvs/doc/writeup/GTEx7_Analysis_Plan.html#Preprocessing) and [this meeting note](https://gaow.github.io/mvarbvs/doc/writeup/Meetings.html#Project-meeting-20170518) for details. 

In [2]:
%sossave prep.sos -f -x
#!/usr/bin/env sos-runner
#fileformat=SOS1.0

# Usage:
# ./prep.sos download
# ./prep.sos 

%include ResourceManagement as RM
%include Misc as MC
%include DataWrestling as DW

[global]

#
# Auxiliary steps
#

[download]
# Resource preparation
sos_run('RM.plink', workdir = CONFIG['wd'])
sos_run('RM.minimac3', workdir = CONFIG['wd'])
sos_run('RM.vcftools', workdir = CONFIG['wd'])
sos_run('RM.peer', workdir = CONFIG['wd'])
sos_run('RM.king', workdir = CONFIG['wd'])


#
# Workhorse
#

[data_summary]
input: CONFIG['genotype']
sos_run("MC.genotype_stats", workdir = CONFIG['wd'])

[genotype_preprocessing]
input: CONFIG['genotype']
sos_run("DW.vcf_by_chrom", workdir = CONFIG['wd'])

[rna_preprocessing]
input: CONFIG['rna_rpkm'], CONFIG['rna_cnts'], CONFIG['genotype'], CONFIG['sample_attr'] 
sos_run("MC.rnaseq", workdir =  CONFIG['wd'])

[pca_plot_broad]
input: '/tmp/GTExPCA.ped'
sos_run("MC.global_ancestry:2", workdir = '/tmp')

[genotype_pca_broad]
parameter: project_name = "GTEx7.dbGaP"
input: CONFIG['genotype']
sos_run("DW.vcf_by_chrom+DW.broad_to_plink+MC.LD_pruning+MC.global_ancestry", 
        workdir = CONFIG['wd'],
        project_name = project_name)

[genotype_pca_umich]
parameter: project_name = "GTEx7.Imputed"
input: CONFIG['imputed_genotype']
sos_run("DW.umich_to_plink+MC.LD_pruning+MC.global_ancestry", 
        workdir = CONFIG['wd'],
        project_name = project_name)

[genotype_pca_umich_filtered]
# Filtered imputation data removing imputed sites
input: "{}/GTEx7.Imputed.genotyped.filtered.bed".format(CONFIG['wd'])
sos_run("MC.LD_pruning+MC.global_ancestry", 
        workdir = CONFIG['wd'])

[gene_annotation: provides = "${CONFIG['rna_cnts']!n}.annotation"]
input: "${CONFIG['rna_cnts']}"
output: "${CONFIG['rna_cnts']!n}.annotation"
sos_run("MC.ensembl_annotation", workdir = CONFIG['wd'])

[genotype_formatting]
parameter: original_variants = "{}/GTEx7.dbGaP.bed".format(CONFIG['wd'])
parameter: gene_annotation = "${CONFIG['rna_cnts']!n}.annotation"
depends: original_variants
input: "{}/GTEx7.Imputed.bed".format(CONFIG['wd'])
sos_run("DW.variants_filter+DW.plink_to_hdf5_batch", 
        workdir = CONFIG['wd'], 
        include = original_variants,
        ann = gene_annotation)

[covariate_preparation]
# Covariates are: sex, platform, 3 PC and PEER factors
parameter: peer_factors = glob.glob("{}/*_PEER_covariates.txt".format("${CONFIG['wd']!a}"))
parameter: pc_file = "{}/GTEx7.Imputed.prune.pc.ped".format(CONFIG['wd'])
parameter: attr_file = CONFIG['sample_attr']
parameter: covar_file = CONFIG['phenotype']
parameter: expression_file = CONFIG['expression_db']
sos_run("DW.recode_platform + DW.covariates_to_HDF5",
        workdir = CONFIG['wd'],
        peer_factors = peer_factors,
        pc_file = pc_file,
        attr_file = attr_file,
        covar_file = covar_file,
        output_file = "{}/GTEx7.Imputed.covariates.h5".format(CONFIG['wd']))

[make_toy]
# Create a toy example
sos_run("DW.subset_HDF5_data",
        workdir = CONFIG['wd'],
        ann_file = "${CONFIG['rna_cnts']!n}.annotation",
        geno_file = "{}/GTEx7.Imputed.genotyped.filtered.cis.h5".format(CONFIG['wd']),
        expr_file = "{}/${CONFIG['rna_rpkm']!bnn}.qnorm.std.h5".format(CONFIG['wd']),
        toy_file = CONFIG['toy_prefix'],
        gene_list = CONFIG['toy_gene_list'])

### Prepare computational resource
This downloads and installs most of the software needed by the analysis pipeline.

In [None]:
!./prep.sos download -c conf/20170507.conf -b ~/Documents/GTEx/bin -J 8 -j 1

### Data summary
This workflow computes summary statistics on the data, such as genotype missingness, and produces various diagnostic plots. More features will be added to it as the analysis develops.
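
As an illustration of the missingness statistics mentioned above, here is a minimal sketch of per-variant and per-sample missing rates. It is not the code behind `MC.genotype_stats`; the function names, the 0/1/2 genotype coding and the use of `None` for a missing call are all assumptions made for this example.

```python
from typing import List, Optional

# Assumed coding: 0/1/2 alternate-allele counts, None for a missing call.
Genotype = Optional[int]


def variant_missingness(calls: List[Genotype]) -> float:
    """Fraction of samples with a missing call at one variant (one row)."""
    if not calls:
        raise ValueError("variant has no samples")
    return sum(c is None for c in calls) / len(calls)


def sample_missingness(matrix: List[List[Genotype]]) -> List[float]:
    """Fraction of variants with a missing call, for each sample (column)."""
    if not matrix:
        raise ValueError("matrix has no variants")
    n_variants = len(matrix)
    n_samples = len(matrix[0])
    return [
        sum(row[j] is None for row in matrix) / n_variants
        for j in range(n_samples)
    ]


# Toy matrix: rows are variants, columns are samples.
toy = [
    [0, 1, None, 2],
    [None, None, 1, 0],
    [2, 2, 2, 2],
]
per_variant = [variant_missingness(row) for row in toy]
per_sample = sample_missingness(toy)
```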

In [None]:
!./prep.sos data_summary -c conf/20170507.conf -b ~/Documents/GTEx/bin -J 8 -j 1

### Genotype imputation

In [None]:
!./prep.sos genotype_preprocessing -c conf/20170507.conf -b ~/Documents/GTEx/bin -J 8 -j 1

Imputation was done with the [Michigan Imputation Server](https://imputationserver.sph.umich.edu) because it uses the Haplotype Reference Consortium reference panel (32,914 samples), which is not otherwise publicly available. [Here is how to prepare data](https://imputationserver.sph.umich.edu/start.html#!pages/help) for this service. The prepared files are uploaded to the [Michigan Imputation Server](https://imputationserver.sph.umich.edu). [Here is the configuration](https://gaow.github.io/mvarbvs/img/UMichImputation.png) of the imputation job on the UMich server, and [here](https://gaow.github.io/mvarbvs/img/UMichImputationResult.pdf) is a summary of the outcome.

### RNA-seq preprocessing
This workflow includes data normalization and PEER factor analysis. It results in 4 **analysis ready expression data files** in HDF5 format, which hold different versions / organizations of the same information: empirical quantile normalized and standard normal quantile normalized, each saved either as a flat file or grouped by tissue. This is so far the most computationally intensive step.
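
To make "standard normal quantile normalized" concrete, the sketch below maps one gene's expression values across samples to standard normal quantiles by rank. It is only one common way to do this and may differ from what `MC.rnaseq` does: the function names, the `r / (n + 1)` rank-to-probability offset and the average-rank treatment of ties are assumptions.

```python
from statistics import NormalDist
from typing import List


def rank_average(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def standard_normal_quantiles(values: List[float]) -> List[float]:
    """Replace each value by the N(0, 1) quantile of its rank."""
    n = len(values)
    std_normal = NormalDist()
    # Assumption: rank r is mapped to probability r / (n + 1).
    return [std_normal.inv_cdf(r / (n + 1)) for r in rank_average(values)]


# One gene measured in five samples (made-up values, with one tie).
rpkm = [5.2, 0.1, 12.7, 3.3, 3.3]
z = standard_normal_quantiles(rpkm)
```

The transform keeps the ordering of samples within the gene but discards the original scale, so the result is symmetric around zero regardless of how skewed the input is.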

In [None]:
!./prep.sos rna_preprocessing -c conf/20170507.conf -b ~/Documents/GTEx/bin -J 38 -j 1

### Genotype filtering and formatting, followed by global ancestry analysis 
`genotype_pca_umich_filtered` must be run after both `genotype_pca_umich` and `genotype_pca_broad` have finished.

Genotypes are converted to PLINK binary format. During the conversion, imputed sites are removed and variant IDs for tri-allelic sites are fixed. Both the original and the imputed data are converted for PCA. The [results are compared](https://github.com/gaow/mvarbvs/issues/15#issuecomment-303814249).
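
The sketch below shows one plausible way to remove imputed sites: keep only the variants of the imputed data that also appear in the original genotype data, which is roughly what the `include = original_variants` argument of `genotype_formatting` suggests. It does not reproduce `DW.variants_filter`. The `Variant` type and the `chrom_pos_ref_alt` ID scheme are hypothetical; alleles are included in the ID so that each alternate allele at a tri-allelic position gets its own identifier.

```python
from typing import Iterable, List, NamedTuple, Set


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def variant_id(v: Variant) -> str:
    """Hypothetical ID scheme that stays unique at multi-allelic positions."""
    return f"{v.chrom}_{v.pos}_{v.ref}_{v.alt}"


def keep_genotyped(imputed: Iterable[Variant], genotyped_ids: Set[str]) -> List[Variant]:
    """Drop variants that are absent from the original (genotyped) data."""
    return [v for v in imputed if variant_id(v) in genotyped_ids]


# Position 2000 is tri-allelic: reference C with alternates T and G.
original = [
    Variant("1", 1000, "A", "G"),
    Variant("1", 2000, "C", "T"),
    Variant("1", 2000, "C", "G"),
]
# The imputed data contains one extra, purely imputed site.
imputed = original + [Variant("1", 1500, "G", "A")]
kept = keep_genotyped(imputed, {variant_id(v) for v in original})
```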

In [None]:
!./prep.sos genotype_pca_umich -c conf/20170507.conf -b ~/Documents/GTEx/bin/ -J 4
!./prep.sos genotype_pca_broad -c conf/20170507.conf -b ~/Documents/GTEx/bin/ -J 4
!./prep.sos pca_plot_broad
!./prep.sos gene_annotation -c conf/20170507.conf -b ~/Documents/GTEx/bin -J 6 -j 1
!./prep.sos genotype_formatting -c conf/20170507.conf -b ~/Documents/GTEx/bin -J 6 -j 1
!./prep.sos genotype_pca_umich_filtered -c conf/20170507.conf -b ~/Documents/GTEx/bin/ -J 6

### Variants annotation, cis-SNP selection and genotype formatting
Genes are annotated with chromosomal positions, and variants are annotated to genes. Then, for each gene, variants within 2Mb of the gene's TSS are selected. The result is a **single analysis ready file** in HDF5 format containing ~50K groups of genotype data (named by gene).
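
The cis-SNP selection can be sketched as a window query on sorted variant positions. This is an illustration rather than the code in `DW.plink_to_hdf5_batch`: the function name, the toy gene names and the choice of a symmetric window of 2Mb on each side of the TSS are assumptions.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Tuple

CIS_WINDOW = 2_000_000  # assumed: 2Mb on each side of the TSS


def cis_slice(tss: int, positions: List[int], window: int = CIS_WINDOW) -> Tuple[int, int]:
    """Return [start, stop) indices of sorted `positions` inside the window."""
    start = bisect_left(positions, tss - window)
    stop = bisect_right(positions, tss + window)
    return start, stop


# Hypothetical annotation: gene name -> (chromosome, TSS position).
genes: Dict[str, Tuple[str, int]] = {
    "GENE_A": ("1", 5_000_000),
    "GENE_B": ("2", 1_000_000),
}
# Sorted variant positions per chromosome (made-up values).
variants: Dict[str, List[int]] = {
    "1": [100, 2_999_999, 3_000_000, 5_000_000, 7_000_000, 7_000_001],
    "2": [500, 3_000_000, 3_000_001],
}

# One group of variants per gene, mirroring the per-gene HDF5 groups.
groups: Dict[str, List[int]] = {}
for name, (chrom, tss) in genes.items():
    start, stop = cis_slice(tss, variants[chrom])
    groups[name] = variants[chrom][start:stop]
```

Using binary search on sorted positions keeps each gene's lookup logarithmic in the number of variants on its chromosome, which matters when the query is repeated for tens of thousands of genes.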

In [None]:
!./prep.sos gene_annotation -c conf/20170507.conf -b ~/Documents/GTEx/bin -J 6 -j 1
!./prep.sos genotype_formatting -c conf/20170507.conf -b ~/Documents/GTEx/bin -J 6 -j 1

### Merge covariates info
The covariates collected so far are sample phenotypes (sex), sample attributes (genotyping platform), the first 3 principal components for population structure, and PEER factors. They are stored in separate files.

This workflow consolidates these files and generates a **single analysis ready covariate file** in HDF5 format.
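
A minimal sketch of the consolidation is an inner join of the covariate tables on sample ID. It is not the implementation of `DW.covariates_to_HDF5`: the `Table` layout, the function name and the toy values are assumptions, and the result is kept in memory instead of being written to HDF5.

```python
from typing import Dict, List

# Assumed layout: sample ID -> {covariate name: value}.
Table = Dict[str, Dict[str, float]]


def merge_covariates(tables: List[Table]) -> Table:
    """Inner-join covariate tables on sample ID."""
    if not tables:
        return {}
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)  # keep samples present in every table
    merged: Table = {}
    for sample in sorted(shared):
        row: Dict[str, float] = {}
        for table in tables:
            clash = row.keys() & table[sample].keys()
            if clash:
                raise ValueError(f"duplicate covariate(s) {sorted(clash)} for {sample}")
            row.update(table[sample])
        merged[sample] = row
    return merged


# Made-up inputs, one table per source file; S3 has no PCs.
sex = {"S1": {"sex": 1.0}, "S2": {"sex": 2.0}, "S3": {"sex": 1.0}}
platform = {"S1": {"platform": 0.0}, "S2": {"platform": 1.0}, "S3": {"platform": 1.0}}
pcs = {
    "S1": {"PC1": 0.01, "PC2": -0.02, "PC3": 0.00},
    "S2": {"PC1": -0.03, "PC2": 0.04, "PC3": 0.01},
}
peer = {"S1": {"PEER1": 0.5}, "S2": {"PEER1": -0.3}, "S3": {"PEER1": 0.1}}

covariates = merge_covariates([sex, platform, pcs, peer])
```

An inner join silently drops samples that lack any one covariate, so in practice it is worth reporting which samples were lost.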

In [None]:
!./prep.sos covariate_preparation -c conf/20170507.conf -b ~/Documents/GTEx/bin/ 

### Generate a toy data-set
Finally, a toy data-set is created from the data bundle. It can be used for methods / pipeline development. The genes selected for the toy are the same as in the [LD show-case in the mash paper](https://stephenslab.github.io/gtexresults_mash/TwoSNP/2SNP.sos) (although the workflow itself takes an arbitrary list of genes). See [this table](https://stephenslab.github.io/gtexresults_mash/TwoSNP/) for the motivation behind selecting these genes.
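
Conceptually, building the toy amounts to keeping only the per-gene groups named in a gene list. The sketch below shows that idea on plain dictionaries; it is not `DW.subset_HDF5_data`, and the function name, gene names and data are hypothetical.

```python
from typing import Dict, List, Tuple


def subset_by_genes(
    groups: Dict[str, List[int]], gene_list: List[str]
) -> Tuple[Dict[str, List[int]], List[str]]:
    """Keep the groups named in `gene_list`; also report genes not found."""
    kept = {gene: groups[gene] for gene in gene_list if gene in groups}
    missing = [gene for gene in gene_list if gene not in groups]
    return kept, missing


# Made-up per-gene data standing in for the groups of the full bundle.
full = {"GENE_A": [0, 1, 2], "GENE_B": [2, 2, 0], "GENE_C": [1, 0, 1]}
toy, not_found = subset_by_genes(full, ["GENE_C", "GENE_A", "GENE_X"])
```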

In [None]:
!./prep.sos make_toy -c conf/20170507.conf -b ~/Documents/GTEx/bin -J 6 -j 1

## Simulations
Please see [this notebook](../documents/MR-ASH-Simulation.html) for interactive code that simulates expression data for given genotypes.