# TWAS FUSION analysis

This notebook implements a TWAS analysis workflow using the FUSION software.

## Aim

The aim of this pipeline is to build a prediction model for gene expression using cis-SNPs by obtaining "weights" from multiple regression of gene expression levels on SNP genotypes. The weights can then be combined with the GWAS data to test for association between predicted gene expression and GWAS phenotypes. Either GWAS summary statistics or GWAS genotype + phenotype data can be used for the association testing. 
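For the summary-statistics case, the test statistic described for FUSION combines the expression weights $w$, the GWAS Z-scores $z$ and the SNP LD matrix $\Sigma$ as $Z_{TWAS} = w^T z / \sqrt{w^T \Sigma w}$. The sketch below is a minimal pure-Python illustration of that formula; the function name and all numbers are made up for the example and do not come from this pipeline.

```python
import math

def twas_z(weights, gwas_z, ld):
    """Compute Z_TWAS = w'z / sqrt(w' LD w) for one gene (illustrative helper)."""
    n = len(weights)
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    variance = sum(weights[i] * ld[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    return numerator / math.sqrt(variance)

# Hypothetical values for three cis-SNPs
weights = [0.5, 0.0, -0.2]          # expression weights
gwas_z = [3.0, 1.0, -2.0]           # GWAS Z-scores for the same SNPs
ld = [[1.0, 0.3, 0.0],              # SNP correlation (LD) matrix
      [0.3, 1.0, 0.1],
      [0.0, 0.1, 1.0]]

z = twas_z(weights, gwas_z, ld)
print(round(z, 3))
```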

## Overview

TWAS tests for association between gene expression and a trait by way of SNPs.
SNPs can influence complex traits both directly and by modulating the expression levels of genes.
Integrating expression measurements with larger-scale GWAS summary association statistics therefore helps identify the genes associated with the complex traits of interest.

With this method, candidate genes whose expression levels are significantly associated with complex traits can be identified without measuring gene expression in every individual, which is expensive. A relatively small reference set with both gene expression and genotype data is used to impute expression into a much larger set of phenotyped individuals from their SNP genotype data.

The imputed expression can then be viewed as a linear model of genotypes with __weights based on the correlation between SNPs and gene expression__ in the training data while accounting for linkage disequilibrium (LD) among SNPs. The imputed gene expression is then correlated with the trait to perform a transcriptome-wide association study (TWAS) and identify __significant expression-trait associations__.
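The linear model can be sketched as a weighted sum of allele dosages per individual. This is a simplified illustration only: the function name, genotypes and weights are hypothetical, and any genotype standardization is ignored.

```python
def impute_expression(genotypes, weights):
    """Predict expression for each individual as sum_j dosage_ij * weight_j."""
    return [sum(g * w for g, w in zip(row, weights)) for row in genotypes]

# Hypothetical dosages (rows: individuals, columns: cis-SNPs)
genotypes = [
    [0, 1, 2],
    [2, 0, 1],
    [1, 1, 0],
]
# Hypothetical per-SNP weights learned from the reference panel
weights = [0.5, -0.25, 0.1]

predicted = impute_expression(genotypes, weights)
print(predicted)
```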
 
The weights are computed with several models: `blup`, `bslmm`, `lasso`, `top1` and `enet`. `blup` (best linear unbiased predictor) and `bslmm` (Bayesian sparse linear mixed model) are fit using GEMMA, `lasso` using PLINK, and `enet` (elastic net) using the `cv.glmnet` function in R.

Before the weights are computed, the heritability of each gene is estimated using GCTA, and genes with non-significant heritability are screened out.
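The screening step amounts to keeping genes with a positive heritability estimate and a sufficiently small p-value. A minimal sketch follows; the record layout, the gene names and the 0.01 cutoff are assumptions chosen for illustration rather than values taken from this pipeline.

```python
def heritable_genes(records, p_threshold=0.01):
    """Keep genes with a positive heritability estimate and p-value below the cutoff.

    `records` is a list of (gene, hsq_estimate, p_value) tuples (assumed layout).
    """
    return [gene for gene, hsq, pval in records if hsq > 0 and pval < p_threshold]

# Hypothetical heritability results
records = [
    ("GENE_A", 0.12, 0.001),   # heritable: kept
    ("GENE_B", 0.02, 0.40),    # not significant: dropped
    ("GENE_C", -0.01, 0.005),  # negative estimate: dropped
]

kept = heritable_genes(records)
print(kept)
```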

## Pre-requisites

We provide a container image `docker://gaow/twas` that contains all software needed to run the pipeline. If you would like to configure it by yourself, please make sure you install the following software before running this notebook:
- GCTA
- PLINK
- GEMMA
- A modified `FUSION.compute_weights.R` script can be found [in this GitHub repo](https://github.com/cumc/neuro-twas/blob/master/Workflow/FUSION.compute_weights.R).
- The original `FUSION.assoc_test.R` script can be found [in the author's GitHub repo](https://github.com/gusevlab/fusion_twas).

You need to make both `FUSION.compute_weights.R` and `FUSION.assoc_test.R` executable with the `Rscript` command. See [this line]() for an example.

# Input and Output
## Input
- `--gene_exp_file`: a gene expression table with genes as rows and samples as columns. Each gene also requires at least one column specifying the chr and pos (or, alternatively, the Start and End positions). The chr column must use the same chromosome naming as the genotype file. The sample names must match the sample IDs in the genotype file.
- `--geno-path`: the path to a genotype inventory, which lists the paths of all genotype files in bgen or PLINK format.
- `--genotype_file_directory`: the path to the genotype inventory, which lists the paths of all genotype files in PLINK format.
- `--genotype_prefix`: the prefix of the genotype files, up to the chromosome name.
- `--window`: the distance to extend beyond the specified start and end sites when defining the cis region of a gene. If the gene expression table has only one position column, set the window to a large number such as 5E5.
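Because the sample names in the expression table must match the genotype sample IDs, it can help to compare the two before running the workflow. The sketch below assumes a PLINK `.fam` file, whose second column is the within-family ID; the helper names and sample IDs are hypothetical.

```python
def read_fam_sample_ids(fam_lines):
    """Return the within-family IDs (second column) from PLINK .fam lines."""
    return [line.split()[1] for line in fam_lines if line.strip()]

def compare_samples(expression_samples, fam_samples):
    """Report which sample IDs are shared or unique to either data set."""
    expr, geno = set(expression_samples), set(fam_samples)
    return {
        "shared": sorted(expr & geno),
        "expression_only": sorted(expr - geno),
        "genotype_only": sorted(geno - expr),
    }

# Hypothetical .fam content and expression table header
fam_lines = ["S1 S1 0 0 0 -9", "S2 S2 0 0 0 -9", "S4 S4 0 0 0 -9"]
expression_samples = ["S1", "S2", "S3"]

report = compare_samples(expression_samples, read_fam_sample_ids(fam_lines))
print(report)
```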

**FIXME: Hao, the text below was moved from your analysis code. This information belongs in the input data section.**

```
# PATH TO DIRECTORY CONTAINING genotype_file_directory DATA (FROM FUSION WEBSITE or https://data.broadinstitute.org/alkesgroup/FUSION/genotype_file_directory.tar.bz2)
# THIS IS USED TO RESTRICT INPUT SNPS TO REFERENCE IDS ONLY
# GEUVADIS DATA WAS DOWNLOADED FROM https://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/files/analysis_results/
```

**FIXME: here is also some text I dug up later in your workflow code. Please explain it up-front. We would not expect users to read this notebook beyond the "Working example" section.**

```
1. The gene expression phenotype: a three-column table for each gene, with the first two columns specifying the family ID and within-family ID of the samples. In the current case, where all samples are unrelated, the first two columns are simply the sample ID. The third column is the actual gene expression value.
2. The PLINK trio of files (bed/bim/fam) for each specific gene, containing only the SNPs in the region whose expression is recorded. In particular, the SNPs are filtered according to the genomic region defined by Position +/- window.
```
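The three-column phenotype table described above can be sketched as follows for unrelated samples, where the family ID and within-family ID are both the sample ID. The function name and the values are illustrative only.

```python
def build_pheno_rows(sample_ids, expression_values):
    """Build (FID, IID, expression) rows for unrelated samples."""
    if len(sample_ids) != len(expression_values):
        raise ValueError("Need exactly one expression value per sample")
    return [(sid, sid, value) for sid, value in zip(sample_ids, expression_values)]

# Hypothetical expression values for one gene
sample_ids = ["S1", "S2", "S3"]
expression_values = [0.42, -1.10, 0.07]

rows = build_pheno_rows(sample_ids, expression_values)
# Tab-separated lines, one per sample
pheno_text = "\n".join("\t".join(str(x) for x in row) for row in rows)
print(pheno_text)
```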

Also I changed your genotype file input to a `genotype_list`: An index text file with two columns of chromosome and the corresponding PLINK bed file.
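As an illustration of how such an index file might be read, the sketch below parses the two columns into a dictionary and looks up a chromosome with or without a `chr` prefix. The file contents and paths are hypothetical, and the helper names are not part of the workflow.

```python
import io

def read_genotype_index(handle):
    """Parse a two-column index (chromosome, PLINK bed path), skipping '#' lines."""
    inventory = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, bed_path = line.split()
        inventory[chrom] = bed_path
    return inventory

def lookup_genotype(inventory, chrom):
    """Look up a chromosome, accepting labels such as 'chr22' or 22."""
    chrom = str(chrom)
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    if chrom not in inventory:
        raise KeyError(f"Chromosome {chrom} is not in the genotype index")
    return inventory[chrom]

# Hypothetical index content
example = io.StringIO("#chr\tpath\n1\tld_ref/example.1.bed\n22\tld_ref/example.22.bed\n")
inventory = read_genotype_index(example)
print(lookup_genotype(inventory, "chr22"))
```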

## Output

- `.wgt.RDat`: the computed weights.

- `.hsq`: the heritability information for each gene.

- `.All_passed_gene.hsq`: the heritability information for all the genes in this run.

# Command interface 

In [2]:
!sos run twas_fusion.ipynb -h

usage: sos run twas_fusion.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  twas_fusion
  association_test

Global Workflow Options:
  --molecular-pheno VAL (as path, required)
                        Path to the input molecular phenotype data.
  --gwas-sumstat VAL (as path, required)
                        Path to GWAS summary statistics data (association
                        results between SNP and disease in a GWAS)
  --genotype-list VAL (as path, required)
                        An index text file with two columns of chromosome and
                        the corresponding PLINK bed file.
  --region-list VAL (as path, required)
                        An index text file 

# Working example
A minimal working example (MWE) dataset can be downloaded from the private repo:
https://github.com/cumc/neuro-twas/blob/master/TWAS_pipeline_MWE%202.zip
The genotype files can be downloaded from the following link:
https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2

**FIXME: please upload the data to synapse.org and take it out of github. On github we don't store large datasets. Please ask me about account information for synapse.org**

Running this MWE should take around 2 minutes. When using the following command on a gene expression file other than this MWE, pay extra attention to the gene_start and gene_end positions. Also, when too few or too many genes pass the heritability check, consider increasing or decreasing the `--window` option.

In [None]:
## Test pipeline with test data
## Switch back to absolute paths; otherwise there will be a file-not-found error in step 5
sos run twas_fusion.ipynb twas_fusion \
  --gwas_sumstat /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/Updated/data/sum_stat.sumstats \
  --molecular-pheno /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/Updated/data/gene_exp_file.txt \
  --wd /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/Updated/Working_at \
  --genotype_list /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/Updated/data/ld_ref \
  --region_list /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/Updated/data/region_list.txt \
  --region_name 1 \
  --data_start 5 \
  --window 500000 \
  --model blup lasso top1 enet \
  --container gaow/twas

## Association test only 
If using existing weights, use the association test (`association_test`) workflow. A minimal working example is shown below.

FIXME: please complete it.

# Global parameter settings
This section outlines the parameters that can be set in the command interface.

In [225]:
[global]
# Path to the input molecular phenotype data.
parameter: molecular_pheno = path
# Path to GWAS summary statistics data (association results between SNP and disease in a GWAS)
parameter: gwas_sumstat = path
# An index text file with two columns of chromosome and the corresponding PLINK bed file.
parameter: genotype_list = path
# An index text file with 4 columns specifying the chr, start, end and names of regions to analyze
parameter: region_list = path
# Path to the work directory of the weight computation: output weights and cache will be saved to this directory.
parameter: wd = path('./')
# Specify the directory to save fitted weights
parameter: weights_path = f'{wd:a}/WEIGHTS'
# Path to list of weights
parameter: weights_list = f'{weights_path}/{molecular_pheno:bn}.weights_list.txt'
# Specify the column in the molecular_pheno file that contains the name of the region of interest
parameter: region_name = int
# Specify the column in the molecular_pheno file where the actual data start
parameter: data_start = int
# Specify the scanning window for the up and downstream radius to analyze around the region of interest, in units of bp
parameter: window = 50000

# Specify which models are used to compute weights.
# Notice that `bslmm` can be very resource consuming.
parameter: model = ['bslmm', 'blup', 'lasso', 'top1', 'enet']
# Container option for software to run the analysis: docker or singularity
parameter: container = 'gaow/twas'
# Get regions of interest to focus on.
regions = [x.strip().split() for x in open(region_list).readlines() if x.strip() and not x.strip().startswith('#')]
geno_inventory = dict([x.strip().split() for x in open(genotype_list).readlines() if x.strip() and not x.strip().startswith('#')])

import os
def get_genotype_file(chrom, genotype_list, geno_inventory):
    # Normalize the chromosome label (e.g. 'chr1' -> '1') to match the inventory keys
    chrom = f'{chrom}'
    if chrom.startswith('chr'):
        chrom = chrom[3:]
    if chrom not in geno_inventory:
        raise ValueError(f"Chromosome {chrom} is not listed in {genotype_list}")
    geno_file = geno_inventory[chrom]
    if not os.path.isfile(geno_file):
        # Try the path relative to the directory of the genotype list
        geno_file = f'{genotype_list:ad}/' + geno_file
        if not os.path.isfile(geno_file):
            raise ValueError(f"Cannot find genotype file {geno_file}")
    # Drop the extension (e.g. `.bed`) to get the genotype file prefix
    geno_file = geno_file.split(".b",1)[0]
    return f'{geno_file}'
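The region list is read as whitespace-separated rows of chr, start, end and name, with blank lines and lines starting with `#` skipped. A slightly stricter stand-alone sketch of that parsing is shown here; the `Region` type, the `parse_regions` helper and the example rows are hypothetical and not part of the workflow.

```python
from typing import NamedTuple

class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str

def parse_regions(lines):
    """Parse 4-column region lines, skipping blank lines and '#' comments."""
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"Expected 4 columns, got {len(fields)}: {line!r}")
        chrom, start, end, name = fields
        regions.append(Region(chrom, int(start), int(end), name))
    return regions

# Hypothetical region list content
example = ["#chr start end name", "1 1000000 1050000 GENE_A", "", "22 200000 260000 GENE_B"]
parsed = parse_regions(example)
print(parsed)
```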

## FIXME: add explanation what this step is

Comments & FIXME:

1. This step should not allow for any errors. In general, we should not bypass errors, but rather face and fix them if they are under our control. I therefore changed `cat` to `stop`. In the future please try to avoid using allow error.
2. I added `#` to the header of region_list file and skipped any lines starting with `#`. Please see global section `regions` variable.
3. Notice the use of `:a` -- please always use that to make output files absolute paths. Then it would not matter where your work directory is. You only have to do it for the very first step of the workflow.

In [225]:
[twas_fusion_1]

input: molecular_pheno, for_each = "regions"
output: f'{wd:a}/cache/{_input:bn}.{_regions[3]}.exp',
        f'{wd:a}/cache/{_input:bn}.{_regions[3]}.pheno'

R: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container
    # Get the line number for the region in the file
    line_num = system("awk '($$[region_name]==\"$[_regions[3]]\") {print NR}' $[_input]", intern=T)
    if (length(line_num) == 0){
      stop( "Cannot find $[_regions[3]] in column $[region_name]  $[_input]")}
    yi <- data.table::fread(file = $[_input:r], skip = as.integer(line_num) - 1, nrows = 1)
    samplenames_yi <- data.table::fread(file = $[_input:r], skip = 0, nrows = 1)
    colnames(yi) <- colnames(samplenames_yi)
    readr::write_tsv(yi, path = "$[_output[0]]", na = "NA", append = FALSE, col_names = TRUE, quote_escape = "double")
    yi <- as.data.frame(yi[, $[data_start]:ncol(yi), drop = FALSE])
    yj <- rbind(colnames(yi),colnames(yi),yi)
    readr::write_tsv(as.data.frame(t(yj)), path = "$[_output[1]]", na = "NA", append = FALSE, col_names = TRUE, quote_escape = "double")

## FIXME: please document it briefly

Here we use `|| true` to bypass any errors (it ensures return status of the command is always `0`)

Notice the use of `group_with` which removes the need to extract region information from previous output files.

In [225]:
[twas_fusion_2]
input: group_by = 2, group_with = 'regions'
output: f'{_input[0]:n}.bed',
        f'{_input[0]:n}.bim',
        f'{_input[0]:n}.fam'
# look up for genotype file
geno_file = get_genotype_file(_regions[0],genotype_list,geno_inventory)

bash: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container
    ##### Get the locus genotypes for $[_regions[3]]
    plink --bfile $[geno_file] \
    --pheno $[_input[1]] \
    --make-bed \
    --out $[_output[0]:n] \
    --chr $[_regions[0]] \
    --from-bp $[max(0, int(_regions[1]) - window)] \
    --to-bp $[int(_regions[2]) + window] \
    --keep $[_input[1]] \
    --allow-no-sex || true
    touch $[_output]
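The cis window around a region is conventionally the region extended by `window` base pairs on each side. A minimal sketch of that arithmetic, assuming bp coordinates and clamping the lower bound at zero (the function name is illustrative):

```python
def cis_window(start, end, window):
    """Return (from_bp, to_bp) for a region extended by `window` bp on both sides."""
    if window < 0:
        raise ValueError("window must be non-negative")
    if end < start:
        raise ValueError("end must not be smaller than start")
    return max(0, start - window), end + window

print(cis_window(1_000_000, 1_050_000, 500_000))
print(cis_window(100, 200, 500_000))
```

Clamping matters for regions near the start of a chromosome, where subtracting the window would otherwise produce a negative position.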

## FIXME: please document it briefly

Comment: Your modified version of `FUSION.compute_weights.R` has a hard-coded path `./output`, which is not desirable. I have changed it to use `dirname(output)`.

In [225]:
#Actual weight computation 
[twas_fusion_3]
input: group_with = 'regions'
output: f'{weights_path}/{_input[0]:bn}.wgt.RDat'
import os
skip_if(os.path.getsize(f'{_input[0]}') == 0)

bash: expand= "$[ ]",stderr = f'{_output[0]:nn}.stderr', stdout = f'{_output[0]:nn}.stdout', container = container
    FUSION.compute_weights.R \
    --bfile $[_input[0]:n] \
    --tmp $[_input[0]:n].tmp \
    --out $[_output[0]:nn] \
    --verbose 0 \
    --save_hsq \
    --PATH_gcta `which gcta64` \
    --PATH_gemma `which gemma` \
    --models "$[",".join(model)]"   
    ## Create a dummy output file that will be deleted in the next step
    touch $[_output]

## FIXME: Please document

Comment: it's a lot cleaner to just code with SoS (using python syntax). See example below where I rewrote your R code.

Also I merged the contents of hsq files into the output list. Please fix the header of the file: replace `HSQ1` `HSQ2` and `HSQ3` with proper, informative column names.

In [225]:
[twas_fusion_4]
input: group_by = "all"
output: weights_list
import os
weight_files = [x for x in _input if not os.path.getsize(x) == 0]
regions_dict = dict([(x[3], (x[0], x[1], x[2])) for x in regions])
res = [["WGT", "ID","CHR","P0","P1","HSQ1","HSQ2","HSQ3"]]
for item in weight_files:
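    # Remove the `<molecular_pheno>.` prefix; `str.lstrip` strips a character set, not a prefix
    name = f'{item:bnn}'
    prefix = f'{molecular_pheno:bn}.'
    name = name[len(prefix):] if name.startswith(prefix) else name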
    hsq = open(f"{item:nn}.hsq").read().strip().split()
    res.append([str(item),name,regions_dict[name][0], regions_dict[name][1], regions_dict[name][2], hsq[1], hsq[2], hsq[3]])
with open(_output, 'w') as f:
    f.write('\n'.join(['\t'.join(x) for x in res]))
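One thing to watch for when deriving a region name from a file name: Python's `str.lstrip` removes any leading characters that belong to the given set, not a literal prefix. The sketch below contrasts the two behaviours with a hypothetical file stem; `strip_prefix` is an illustrative helper.

```python
def strip_prefix(text, prefix):
    """Remove `prefix` from the start of `text` only if it is really there."""
    return text[len(prefix):] if text.startswith(prefix) else text

stem = "gene_exp_file.gene1"   # hypothetical '<phenotype file>.<region name>'
prefix = "gene_exp_file."

with_lstrip = stem.lstrip(prefix)          # strips characters in the set, eats 'gene'
with_prefix = strip_prefix(stem, prefix)   # strips the literal prefix only

print(with_lstrip)
print(with_prefix)
```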

## FIXME: please document

Comment: I customized a copy of `FUSION.assoc_test.R` to use the geno_file directory. This removes the need to get `geno_prefix`.

In [225]:
# Association test
[twas_fusion_5, association_test]
depends: weights_list

chrom_list = list(set(get_output(f'cut -f 3 {_input} | tail -n+2').strip().split("\n")))

input: gwas_sumstat, for_each = "chrom_list"
output:f'{wd:a}/{gwas_sumstat:bn}_chr{_chrom_list}.twas.txt'

geno_file = get_genotype_file(_chrom_list, genotype_list, geno_inventory)
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container
    FUSION.assoc_test.R \
    --sumstats $[_input] \
    --weights $[weights_list] \
    --weights_dir / \
    --ref_ld_chr $[geno_file] \
    --chr $[_chrom_list] \
    --out $[_output]
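The list of chromosomes to test is taken from the third column of the weights list. An equivalent pure-Python sketch, assuming a tab-separated file with the header written in the previous step (the rows shown are made up):

```python
import csv
import io

def chromosomes_in_weights_list(handle):
    """Return the sorted unique values of the CHR column in a weights list."""
    reader = csv.DictReader(handle, delimiter="\t")
    return sorted({row["CHR"] for row in reader})

# Hypothetical weights list content
example = io.StringIO(
    "WGT\tID\tCHR\tP0\tP1\tHSQ1\tHSQ2\tHSQ3\n"
    "/tmp/a.wgt.RDat\tGENE_A\t1\t1000000\t1050000\t0.12\t0.03\t0.001\n"
    "/tmp/b.wgt.RDat\tGENE_B\t22\t200000\t260000\t0.08\t0.02\t0.004\n"
    "/tmp/c.wgt.RDat\tGENE_C\t1\t3000000\t3040000\t0.10\t0.03\t0.002\n"
)
chroms = chromosomes_in_weights_list(example)
print(chroms)
```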

## FIXME: please document

comment: There is no need for `rm` because we can always remove whatever under `cache` altogether.

In [225]:
# Clean up dummy file
[twas_fusion_6]
input: group_by = "all"
output: f'{wd:a}/error_no_plink.log',
        f'{wd:a}/error_no_wgt_computed.log'
bash: expand= "$[ ]"
    find $[wd:a]/cache/*.bim -size 0 -print > $[_output[0]]
    find $[weights_path]/*.wgt.RDat -size 0 -print > $[_output[1]]
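The two `find` commands list zero-size placeholder files. The same check can be sketched in Python with `pathlib`; the helper name and the temporary files below are for illustration only.

```python
import tempfile
from pathlib import Path

def empty_files(directory, pattern):
    """Return the names of zero-size files in `directory` matching `pattern`."""
    return sorted(
        p.name for p in Path(directory).glob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "geneA.bim").write_text("1 rs1 0 100 A G\n")  # real content
    Path(tmp, "geneB.bim").touch()                          # empty placeholder
    Path(tmp, "geneB.fam").touch()                          # empty, other suffix
    found = empty_files(tmp, "*.bim")

print(found)
```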