# TWAS FUSION analysis

This notebook implements a TWAS analysis workflow using the FUSION software.

## Aim

The aim of this pipeline is to build a prediction model for gene expression using cis-SNPs by obtaining "weights" from multiple regression of gene expression levels on SNP genotypes. The weights can then be combined with the GWAS data to test for association between predicted gene expression and GWAS phenotypes. Either GWAS summary statistics or GWAS genotype + phenotype data can be used for the association testing. 

## Overview

__Objective__: 
    To compute the association between SNPs and gene expression for TWAS analysis.

__Background__:
    SNPs can modulate functional phenotypes both directly and by modulating the expression levels of genes. 
Therefore, integrating expression measurements with larger-scale GWAS summary association statistics helps identify the genes associated with the targeted complex traits. 

__Significance__:
    By applying this method, new candidate genes whose expression level is significantly associated with complex traits can be identified without measuring gene expression in every individual, which is expensive. A relatively small set of individuals with both gene expression and genotype data can be used to impute expression for a much larger set of phenotyped individuals from their SNP genotype data. 

__Method__:
    The imputed expression can then be viewed as a linear model of genotypes with __weights based on the correlation between SNPs and gene expression__ in the training data, while accounting for linkage disequilibrium (LD) among SNPs. The imputed gene expression is then correlated with the trait to perform a transcriptome-wide association study (TWAS) and identify significant expression-trait associations. 
 
The weights are computed via various models: blup, bslmm, lasso, top1, and enet. BLUP (best linear unbiased predictor) and BSLMM (Bayesian sparse linear mixed model) are fitted using GEMMA, lasso using PLINK, and enet (elastic net) using the `cv.glmnet` function in R.

Before the weights are calculated, the heritability of each gene is estimated using GCTA; genes without significant heritability are screened out.
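
To make the association step more concrete, here is a minimal sketch of the summary-statistic TWAS Z-score in the form used by FUSION, z = w'Z / sqrt(w' LD w), where w are the expression weights, Z the GWAS Z-scores, and LD the SNP correlation matrix. The function name and the numbers are hypothetical and for illustration only; the pipeline itself delegates this computation to `FUSION.assoc_test.R`.

```python
import math

def twas_zscore(weights, gwas_z, ld):
    """Combine per-SNP GWAS Z-scores into one gene-level Z-score.

    weights: expression weights for each cis-SNP (hypothetical values)
    gwas_z:  GWAS Z-scores for the same SNPs, in the same order
    ld:      SNP-by-SNP LD (correlation) matrix as a list of rows
    """
    n = len(weights)
    # Numerator: weighted sum of the GWAS Z-scores.
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    # Denominator: variance of the weighted sum under the LD structure.
    variance = sum(
        weights[i] * ld[i][j] * weights[j] for i in range(n) for j in range(n)
    )
    if variance <= 0:
        raise ValueError("w' * LD * w must be positive")
    return numerator / math.sqrt(variance)

# Toy example with three SNPs (made-up numbers).
weights = [0.5, -0.2, 0.1]
gwas_z = [2.0, -1.0, 0.5]
ld = [
    [1.0, 0.3, 0.0],
    [0.3, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]
z = twas_zscore(weights, gwas_z, ld)
print(f"TWAS Z-score: {z:.3f}")
```

With a single SNP the expression reduces to the GWAS Z-score of that SNP (up to the sign of the weight), which is a quick sanity check for the formula.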

## Pre-requisites

We provide a container image `docker://gaow/twas` that contains all software needed to run the pipeline. If you would like to configure it by yourself, please make sure you install the following software before running this notebook:
- GCTA
- PLINK
- GEMMA
- The modified `FUSION.compute_weights.R` script can be found [in this github repo](https://github.com/cumc/neuro-twas/blob/master/Workflow/FUSION.compute_weights.R).
- The original `FUSION.assoc_test.R` script can be found [in the author's github repo](https://github.com/gusevlab/fusion_twas).

You need to make both `FUSION.compute_weights.R` and `FUSION.assoc_test.R` executable with the `Rscript` command. See [this line]() for an example.

# Input and Output
## Input
- `--gwas_sumstat` The GWAS summary statistics text file documenting the associations between SNPs and the disease. It must contain at least four columns: the SNP rsID, the effect allele, the other allele, and the Z-score describing the association between the SNP and the disease. 
- `--genotype_list` An index text file with two columns: the chromosome and the corresponding PLINK bed file.
- `--molecular-pheno` The text file containing the molecular phenotype table. It must have regions (genes) as rows and samples as columns.
- `--region_list` The text file with 4 columns specifying the #Chr, P0 (start position), P1 (end position), and names of the regions to analyze. The column names are not important, but the column order is. The name of the first column must start with a `#`. The region list can be generated with another SoS pipeline, SOS_ROSMAP_gene_exp_processing.ipynb.
- `--window` The distance, in base pairs, by which each region is extended beyond its specified start and end positions to define the cis window of the gene. If the gene expression file has only one position column, set the window to a large number such as 5E5.
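
As a quick pre-flight check, the sketch below verifies that a summary statistics file has the four required columns before the workflow is launched. The header names `SNP`, `A1`, `A2`, and `Z` follow the usual FUSION convention but are an assumption here, as is the helper name; adjust them to match your file.

```python
import io

# Assumed header names for rsID, effect allele, other allele, and Z-score.
REQUIRED_COLUMNS = ("SNP", "A1", "A2", "Z")

def missing_sumstat_columns(handle, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent from the header line."""
    header = handle.readline().split()  # whitespace- or tab-delimited header
    return [name for name in required if name not in header]

# In-memory stand-in for a real file opened with open(path).
example = io.StringIO("SNP\tA1\tA2\tN\tZ\nrs1\tA\tG\t1000\t1.2\n")
missing = missing_sumstat_columns(example)
if missing:
    print("Missing columns:", ", ".join(missing))
else:
    print("All required columns are present")
```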





**FIXME: here is also some text I dug up later in your workflow code. Please explain it up front. We would not expect users to read this notebook beyond the "Working example" section.**
This probably serves as a self-reminder of what each file does and is therefore not meant for users to see, so maybe move it back?

```
1. The gene expression phenotype: a three-column table for each gene, with the first two columns specifying the family ID and within-family ID of the samples. In the current case, where all samples are unrelated, the first two columns are simply the sample ID. The third column is the actual gene expression value.
2. The PLINK trio of files (bed/bim/fam) for each gene, containing only the SNPs in the region whose expression is recorded. In particular, the SNPs are filtered to the genomic region defined by position +/- window.
```



## Output

- `.wgt.RDat` The computed weights for each region.
- `.weights_list.txt` The index text file recording information about each of the `.wgt.RDat` files, including the filename, corresponding region ID, chromosome, start and end position, heritability, and the SE and P-value of the heritability.
- `.twas.txt` The TWAS association results for the genes. A detailed description of this output is available here: http://gusevlab.org/projects/fusion/
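
The weights index is a plain tab-delimited table, so it is easy to inspect outside the pipeline. The sketch below groups weight files by chromosome, using the column names written by the workflow (`WGT`, `ID`, `CHR`, ...). The file paths and gene names in the example are made up.

```python
import csv
import io
from collections import defaultdict

def weights_by_chromosome(handle):
    """Group weight files by chromosome from a tab-delimited weights list."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        grouped[row["CHR"]].append(row["WGT"])
    return dict(grouped)

# Hypothetical two-gene index; real files are written by the pipeline.
example = io.StringIO(
    "WGT\tID\tCHR\tP0\tP1\tHeritability\tHeritability_SE\tHeritability_LRT_P_val\n"
    "/tmp/WEIGHTS/expr.GENE1.wgt.RDat\tGENE1\t1\t1000\t2000\t0.21\t0.05\t0.001\n"
    "/tmp/WEIGHTS/expr.GENE2.wgt.RDat\tGENE2\t2\t5000\t9000\t0.15\t0.06\t0.008\n"
)
groups = weights_by_chromosome(example)
for chrom, files in sorted(groups.items()):
    print(chrom, len(files))
```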


 








# Command interface 

In [2]:
!sos run twas_fusion.ipynb -h

usage: sos run twas_fusion.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  twas_fusion
  association_test

Global Workflow Options:
  --molecular-pheno VAL (as path, required)
                        Path to the input molecular phenotype data.
  --gwas-sumstat VAL (as path, required)
                        Path to GWAS summary statistics data (association
                        results between SNP and disease in a GWAS)
  --genotype-list VAL (as path, required)
                        An index text file with two columns of chromosome and
                        the corresponding PLINK bed file.
  --region-list VAL (as path, required)
                        An index text file 

# Working example
A minimal working example (MWE) dataset can be downloaded from the private repo:
https://github.com/cumc/neuro-twas/blob/master/TWAS_pipeline_MWE%202.zip
The genotype files can be downloaded from the following link:
https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2

**FIXME: please upload the data to synapse.org and take it out of github. On github we don't store large datasets. Please ask me about account information for synapse.org**

Running this MWE should take around 2 minutes. When applying the following command to a gene expression file other than this MWE, pay extra attention to the gene start and end positions. Also, when too few or too many genes pass the heritability check, consider increasing or decreasing the `--window` option.

In [8]:
## Test the pipeline with the test data
## Switch back to absolute paths, otherwise there will be a file-not-found error in step 5
sos run twas_fusion.ipynb twas_fusion \
  --gwas_sumstat ./data/sum_stat.sumstats \
  --molecular-pheno ./data/gene_exp_file.txt \
  --wd ./ \
  --genotype_list ./data/ld_ref_abs \
  --region_list ./data/region_list.txt \
  --region_name 1 \
  --data_start 5 \
  --window 500000 \
  --model blup lasso top1 enet \
  --container gaow/twas 
  
  


INFO: Running twas_fusion_1: 
INFO: twas_fusion_1 (index=0) is ignored due to saved signature
INFO: twas_fusion_1 (index=1) is ignored due to saved signature
INFO: twas_fusion_1 (index=2) is ignored due to saved signature
INFO: twas_fusion_1 (index=3) is ignored due to saved signature
INFO: twas_fusion_1 (index=4) is ignored due to saved signature
INFO: twas_fusion_1 (index=5) is ignored due to saved signature
INFO: twas_fusion_1 (index=6) is ignored due to saved signature
INFO: twas_fusion_1 (index=7) is ignored due to saved signature
INFO: twas_fusion_1 (index=8) is ignored due to saved signature
INFO: twas_fusion_1 (index=9) is ignored due to saved signature
INFO: twas_fusion_1 (index=10) is ignored due to saved signature
INFO: twas_fusion_1 (index=11) is ignored

## Association test only 
If using existing weights, use the association test (`association_test`) workflow. A minimal working example is shown below.

FIXME: please complete it.

In [9]:
sos run twas_fusion.ipynb association_test \
  --gwas_sumstat /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/Updated/data/sum_stat.sumstats \
  --molecular-pheno /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/Updated/data/gene_exp_file.txt \
  --wd /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/Updated/Working_at \
  --genotype_list /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/Updated/data/ld_ref \
  --region_list /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/Updated/data/region_list.txt \
  --region_name 1 \
  --data_start 5 \
  --output_path /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/Updated/At_output \
  --window 500000 \
  --model blup lasso top1 enet \
  --container gaow/twas


# Global parameter settings
This section outlines the parameters that can be set in the command interface.

In [5]:
[global]
# Path to the input molecular phenotype data.
parameter: molecular_pheno = path
# Path to GWAS summary statistics data (association results between SNP and disease in a GWAS)
parameter: gwas_sumstat = path
# An index text file with two columns of chromosome and the corresponding PLINK bed file.
parameter: genotype_list = path
# An index text file with 4 columns specifying the chr, start, end and names of regions to analyze
parameter: region_list = path
# Path to the work directory of the weight computation: output weights and cache will be saved to this directory.
parameter: wd = path('./')
# Specify the directory to save fitted weights
parameter: weights_path = f'{wd:a}/WEIGHTS'
# Path to list of weights
parameter: weights_list = f'{weights_path}/{molecular_pheno:bn}.weights_list.txt'
# Path to store the output folder
parameter: output_path = f'{wd:a}/result'
# Specify the column in the molecular_pheno file that contains the name of the region of interest
parameter: region_name = int
# Specify the column in the molecular_pheno file where the actual data start
parameter: data_start = int
# Specify the scanning window for the up and downstream radius to analyze around the region of interest, in base pairs
parameter: window = 50000
# Specify the number of jobs per run.
parameter: job_size = 2




# Specify which models are used to compute weights.
# Notice that `bslmm` can be very resource consuming.
parameter: model = ['bslmm', 'blup' , 'lasso', 'top1', 'enet']
# Container option for software to run the analysis: docker or singularity
parameter: container = 'gaow/twas'
# Get regions of interest to focus on.
regions = [x.strip().split() for x in open(region_list).readlines() if x.strip() and not x.strip().startswith('#')]

geno_inventory = dict([x.strip().split() for x in open(genotype_list).readlines() if x.strip() and not x.strip().startswith('#')])

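import os
def get_genotype_file(chrom, genotype_list, geno_inventory):
    # Look up the genotype file for a chromosome; relative paths are resolved against the directory of the genotype list.
    chrom = f'{chrom}'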
    if chrom.startswith('chr'):
        chrom = chrom[3:]
    if chrom not in geno_inventory:
        geno_file = f'{chrom}'
    else:
        geno_file = geno_inventory[chrom]
    if not os.path.isfile(geno_file):
        # relative path
        if not os.path.isfile(f'{genotype_list:ad}/' + geno_file):
            raise ValueError(f"Cannot find genotype file {geno_file}")
        else:
            geno_file = f'{genotype_list:ad}/' + geno_file
    return path(geno_file)

## Partition of the molecular phenotype for each gene
This step extracts the molecular phenotype for each gene and transposes it into the format needed in the follow-up analysis.
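
For readers less familiar with R, the transposition can be sketched in Python as follows: one row of the phenotype table (one gene, many samples) becomes a three-column table with one line per sample. The column layout of the example and the function name are hypothetical; `data_start` mirrors the `--data_start` option.

```python
def expression_row_to_pheno(sample_ids, row, data_start):
    """Turn one gene's row into (FID, IID, value) records.

    sample_ids: the header line of the phenotype table, split into fields
    row:        the line for one gene, split into fields
    data_start: 1-based column where the expression values begin
    """
    ids = sample_ids[data_start - 1:]
    values = row[data_start - 1:]
    if len(ids) != len(values):
        raise ValueError("header and row have different numbers of samples")
    # Samples are unrelated, so family ID and within-family ID are both the sample ID.
    return [(sid, sid, value) for sid, value in zip(ids, values)]

# Hypothetical layout: gene name in column 1, expression values from column 5.
header = ["gene_id", "chr", "start", "end", "S1", "S2", "S3"]
row = ["GENE1", "1", "1000", "2000", "0.12", "-0.40", "NA"]
pheno = expression_row_to_pheno(header, row, data_start=5)
for fid, iid, value in pheno:
    print(fid, iid, value, sep="\t")
```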

In [11]:
[twas_fusion_1]

input: molecular_pheno, for_each = "regions"
output: f'{wd:a}/cache/{_input:bn}.{_regions[3]}.exp',
        f'{wd:a}/cache/{_input:bn}.{_regions[3]}.pheno'
task: trunk_workers = 1, trunk_size = job_size, walltime = '4h',  mem = '6G', tags = f'{step_name}_{_output[0]:bn}'
R: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    # Get the line number for the region in the file
    line_num = system("awk '($$[region_name]==\"$[_regions[3]]\") {print NR}' $[_input]", intern=T)
    if (length(line_num) == 0){
      stop( "Cannot find $[_regions[3]] in column $[region_name]  $[_input]")}
    yi <- data.table::fread(file = $[_input:r], skip = as.integer(line_num) - 1, nrows = 1)
    samplenames_yi <- data.table::fread(file = $[_input:r], skip = 0, nrows = 1)
    colnames(yi) <- colnames(samplenames_yi)
    readr::write_tsv(yi, path = "$[_output[0]]", na = "NA", append = FALSE, col_names = TRUE, quote_escape = "double")
    yi <- as.data.frame(yi[, $[data_start]:ncol(yi), drop = FALSE])
    yj <- rbind(colnames(yi),colnames(yi),yi)
    readr::write_tsv(as.data.frame(t(yj)), path = "$[_output[1]]", na = "NA", append = FALSE, col_names = TRUE, quote_escape = "double")

## Construction of PLINK trio for each gene
This step constructs the PLINK files for each gene based on the output of previous steps. Specifically it:

1. Selects only the SNPs within the start and end position of the corresponding region (gene), extended by the window
2. Replaces the phenotype value (last column) of the .fam file based on the input
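
The SNP selection can be illustrated with a small sketch, assuming the cis window extends the region outward by `window` base pairs on both sides (position +/- window) and that both bounds are inclusive. The helper names and SNP records are hypothetical; in the pipeline the filtering is done by PLINK.

```python
def cis_window(start, end, window):
    """Return the (lower, upper) bounds of the cis window, clamped at 0."""
    lower = max(start - window, 0)
    upper = end + window
    return lower, upper

def snps_in_window(bim_records, chrom, start, end, window):
    """Select SNP IDs on `chrom` whose position falls inside the cis window.

    bim_records: iterable of (chromosome, snp_id, base_pair_position) tuples
    """
    lower, upper = cis_window(start, end, window)
    return [
        snp for c, snp, bp in bim_records
        if c == chrom and lower <= bp <= upper
    ]

# Made-up SNP positions around a gene at chr1:1,000,000-1,050,000.
records = [
    ("1", "rs1", 400_000),
    ("1", "rs2", 1_200_000),
    ("1", "rs3", 2_600_000),
    ("2", "rs4", 1_000_000),
]
selected = snps_in_window(records, "1", 1_000_000, 1_050_000, 500_000)
print(selected)
```

Clamping the lower bound at 0 avoids passing a negative position downstream for genes close to the start of a chromosome.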


In [225]:
[twas_fusion_2]
input: group_by = 2, group_with = 'regions'
output: f'{_input[0]:n}.bed',
        f'{_input[0]:n}.bim',
        f'{_input[0]:n}.fam'
# look up for genotype file
geno_file = get_genotype_file(_regions[0],genotype_list,geno_inventory)
task: trunk_workers = 1, trunk_size = job_size, walltime = '4h',  mem = '6G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, volumes = [f'{geno_file:ad}:{geno_file:ad}']
    ##### Get the locus genotypes for $[_regions[3]]
    plink --bfile $[geno_file:an] \
    --pheno $[_input[1]] \
    --make-bed \
    --out $[_output[0]:n] \
    --chr $[_regions[0]] \
    --from-bp $[max(int(_regions[1]) - window, 0)] \
    --to-bp $[int(_regions[2]) + window] \
    --keep $[_input[1]] \
    --allow-no-sex || true
    touch $[_output]

## Computation of weights
This step computes the association between the phenotype and each of the SNPs. Specifically it:
1. Estimates and documents the heritability (ratio of genetic variance to phenotypic variance) of each region (gene)

2. For those regions with significant heritability
    1. Computes the SNP weights with each of the specified models
    2. Stores the output in a `wgt.RDat` file for each region.
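
The heritability screen can be pictured with the sketch below. It assumes the `.hsq` file saved per gene holds one whitespace-separated line with an identifier, the heritability estimate, its standard error, and its P-value (the same field order the weight index step reads). The 0.01 cutoff is an assumed threshold for illustration; the actual decision is made inside `FUSION.compute_weights.R`.

```python
def parse_hsq(line):
    """Parse one line of a saved heritability file into a dictionary."""
    fields = line.split()
    return {
        "id": fields[0],
        "hsq": float(fields[1]),
        "se": float(fields[2]),
        "pval": float(fields[3]),
    }

def passes_heritability(record, max_pval=0.01):
    """Keep a gene only if its heritability P-value is below the cutoff.

    max_pval is an assumed threshold, not read from the pipeline.
    """
    return record["pval"] < max_pval

# Made-up heritability estimates for two genes.
kept = parse_hsq("GENE1 0.21 0.05 0.001")
dropped = parse_hsq("GENE2 0.02 0.04 0.31")
print(kept["id"], passes_heritability(kept))
print(dropped["id"], passes_heritability(dropped))
```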


In [225]:
#Actual weight computation 
[twas_fusion_3]
input: group_with = 'regions'
output: f'{weights_path}/{_input[0]:bn}.wgt.RDat'
import os
skip_if(os.path.getsize(f'{_input[0]}') == 0)
task: trunk_workers = 1, trunk_size = job_size, walltime = '4h',  mem = '6G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "$[ ]",stderr = f'{_output[0]:nn}.stderr', stdout = f'{_output[0]:nn}.stdout', container = container
    FUSION.compute_weights.R \
    --bfile $[_input[0]:n] \
    --tmp $[_input[0]:n].tmp \
    --out $[_output[0]:nn] \
    --verbose 0 \
    --save_hsq \
    --PATH_gcta `which gcta64` \
    --PATH_gemma `which gemma` \
    --models "$[",".join(model)]"   
    ## Create an empty placeholder output when no weights were computed; later steps skip empty files
    touch $[_output]

## Creation of weight index
This step creates an index documenting the information about each of the weight data, including the filename, corresponding region ID, Chromosomes, start and end position, heritability, and the SE and P-value of the heritability.

In [225]:
[twas_fusion_4]
input: group_by = "all"
output: weights_list
import os
weight_files = [x for x in _input if not os.path.getsize(x) == 0]
regions_dict = dict([(x[3], (x[0], x[1], x[2])) for x in regions])
res = [["WGT", "ID","CHR","P0","P1","Heritability","Heritability_SE","Heritability_LRT_P_val"]]
for item in weight_files:
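    # Remove the "<phenotype basename>." prefix to recover the region name.
    # lstrip() strips a set of characters rather than a prefix, so slice instead.
    prefix = f'{molecular_pheno:bn}.'
    name = f'{item:bnn}'
    name = name[len(prefix):] if name.startswith(prefix) else name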
    hsq = open(f"{item:nn}.hsq").read().strip().split()
    res.append([str(item),name,regions_dict[name][0], regions_dict[name][1], regions_dict[name][2], hsq[1], hsq[2], hsq[3]])
with open(_output, 'w') as f:
    f.write('\n'.join(['\t'.join(x) for x in res]))

## Association test
This step performs the association test by combining the GWAS associations between SNPs and the trait of interest with the SNP weights learned for the molecular phenotypes.

In [225]:
# Association test
[twas_fusion_5, association_test]
depends: weights_list
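# Read the chromosomes from the weights list itself, so this also works when the step runs as the standalone association_test workflow
chrom_list = list(set(get_output(f'cut -f 3 {weights_list} | tail -n+2').strip().split("\n")))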

input: gwas_sumstat, for_each = "chrom_list"
output:f'{output_path}/{gwas_sumstat:bn}_chr{_chrom_list}.twas.txt'

geno_file = get_genotype_file(_chrom_list, genotype_list, geno_inventory)
task: trunk_workers = 1, trunk_size = job_size, walltime = '4h',  mem = '6G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "$[ ]", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, volumes = [f'{geno_file:ad}:{geno_file:ad}']
    FUSION.assoc_test.R \
    --sumstats $[_input] \
    --weights $[weights_list] \
    --weights_dir / \
    --ref_ld_chr $[geno_file:an] \
    --chr $[_chrom_list] \
    --out $[_output]

## Documentation of dummy files
For the pipeline to run smoothly, empty dummy files are created in previous steps. This step finds all the dummy files and records them in separate log files.

In [225]:
# Clean up dummy file
[twas_fusion_6]
input: group_by = "all"
output: f'{wd:a}/error_no_plink.log',
        f'{wd:a}/error_no_wgt_computed.log'
bash: expand= "$[ ]"
    find $[wd:a]/cache/*.bim -size 0 -print > $[_output[0]]
    find $[weights_path]/*.wgt.RDat -size 0 -print > $[_output[1]]


