# Expression weight computation
## Aim

The aim of this pipeline is to compute TWAS weights, which capture the association between SNP genotypes and gene expression levels. The weights can then be combined with GWAS summary statistics for a phenotype to compute expression-trait association statistics between the genes and that phenotype.

## Overview

This pipeline computes the association between gene expression and SNPs for TWAS analysis.
SNPs can modulate functional phenotypes both directly and by modulating the expression levels of genes. Integrating expression measurements with larger-scale GWAS summary association statistics is therefore desirable for identifying the genes associated with the targeted complex traits.

With this method, new candidate genes whose expression levels are significantly associated with complex traits can be identified without going through the expensive process of measuring gene expression in every individual, because a relatively small set of gene expression and genotype data can be used to impute expression for a much larger set of phenotyped individuals from their SNP genotype data.

The imputed expression can then be viewed as a linear model of genotypes with __weights based on the correlation between SNPs and gene expression__ in the training data, while accounting for linkage disequilibrium (LD) among SNPs. The imputed gene expression is then correlated with the trait to perform a transcriptome-wide association study (TWAS) and identify __significant expression-trait associations__.
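As a minimal sketch of the linear model described above, the snippet below imputes expression as a weighted sum of allele dosages and combines the same weights with GWAS z-scores and an LD matrix into a single TWAS z-score, using the commonly cited form w'z / sqrt(w' LD w). The function names and all numeric values are made up for illustration and are not taken from the pipeline.

```python
import math


def impute_expression(genotypes, weights):
    """Predicted expression for one individual: sum of weight * allele dosage."""
    return sum(w * g for w, g in zip(weights, genotypes))


def twas_z(weights, gwas_z, ld):
    """TWAS z-score computed as w'z / sqrt(w' LD w)."""
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    variance = sum(
        weights[i] * ld[i][j] * weights[j]
        for i in range(len(weights))
        for j in range(len(weights))
    )
    return numerator / math.sqrt(variance)


# Toy values for two SNPs, made up for illustration only.
weights = [0.5, -0.2]
gwas_z = [2.0, 1.0]
ld = [[1.0, 0.3], [0.3, 1.0]]

print(impute_expression([2, 1], weights))
print(round(twas_z(weights, gwas_z, ld), 3))
```

The LD matrix in the denominator is what keeps correlated SNPs from being counted as independent evidence.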
 
The weights are computed via various models: blup, bslmm, lasso, top1 and enet. blup (best linear unbiased predictor) and bslmm (Bayesian sparse linear mixed model) are fitted using GEMMA, lasso using PLINK, and enet (elastic net) using the cv.glmnet function in R.

Before the weights are calculated, the heritability of each gene is estimated using GCTA, and genes without significant heritability are screened out.
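The screening idea can be sketched as a simple filter on heritability p-values. The column order assumed here (identifier, heritability estimate, standard error, p-value) and the 0.01 threshold are assumptions made for illustration; check the .hsq files produced by your own run before relying on them.

```python
def passes_heritability(hsq_line, p_threshold=0.01):
    """Return True if the gene's heritability p-value is below the threshold."""
    fields = hsq_line.split()
    # Assumed column order: identifier, h2 estimate, standard error, p-value.
    p_value = float(fields[3])
    return p_value < p_threshold


# Made-up example lines, not real results.
lines = [
    "geneA 0.25 0.08 0.001",
    "geneB 0.02 0.07 0.40",
]
kept = [line.split()[0] for line in lines if passes_heritability(line)]
print(kept)
```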

## Pre-requisites
Make sure the following software is installed before running this notebook:

GCTA (gcta_1.93.2beta_mac)

PLINK (plink_mac_20200616)

GEMMA

The modified FUSION.compute_weights.R script, which can be downloaded from this GitHub repo.




# Input and Output
## Input
--gene_exp_file, a gene expression table with genes as rows and samples as columns. Each gene also requires at least one column specifying the chr and pos (or, alternatively, the start and end positions). The chr column must use the same format as the chromosomes in the genotype files, and the sample names must be the same as the sample IDs in the genotype files.

--geno-path, the path of a genotype inventory, which lists the paths of all genotype files in bgen format or in plink format.

--genotype_file_directory, the path to the directory containing the genotype files in plink format.
--genotype_prefix, the prefix of the genotype files, up to the chromosome name.

--window, the distance by which the cis region of a gene extends beyond the specified start and end sites. If the gene expression file has only one position column, set the window to a large number such as 5E5.
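As a rough sketch of how the window translates into a genomic region, the snippet below mirrors the start - window and end + window arithmetic used later in the workflow. The function name `cis_window` and the clamping of the lower bound at 0 are illustrative assumptions; the workflow itself subtracts the window without clamping.

```python
def cis_window(start, end, window):
    """Return the (lower, upper) base-pair bounds of the cis region."""
    lower = max(0, start - window)  # clamp is an illustrative addition
    upper = end + window
    return lower, upper


# A gene with a single position column uses the same value for start and end.
print(cis_window(1_000_000, 1_000_000, 500_000))
# A gene close to the start of the chromosome.
print(cis_window(100, 200, 500))
```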

## Output

-- .wgt.RDat, the computed weight data

-- .hsq, the file containing the heritability information for a gene

-- All_passed_gene.hsq, the file containing the heritability information for all the genes in this run


 








# Command interface 

In [39]:
ls
sos run SOS_weight_cpt_template.ipynb -h

FUSION.compute_weights.R      SOS_weight_cpt_template.ipynb
PRE_GEXPID                    Untitled.ipynb
usage: sos run SOS_weight_cpt_template.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  STEP

Global Workflow Options:
  --GCTA '/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/gcta_1.93.2beta_mac/gcta64'
                        MAKE SURE FUSION.compute_weights.R IS IN YOUR PATH FILL
                        IN THESE PATHS For mac user, the mac version of GCTA
                        shall be downloaded saperately, the one came with the
                        Fusion package will not work.
  --PLINK '/Users/haosun/Documents/WG_Re

# Working example
A minimal working example (MWE) dataset can be downloaded from the private repo:
    https://github.com/cumc/neuro-twas/blob/master/WIP/GD462.hsq_succ.test.txt 
    The genotype_file_directory data can be downloaded from the official FUSION website.


In [85]:
## Test pipeline with test data

sos run /Users/haosun/Documents/WG_Reasearch_Assisstant/GIT/twas-dev/Workflow/SOS_weight_cpt_template.ipynb \
  --GCTA "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/gcta_1.93.2beta_mac/gcta64" \
  --PLINK `which plink` \
  --GEMMA `which gemma` \
  --compute_weight_rscp  "/Users/haosun/Documents/WG_Reasearch_Assisstant/GIT/twas-dev/Workflow/FUSION.compute_weights.R" \
  --assoc_test_rscp "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/FUSION.assoc_test.R"\
  --sumstat_file "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/test/PGC2.SCZ.sumstats"\
  --gene_exp_file "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/test/GD462.hsq_succ.test.txt" \
  --wd  "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/test/Working1" \
  --genotype_file_directory  "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/test/LDREF" \
  --genotype_prefix "1000G.EUR" \
  --chrom 3 \
  --gene_name 1 \
  --gene_start 4 \
  --gene_end 3 \
  --window 500000

INFO: Running STEP_1: Preparing the phenotype files
INFO: STEP_1 (index=0) is ignored due to saved signature
INFO: STEP_1 (index=1) is ignored due to saved signature
INFO: STEP_1 (index=2) is ignored due to saved signature
INFO: STEP_1 (index=3) is ignored due to saved signature
INFO: STEP_1 (index=4) is ignored due to saved signature
INFO: STEP_1 (index=5) is ignored due to saved signature
INFO: STEP_1 (index=6) is ignored due to saved signature
INFO: STEP_1 (index=7) is ignored due to saved signature
INFO: STEP_1 (index=8) is ignored due to saved signature
INFO: STEP_1 (index=9) is ignored due to saved signature
INFO: STEP_1 (index=10) is ignored due to saved signature
INFO: STEP_1 (index=11) is ignored due to saved signature
INFO: STEP_1 (index=12) is 

## Association test only 
If using existing weights, use the association test (AssocTest) workflow. A working example is shown below.

In [100]:


sos run /Users/haosun/Documents/WG_Reasearch_Assisstant/GIT/twas-dev/Workflow/SOS_weight_cpt_template.ipynb AssocTest \
  --GCTA "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/gcta_1.93.2beta_mac/gcta64" \
  --PLINK `which plink` \
  --GEMMA `which gemma` \
  --compute_weight_rscp  "/Users/haosun/Documents/WG_Reasearch_Assisstant/GIT/twas-dev/Workflow/FUSION.compute_weights.R" \
  --assoc_test_rscp "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/FUSION.assoc_test.R"\
  --sumstat_file "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/test/PGC2.SCZ.sumstats"\
  --gene_exp_file "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/test/GD462.hsq_succ.test.txt" \
  --wd  "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/test/Working0" \
  --genotype_file_directory  "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/test/LDREF" \
  --genotype_prefix "1000G.EUR" \
  --chrom 3 \
  --gene_name 1 \
  --gene_start 4 \
  --gene_end 3 \
  --window 500000

INFO: Running AssocTest: Association test
3
/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/test/Working0/WEIGHTS/
Rscript /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/FUSION.assoc_test.R \
--sumstats /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/test/PGC2.SCZ.sumstats \
--weights ./GD462.hsq_succ.test.pos \
--weights_dir ./ \
--ref_ld_chr /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/test/LDREF/1000G.EUR. \
--chr 3 \
--out /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/test/Working0/result/GD462.hsq_succ.test_3.dat 
21
/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/test/Working0/WEIGHTS/
Rscript /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/FUSION.assoc_test.R \
--sumstats /Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/Project/test/PGC2.SCZ.sumstats \
--weights ./GD462.hsq_succ.test.pos \
--weights_dir ./ \
--ref_ld_

# Global parameter settings
This section outlines the parameters that can be set in the command interface.

In [98]:
[global]
# MAKE SURE FUSION.compute_weights.R IS IN YOUR PATH
# FILL IN THESE PATHS
# For Mac users, the Mac version of GCTA must be downloaded separately; the one that comes with the FUSION package will not work.
parameter: GCTA = path
parameter: PLINK = path
parameter: GEMMA = path

# Requires the customized fusion.compute_weight.mod.R script; otherwise the workflow will not work
parameter: compute_weight_rscp = path

# The R script from FUSION that conducts the association test is required
parameter: assoc_test_rscp = path



# Path to the input data; must include the name of the file itself
# (phenotype data: the gene expression table)
parameter: gene_exp_file = path
# (sumstats data: the GWAS results between SNPs and the disease)
parameter: sumstat_file = path

# Path to where the weights will be stored, or are already stored
parameter: wd = path

# Path to where the output of the association test will be stored; by default, the result subdirectory of wd
parameter: assoc_test_result = f'{wd}/result'

# PATH TO DIRECTORY CONTAINING genotype_file_directory DATA (FROM FUSION WEBSITE or https://data.broadinstitute.org/alkesgroup/FUSION/genotype_file_directory.tar.bz2)
parameter: genotype_file_directory = path
# THIS IS USED TO RESTRICT INPUT SNPS TO REFERENCE IDS ONLY
# GEUVADIS DATA WAS DOWNLOADED FROM https://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/files/analysis_results/

# SUBSAMPLE THESE TO THE genotype_file_directory SNPS FOR EFFICIENCY
parameter: genotype_prefix = path

# Column in the gene expression file that contains the chromosome
parameter: chrom = 3
# If both the start and end positions are given, their columns can be specified separately
parameter: gene_start= int
parameter: gene_end = int

# Column in the gene expression file that contains the name of the gene
parameter: gene_name = 1

# Scanning window around the gene position (default 50000); needed in particular when start = end
parameter: window = 50000


# Get the gene information (the first four columns) from the gene expression file
data = list(set([tuple(x.strip().split()) for x in open(gene_exp_file).readlines()[1:] if x.strip()]))
geneinfo = [item[0:4] for item in data]
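For readers who want to inspect the gene annotation outside SoS, the following is a small sketch of the same idea: skip the header, keep the first four columns of every row, and drop duplicates. Unlike the `set`-based one-liner above, this version keeps the rows in file order and closes the file explicitly. The function name, the header names and the two example rows are hypothetical.

```python
import os
import tempfile


def read_geneinfo(path, n_meta=4):
    """Return the first n_meta columns of each data row, skipping the header."""
    rows = []
    seen = set()
    with open(path) as handle:
        next(handle, None)  # skip the header line
        for line in handle:
            fields = tuple(line.split())
            if not fields:
                continue  # ignore blank lines
            meta = fields[:n_meta]
            if meta not in seen:  # drop duplicated rows, keep file order
                seen.add(meta)
                rows.append(meta)
    return rows


# Tiny made-up table: four annotation columns followed by two samples.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("TargetID\tGene_Symbol\tChr\tCoord\tS1\tS2\n")
    tmp.write("geneA\tgeneA\t3\t1000\t0.1\t0.2\n")
    tmp.write("geneB\tgeneB\t21\t2000\t0.3\t0.4\n")
example_rows = read_geneinfo(tmp.name)
os.remove(tmp.name)
print(example_rows)
```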


# Actual pipeline
## Data preparation
This section prepares two inputs for the actual computation.
1. The gene expression phenotype: a three-column table for each gene, in which the first two columns specify the family ID and within-family ID of the samples. In the current case, where all samples are unrelated, the first two columns are both simply the sample ID. The third column is the actual gene expression value.

2. The PLINK file trio (.bed, .bim, .fam) for each gene, containing only the SNPs in the region around the gene whose expression is recorded. In particular, the SNPs are filtered to the genomic region given by position +/- window.
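To make the region filter in item 2 concrete, here is a small sketch that selects SNPs from .bim-style lines (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2). In the workflow this selection is done by PLINK itself; the function name and the SNP records below are made up for illustration.

```python
def snps_in_region(bim_lines, chrom, lower, upper):
    """Return IDs of SNPs on `chrom` with lower <= position <= upper."""
    selected = []
    for line in bim_lines:
        # .bim columns: chromosome, variant ID, cM, bp position, allele 1, allele 2
        fields = line.split()
        if fields[0] == str(chrom) and lower <= int(fields[3]) <= upper:
            selected.append(fields[1])
    return selected


# Made-up SNP records.
bim = [
    "3 rs1 0 400000 A G",
    "3 rs2 0 900000 C T",
    "3 rs3 0 1700000 G A",
    "21 rs4 0 900000 T C",
]
print(snps_in_region(bim, 3, 500000, 1500000))
```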

In [103]:
# Make the patient ID file
[STEP_1]
output: f'{wd}/WEIGHTS/{gene_exp_file:bn}.pos',
        f'{wd}/PRE_GEXPID'

bash: expand = "$[ ]"
    cd $[wd]
    echo $[_output]
    echo -e "WGT\tID\tCHR\tP0\tP1" > $[_output[0]]
    # Extract all the patient names
    head -1 $[gene_exp_file] | awk '{$1=$2=$3=$4=""; print substr($0,4)}' | fmt -1 > $[_output[1]]
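The `head | awk | fmt` pipeline above amounts to dropping the first four header fields and writing the remaining sample names one per line. A Python sketch of the same operation, assuming four annotation columns as in the step above (the function name and header are illustrative):

```python
def sample_ids(header_line, n_meta=4):
    """Return the sample names that follow the annotation columns in the header."""
    return header_line.split()[n_meta:]


# Made-up header line.
header = "TargetID\tGene_Symbol\tChr\tCoord\tS1\tS2\tS3"
ids = sample_ids(header)
print("\n".join(ids))
```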



In [32]:
# Preparing the phenotype files 
[STEP_2]
input: gene_exp_file, for_each = "geneinfo"
output: f'{wd}/{_input:nb}_per_gene/{_input:nb}.{_geneinfo[gene_name-1]}.txt',
        f'{wd}/tmp/{_input:nb}.{_geneinfo[gene_name-1]}.pheno'

bash: expand= "$[ ]", stderr = f'{_output[1]:n}.stderr', stdout = f'{_output[1]:n}.stdout'
    cd $[wd]
    # Echo the gene name as a sanity check
    echo $[_geneinfo[gene_name-1]]
    echo $[_input]
    echo $[_output[0]]
    echo "end of note"
    touch $[_output[0]]
    touch $[_output[1]]
    # For every gene, extract the corresponding expression levels
    grep $[_geneinfo[gene_name-1]] $[_input] > $[_output[0]:n].txt
    # Transpose the expression
    cat $[_output[0]] | tr '\t' '\n' | tail -n+5 > $[_output[1]].tmp
    # Check that the patient name file is intact
    head -3 $[wd]/PRE_GEXPID
    # Combine the expression values and patient names to create the phenotype file
    paste  $[wd]/PRE_GEXPID  $[wd]/PRE_GEXPID $[_output[1]].tmp > $[_output[1]:n].pheno
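The transpose-and-paste logic of this step can be sketched in Python as follows. The length check is an addition that the bash version does not perform (`paste` would silently misalign the columns); the function name and example values are hypothetical.

```python
def pheno_lines(sample_ids, expression_row, n_meta=4):
    """Build FID, IID, value lines for one gene from its row in the expression table."""
    values = expression_row.split()[n_meta:]
    if len(values) != len(sample_ids):
        raise ValueError("number of expression values does not match number of samples")
    # Samples are unrelated, so the sample ID serves as both family ID and within-family ID.
    return [f"{sid}\t{sid}\t{val}" for sid, val in zip(sample_ids, values)]


# Made-up gene row with four annotation columns and two samples.
example_pheno = pheno_lines(["S1", "S2"], "geneA\tgeneA\t3\t1000\t0.1\t0.2")
print("\n".join(example_pheno))
```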
    

In [86]:
# Preparing the genotype files for each gene
[STEP_3]
input:  group_by = 2
        
output: f'{_input[1]:n}.bed',
        f'{_input[1]:n}.bim',
        f'{_input[1]:n}.fam'

bash: expand= "$[ ]", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout',allow_error=True
    ##### Get the locus genotypes for each samples
    #echo $[_input[1]:n]
    #echo $[_output[1]:n]
    # Chromosomes, gene names, and start and end side are acquired by the snippet to avoid 
    $[PLINK] --bfile $[genotype_file_directory]/$[genotype_prefix].`cat $[_input[0]] | awk '{ print $$[chrom] }'` \
    --pheno $[_input[1]] \
    --make-bed \
    --out $[_output[1]:n] \
    --chr `cat $[_input[0]] | awk '{ print $$[chrom] }'` \
    --from-bp `cat $[_input[0]] | awk '{ print $$[gene_start] - $[window] }'` \
    --to-bp `cat $[_input[0]] | awk '{ print $$[gene_end] + $[window] }'` \
    --extract $[genotype_file_directory]/$[genotype_prefix].`cat $[_input[0]] | awk '{ print $$[chrom] }'`.bim \
    --keep $[_input[1]] \
    --allow-no-sex
    touch $[_input[1]:n].bed $[_input[1]:n].bim $[_input[1]:n].fam
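The PLINK call in this step can also be assembled as an argument list, which avoids re-running `cat | awk` for every value. This is only a sketch: it uses the same flags as the step above, but the function name and all paths are hypothetical, and the list is printed rather than executed.

```python
def plink_extract_args(plink, bfile, pheno, out, chrom, start, end, window):
    """Assemble the PLINK arguments that extract the cis region for one gene."""
    return [
        str(plink),
        "--bfile", str(bfile),
        "--pheno", str(pheno),
        "--make-bed",
        "--out", str(out),
        "--chr", str(chrom),
        "--from-bp", str(start - window),
        "--to-bp", str(end + window),
        "--extract", f"{bfile}.bim",
        "--keep", str(pheno),
        "--allow-no-sex",
    ]


# Hypothetical paths and coordinates.
args = plink_extract_args(
    "plink", "LDREF/1000G.EUR.3", "tmp/expr.geneA.pheno", "tmp/expr.geneA",
    chrom=3, start=1_000_000, end=1_000_000, window=500_000,
)
print(" ".join(args))
```

A list like this could be handed to `subprocess.run(args)` once the paths point at real files.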

In [104]:
#Actual weight computation 
[STEP_4]
import os
input: group_by = 3
output: f'{wd}/WEIGHTS/{_input[0]:bn}.wgt.RDat'
skip_if(os.path.getsize(f'{_input[0]}') == 0)

bash: expand= "$[ ]",stderr = f'{_output[0]:nn}.stderr', stdout = f'{_output[0]:nn}.stdout', active = (os.path.getsize(f'{_input[0]}') != 0)
    cd $[wd]
    Rscript $[compute_weight_rscp] \
    --bfile $[_input[1]:n] \
    --tmp $[_input[1]:n].tmp \
    --out ./WEIGHTS/$[_output[0]:bnn] \
    --verbose 0 \
    --save_hsq \
    --PATH_gcta $[GCTA] \
    --PATH_gemma $[GEMMA] \
    --models bslmm,blup,lasso,top1,enet
    
    
    ## Create a dummy output file that will be deleted in the clean-up step
    touch $[_output[0]]
    ## Append heritability output to hsq file
    cat ./WEIGHTS/$[_output[0]:bnn].hsq >> All_passed_gene.hsq
    echo "end of circle"
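The path suffixes such as `:bn`, `:nn` and `:bnn` used in this step are SoS format specifiers. As far as I understand them, `b` takes the base name and each `n` drops one trailing extension. The helper below only emulates that behaviour with `os.path` for illustration; it is not part of SoS, and the example path is hypothetical.

```python
import os


def strip_extensions(path, count, basename=False):
    """Remove `count` trailing extensions, optionally dropping the directory first."""
    name = os.path.basename(path) if basename else path
    for _ in range(count):
        name = os.path.splitext(name)[0]
    return name


weight_file = "/wd/WEIGHTS/expr.geneA.wgt.RDat"  # hypothetical path
print(strip_extensions(weight_file, 1, basename=True))  # analogous to :bn
print(strip_extensions(weight_file, 2))                 # analogous to :nn
print(strip_extensions(weight_file, 2, basename=True))  # analogous to :bnn
```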



In [73]:
# Create the wgt.RDat position map for the association test
[STEP_5]
import os
input: group_by = 1
skip_if(os.path.getsize(f'{_input[0]}') == 0)

bash: expand= "$[ ]", active = (os.path.getsize(f'{_input[0]}') != 0)
    cd $[wd]
    #$[_input[0]:b]_info= `$[wd]/$[_input[0]:nnnn]_per_gene/$[_input[0]:bnn].txt`
    echo $[_input[0]:b] \
    `awk '{ print $$[gene_name] }' $[wd]/$[_input[0]:bnnnn]_per_gene/$[_input[0]:bnn].txt` \
    `awk '{ print $$[chrom] }' $[wd]/$[_input[0]:bnnnn]_per_gene/$[_input[0]:bnn].txt` \
    `awk '{ print $$[gene_start] - $[window] }' $[wd]/$[_input[0]:bnnnn]_per_gene/$[_input[0]:bnn].txt` \
    `awk '{ print $$[gene_end] + $[window] }' $[wd]/$[_input[0]:bnnnn]_per_gene/$[_input[0]:bnn].txt` >> $[_input[0]:nnnn].pos
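Each line appended to the .pos file follows the header written in STEP_1 (WGT, ID, CHR, P0, P1). The sketch below shows how one such row is composed; the function name and values are illustrative, and the row is space-separated as in the `echo` command above.

```python
def pos_row(weight_file, gene_id, chrom, start, end, window):
    """Compose one row of the .pos map: weight file, gene ID, chromosome, P0, P1."""
    fields = [weight_file, gene_id, str(chrom), str(start - window), str(end + window)]
    return " ".join(fields)


# Hypothetical gene with a single position column (start equals end).
row = pos_row("expr.geneA.wgt.RDat", "geneA", 3, 1_000_000, 1_000_000, 500_000)
print(row)
```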


In [90]:
# Association test
[STEP_6,AssocTest]
from pathlib import Path
pos = list(set([tuple(x.strip().split()) for x in open(f'{wd}/WEIGHTS/{Path(gene_exp_file).stem}.pos').readlines()[1:] if x.strip()]))
chrom_list = list(set([item[2] for item in pos]))
chrom_list

input: sumstat_file,for_each = "chrom_list"
output:f'{assoc_test_result}/{gene_exp_file:bn}_{_chrom_list}.dat'

bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
    echo $[_chrom_list]
    echo $[wd]/WEIGHTS/
    echo 'Rscript $[assoc_test_rscp] \
    --sumstats $[_input] \
    --weights ./$[gene_exp_file:bn].pos \
    --weights_dir ./ \
    --ref_ld_chr $[genotype_file_directory]/$[genotype_prefix]. \
    --chr $[_chrom_list] \
    --out $[_output] '
    
    cd $[wd]/WEIGHTS/
    Rscript $[assoc_test_rscp] \
    --sumstats $[_input] \
    --weights ./$[gene_exp_file:bn].pos \
    --weights_dir ./ \
    --ref_ld_chr $[genotype_file_directory]/$[genotype_prefix]. \
    --chr $[_chrom_list] \
    --out $[_output]
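The association test runs once per chromosome found in the .pos file. A small sketch of collecting those chromosomes in a stable order is given below; the step above uses a plain `set`, whose order is arbitrary. The function name and the example rows are hypothetical.

```python
def chromosomes_in_pos(pos_lines):
    """Return the unique chromosomes in a .pos table, numeric ones first in numeric order."""
    chroms = {line.split()[2] for line in pos_lines[1:] if line.strip()}
    return sorted(
        chroms,
        key=lambda c: (not c.isdigit(), int(c) if c.isdigit() else 0, c),
    )


# Made-up .pos content; the first line is the header.
pos_table = [
    "WGT ID CHR P0 P1",
    "a.wgt.RDat geneA 21 100 200",
    "b.wgt.RDat geneB 3 100 200",
    "c.wgt.RDat geneC 3 500 600",
    "d.wgt.RDat geneD X 100 200",
]
print(chromosomes_in_pos(pos_table))
```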
    



In [8]:
# Clean up dummy file
[STEP_7]
input: group_by = "all"
output: f'{wd}/error_gene/no_plink.txt',
        f'{wd}/error_gene/no_wgt_computed.txt'
bash: expand= "$[ ]",allow_error=True
    cd $[wd]
    find ./tmp/*.bim -size 0 -print > $[_output[0]]
    rm `find ./tmp/*.bim -size 0`
    rm `find ./tmp/*.fam -size 0`
    rm `find ./tmp/*.bed -size 0`
    find ./WEIGHTS/*.wgt.RDat -size 0 -print > $[_output[1]]
    rm `find ./WEIGHTS/*.wgt.RDat -size 0`
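One caveat of the `rm` with backquoted `find` pattern is that `rm` is called without arguments when no empty file exists, which is presumably why the step sets allow_error. As a hedged alternative, the clean-up could be written in Python; the function name is hypothetical and the demonstration runs in a temporary directory rather than in the real working directory.

```python
import tempfile
from pathlib import Path


def remove_empty(directory, pattern):
    """Delete zero-size files matching `pattern` and return their names."""
    removed = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file() and path.stat().st_size == 0:
            removed.append(path.name)
            path.unlink()
    return removed


# Demonstration on throwaway files.
with tempfile.TemporaryDirectory() as scratch:
    Path(scratch, "empty.bim").touch()
    Path(scratch, "full.bim").write_text("3 rs1 0 100 A G\n")
    removed = remove_empty(scratch, "*.bim")
    remaining = sorted(p.name for p in Path(scratch).iterdir())
print(removed, remaining)
```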
    





