# Expression weight computation
## Aim
To compute the association between gene expression and SNPs, i.e. the expression weights used in TWAS analysis.
## Pre-requisites
Make sure you install the following software before running this notebook:
- GCTA (gcta_1.93.2beta_mac)
- PLINK (plink_mac_20200616)
- GEMMA
- The modified FUSION.compute_weights.R script (FUSION.compute_weights.mod.R), downloaded from this GitHub repo.
## Note
Possibly due to parallel tasking issues, the first run on a dataset will likely skip some random genes and produce error messages. Simply rerun the script until no error messages appear; the desired output will eventually be produced.
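
One way to decide whether another rerun is needed is to list the genes that still have no weight file. The sketch below is only an illustration: the helper name `genes_without_weights` is hypothetical, and it assumes each weight file is named `<prefix>.<gene>.wgt.RDat` inside the `WEIGHTS` folder, which mirrors the output pattern of the weight computation step but should be checked against your own run. A gene may also lack a weight file for reasons other than a skipped task, for example if it did not pass the heritability check.

```python
import tempfile
from pathlib import Path


def genes_without_weights(genes, weights_dir, prefix):
    """Return the genes that have no weight file in weights_dir.

    Assumption (hypothetical naming): weight files are called
    <prefix>.<gene>.wgt.RDat, e.g. GD462.hsq_succ.test.<gene>.wgt.RDat.
    """
    weights_dir = Path(weights_dir)
    return [
        gene
        for gene in genes
        if not (weights_dir / f"{prefix}.{gene}.wgt.RDat").is_file()
    ]


# Small self-contained demonstration in a temporary directory (toy gene names).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "toy.GENE_A.wgt.RDat").touch()
    print(genes_without_weights(["GENE_A", "GENE_B"], tmp, "toy"))
```

If the returned list stays the same across two consecutive reruns, the remaining genes are probably missing for a reason other than the parallel tasking issue.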


# Input and Output
## Input
--Datafile, a gene expression table with genes as rows and samples as columns. Each gene also requires at least one column specifying the chromosome and one specifying the position (or, alternatively, separate start and end positions). The chromosome column must use the same format as the chromosome names in the genotype files, and the sample names must match the sample IDs in the genotype files.

--geno-path, the path to a genotype inventory, which lists the paths of all genotype files in BGEN or PLINK format.

--LDREF, the path to the directory containing the LD reference genotype files in PLINK format.
--PRE_GENO, the prefix of the genotype files, up to the chromosome name.

--window, the number of base pairs by which the region is extended beyond the specified start and end sites. If the gene expression table has only one position column, set the window to a large number such as 5E5.
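
The expected layout of the expression table can be illustrated with a small parser. This is a minimal sketch under stated assumptions, not part of the pipeline: the function names are hypothetical, the table is assumed to be whitespace-delimited with four annotation columns before the first sample, and the column numbers follow the defaults of the `gname`, `chrm` and `pos` parameters. The toy values are made up.

```python
def parse_expression_table(lines, n_annot=4, gname=1, chrm=3, pos=4):
    """Parse a whitespace-delimited expression table given as a list of lines.

    Column numbers are 1-based, matching the gname/chrm/pos workflow parameters.
    n_annot is the assumed number of annotation columns before the first sample.
    """
    samples = lines[0].split()[n_annot:]
    genes = {}
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        genes[fields[gname - 1]] = {
            "chr": fields[chrm - 1],
            "pos": int(fields[pos - 1]),
            "expr": [float(v) for v in fields[n_annot:]],
        }
    return samples, genes


def unmatched_samples(samples, genotype_ids):
    """Return expression samples that have no matching genotype sample ID."""
    known = set(genotype_ids)
    return [s for s in samples if s not in known]


# Toy table (made-up values): 4 annotation columns followed by 3 samples.
toy = [
    "TargetID Gene_Symbol Chr Coord S1 S2 S3",
    "GENE_A GENE_A 1 1000000 0.5 -1.2 0.3",
    "GENE_B GENE_B 2 2500000 1.1 0.0 -0.7",
]
samples, genes = parse_expression_table(toy)
print(samples)
print(genes["GENE_B"]["chr"], genes["GENE_B"]["pos"])
print(unmatched_samples(samples, ["S1", "S3", "S4"]))
```

Running a check like `unmatched_samples` before the pipeline makes a mismatch between expression sample names and genotype sample IDs visible early.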

## Output

-- .wgt.RDat, the computed weight data

-- .hsq, the file containing the heritability information for a gene

-- .All_passed_gene.hsq, the file containing the heritability information for all the genes in this run

# Command interface 

In [10]:
sos run SOS_weight_cpt_template.ipynb -h

usage: sos run SOS_weight_cpt_template.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  STEP

Global Workflow Options:
  --GCTA '/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/gcta_1.93.2beta_mac/gcta64'
                        !/bin/sh MAKE SURE FUSION.compute_weights.R IS IN YOUR
                        PATH FILL IN THESE PATHS For mac user, the mac version
                        of GCTA shall be downloaded saperately, the one came
                        with the Fusion package will not work.
  --PLINK '/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/plink_mac_20200616/plink'
  --GEMMA '/Users/haosun/Documents/WG_Reasea

# Working example
A minimal working example (MWE) dataset can be downloaded from the private repo:
    https://github.com/cumc/neuro-twas/blob/master/WIP/GD462.hsq_succ.test.txt

The LDREF data can be downloaded from the official FUSION website.

    ## Test the pipeline with the example data
    sos run SOS_weight_cpt_template.ipynb \
      --GCTA "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/gcta_1.93.2beta_mac/gcta64" \
      --PLINK `which plink` \
      --GEMMA `which gemma` \
      --Rscp  "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/FUSION.compute_weights.mod.R" \
      --Datafile "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/Testing/Data/GD462.hsq_succ.test.txt" \
      --wd  "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/SOS" \
      --OUT_DIR "./" \
      --LDREF  "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/LDREF" \
      --PRE_GENO "1000G.EUR"

# Global parameter settings
This section outlines the parameters that can be set in the command interface.

In [8]:
[global]
##!/bin/sh
# MAKE SURE FUSION.compute_weights.R IS IN YOUR PATH
# FILL IN THESE PATHS
# For Mac users, the Mac version of GCTA must be downloaded separately; the one that comes with the FUSION package will not work.
parameter: GCTA = "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/gcta_1.93.2beta_mac/gcta64"
parameter: PLINK = "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/plink_mac_20200616/plink"
parameter: GEMMA = "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/GEMMA"

# Requires the customized FUSION.compute_weights.mod.R script; otherwise the pipeline will not work
parameter: Rscp = "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/FUSION.compute_weights.mod.R"
# Path to the input data; must include the name of the file itself
parameter: Datafile = "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/Testing/Data/GD462.hsq_succ.test.txt"
# ALTERNATIVELY: ENSURE THAT plink, gcta, gemma CAN BE CALLED FROM PATH AND REMOVE --PATH_* FLAGS BELOW

# PATH TO WORKING DIRECTORY
parameter: wd = path

# PATH TO DIRECTORY CONTAINING LDREF DATA (FROM FUSION WEBSITE or https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2)
parameter: LDREF = path
# THIS IS USED TO RESTRICT INPUT SNPS TO REFERENCE IDS ONLY


# GEUVADIS DATA WAS DOWNLOADED FROM https://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/files/analysis_results/

# PATH TO PREFIX FOR GEUVADIS GENOTYPES SPLIT BY CHROMOSOME
# SUBSAMPLE THESE TO THE LDREF SNPS FOR EFFICIENCY
parameter: PRE_GENO = path

# Specify the column in the gene expression file that contains the chromosome
parameter: chrm = 3
# Specify the column in the gene expression file that contains the position of the gene
parameter: pos = 4
# If both the start and end of the region are specified, their columns can be given separately
parameter: start = int
parameter: end = int

# Specify the column in the gene expression file that contains the name of the gene
parameter: gname = 1

# Specify the scanning window around the gene position; defaults to 50000, for use when start = end
parameter: window = 50000


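# Get the gene information from the expression table.
# The header line is skipped and duplicate rows are removed via set().
# Note that set() does not preserve row order, so the gene order can differ between runs.
data = list(set([tuple(x.strip().split()) for x in open(Datafile).readlines()[1:] if x.strip()]))
# Keep the first four columns of each row (gene name, chromosome and position columns)
geneinfo = [item[0:4] for item in data]
# Restrict the run to list elements 1 to 49 (the element at index 0 is dropped)
geneinfo = geneinfo[1:50]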



# Actual pipeline
## Data preping
This section prepares two inputs for the actual computation.
1. The gene expression phenotype file: a three-column table for each gene, where the first two columns specify the family ID and the within-family ID of each sample. In the current case, where all samples are unrelated, both columns are simply the sample ID. The third column is the gene expression value.

2. The PLINK file trio (.bed, .bim, .fam) for each gene, containing only the SNPs in the region whose expression is recorded. Specifically, SNPs are filtered to the genomic region spanning the gene position +/- the window.
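
The two inputs described above can be sketched in a few lines of Python. This is an illustration only, assuming a tab-separated phenotype file and made-up sample IDs and values; the helper names `locus_window` and `pheno_lines` are hypothetical, and the pipeline itself does this work in the bash step that follows.

```python
def locus_window(start, end, window):
    """Return (P0, P1): the gene region extended by `window` bp on each side."""
    return start - window, end + window


def pheno_lines(sample_ids, values):
    """Build phenotype rows: family ID, within-family ID, expression value.

    For unrelated samples both ID columns are simply the sample ID.
    """
    if len(sample_ids) != len(values):
        raise ValueError("one expression value is needed per sample")
    return [f"{sid}\t{sid}\t{val}" for sid, val in zip(sample_ids, values)]


# A gene with a single position column: start and end are the same.
p0, p1 = locus_window(1_000_000, 1_000_000, 50_000)
print(p0, p1)

for row in pheno_lines(["S1", "S2"], [0.5, -1.2]):
    print(row)
```

Like the shell code in the step below, this sketch does not clamp the lower bound, so a gene close to the start of a chromosome can yield a negative `P0`.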

In [102]:
# Make folder structure for the pipeline
[STEP_1]
input: Datafile, for_each = "geneinfo"
output: f'{wd}/{_input:nb}_per_gene/{_input:nb}.{_geneinfo[gname-1]}.txt',
        f'{wd}/tmp/{_input:nb}.{_geneinfo[gname-1]}.pheno',
        f'{wd}/tmp/{_input:nb}.{_geneinfo[gname-1]}.bed',
        f'{wd}/tmp/{_input:nb}.{_geneinfo[gname-1]}.bim',
        f'{wd}/tmp/{_input:nb}.{_geneinfo[gname-1]}.fam'
bash: expand= "$[ ]", stderr = f'{_output[1]:n}.stderr', stdout = f'{_output[1]:n}.stdout'
    cd $[wd]
    echo $[_geneinfo[gname-1]]
    #NR="$[BATCH_START]_$[BATCH_END]"
    ## Extract all the patient (sample) names
    head -1 $[_input] | awk '{$1=$2=$3=$4=""; print substr($0,4)}' | fmt -1 > PRE_GEXPID
    # Extract each gene from the data matrix to a file
    cat $[_input] | grep $[_geneinfo[gname-1]] |  while read PARAM; do
    echo $PARAM > $[_output[0]]
    done

    CHR=`echo $[_geneinfo[chrm-1]] `
    P0=`echo $[_geneinfo[start-1]] | awk '{ print $0 - $[window] }'`
    P1=`echo $[_geneinfo[end-1]] | awk '{ print $0 + $[window] }'`
    GNAME=`echo $[_geneinfo[gname-1]] `
    OUTpheno=$[_output[1]]
    cat $[_input] | grep $[_geneinfo[gname-1]] | tr '\t' '\n' | tail -n+5 | paste PRE_GEXPID PRE_GEXPID - > $OUTpheno
    OUT=$[_output[1]:n]
    #echo $OUT
    #echo $OUTpheno
    ### Get the locus genotypes for all samples and set current gene expression as the phenotype
    $[PLINK] --bfile $[LDREF]/$[PRE_GENO].$CHR \
    --pheno $OUTpheno \
    --make-bed \
    --out $OUT \
    --chr $CHR \
    --from-bp $P0 \
    --to-bp $P1 \
    --extract $[LDREF]/$[PRE_GENO].$CHR.bim \
    --keep $OUTpheno \
    --allow-no-sex

In [88]:
#Actual weight computation analysis
[STEP_2]
input: group_by = 5
output: f'{wd}/WEIGHTS/{_input[0]:bn}.wgt.RDat',
        f'{wd}/WEIGHTS/{_input[1]:bn}.hsq'
bash: expand= "$[ ]"
    cd $[wd]
    #echo $[_input[0]:bn]
    #echo $[_output[1]:n]
    Rscript $[Rscp] \
    --bfile $[_input[1]:n] \
    --tmp $[_input[1]:n].tmp \
    --out ./WEIGHTS/$[_output[1]:bn] \
    --verbose 0 \
    --save_hsq \
    --PATH_gcta $[GCTA] \
    --PATH_gemma $[GEMMA] \
    --models blup,lasso,top1,enet
    #
    ## Append heritability output to hsq file
    cat $[_output[1]] >> All_passed_gene.hsq
    #
    ## Clean-up just in case
    #rm -f $FINAL_OUT.hsq $OUT.tmp.*
    #
    # Remove all intermediate files
    echo "end of circle"