# Expression weight computation
## Aim
To compute the association between gene expression and SNPs for TWAS analysis.
## Pre-requisites
Make sure you install the following software before running this notebook:
GCTA (gcta_1.93.2beta_mac)
PLINK (plink_mac_20200616)
GEMMA
The modified FUSION.compute_weights.R script, downloaded from this GitHub repo.
## Note
Possibly due to parallel-tasking issues, the first run on each dataset will likely skip some random genes and produce error messages. Simply rerun the script until no error message appears; this will eventually produce the desired output.
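
The rerun advice above can be automated with a small wrapper. The sketch below is only an illustration: `rerun_until_clean` and `max_attempts` are hypothetical names that are not part of the workflow, and the callable passed in is assumed to launch the run and return its exit code (for example by wrapping the `sos run` command with `subprocess.run(...).returncode`).

```python
from typing import Callable


def rerun_until_clean(run_once: Callable[[], int], max_attempts: int = 10) -> int:
    """Call run_once until it returns exit code 0; return the number of attempts used.

    run_once is assumed to launch the workflow and return its exit code.
    Raises RuntimeError if no attempt succeeds within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        if run_once() == 0:
            return attempt
    raise RuntimeError(f"workflow still failing after {max_attempts} attempts")


if __name__ == "__main__":
    # Stand-in for the real command: fails twice, then succeeds.
    codes = iter([1, 1, 0])
    print(rerun_until_clean(lambda: next(codes)))
```

Capping the number of attempts keeps a genuinely broken configuration (a wrong path, for instance) from looping forever.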


# Input and Output
## Input
--gene_exp_file, a gene expression table with genes as rows and samples as columns. Each gene also requires columns specifying chr and pos (or, alternatively, start and end positions); the chr column must use the same format as the chromosomes in the genotype file. The sample names must match the sample IDs in the genotype file.

--geno-path, the path to a genotype inventory that lists the paths of all genotype files, in bgen or PLINK format.

--genotype_file_directory, the path to the directory that holds the genotype files in PLINK format.

--genotype_prefix, the prefix of the genotype files, up to the chromosome name.

--window, the distance by which the region extends beyond the specified start and end sites. If the gene expression file has only one position column, set the window to a large number such as 5E5.
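
As a rough illustration of how `--window` widens a gene's region, the sketch below computes the padded interval from a start, an end, and a window. `cis_region` is a hypothetical helper and not part of the workflow; the default of 50000 mirrors the workflow's default `window` parameter.

```python
from typing import Tuple


def cis_region(start: int, end: int, window: int = 50000) -> Tuple[int, int]:
    """Return (from_bp, to_bp): the gene interval padded by window on both sides."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return start - window, end + window


if __name__ == "__main__":
    # Gene with separate start and end columns.
    print(cis_region(1_000_000, 1_020_000))
    # Gene with a single position column: start == end, so a larger window is used.
    print(cis_region(1_000_000, 1_000_000, window=500_000))
```

Note that a gene close to the start of a chromosome can yield a negative lower bound; this sketch does not clamp it.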

## Output

-- .wgt.RDat, the computed weight data

-- .hsq, the file containing the heritability information for each gene

-- .All_passed_gene.hsq, the file containing the heritability information for all genes in this run


 








# Command interface 

In [1]:
ls
sos run SOS_weight_cpt_template.ipynb -h

FUSION.compute_weights.R          SOS_weight_cpt_template_mod.ipynb
SOS_weight_cpt_template.ipynb
usage: sos run SOS_weight_cpt_template.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  STEP

Global Workflow Options:
  --GCTA '/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/gcta_1.93.2beta_mac/gcta64'
                        !/bin/sh MAKE SURE FUSION.compute_weights.R IS IN YOUR
                        PATH FILL IN THESE PATHS For mac user, the mac version
                        of GCTA shall be downloaded saperately, the one came
                        with the Fusion package will not work.
  --PLINK '/Users/haosun/Documents/WG_

# Working example
A minimal working example (MWE) dataset can be downloaded from the private repo:
https://github.com/cumc/neuro-twas/blob/master/WIP/GD462.hsq_succ.test.txt

The genotype_file_directory data can be downloaded from the official FUSION website.

    ## Test the pipeline with the example data
    sos run SOS_weight_cpt_template.ipynb \
      --GCTA "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/gcta_1.93.2beta_mac/gcta64" \
      --PLINK `which plink` \
      --GEMMA `which gemma` \
      --compute_weight_rscp  "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/FUSION.compute_weights.mod.R" \
      --gene_exp_file "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/Testing/Data/GD462.hsq_succ.test.txt" \
      --wd  "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/SOS" \
      --genotype_file_directory  "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/genotype_file_directory" \
      --genotype_prefix "1000G.EUR"

In [1]:
    ## Test the pipeline with the example data
    sos run SOS_weight_cpt_template.ipynb \
      --GCTA "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/gcta_1.93.2beta_mac/gcta64" \
      --PLINK `which plink` \
      --GEMMA `which gemma` \
      --compute_weight_rscp  "/Users/haosun/Documents/WG_Reasearch_Assisstant/GIT/twas-dev/Workflow/FUSION.compute_weights.R" \
      --gene_exp_file "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/Testing/Data/GD462.hsq_succ.test.txt" \
      --wd  "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/SOS" \
      --genotype_file_directory  "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/LDREF" \
      --genotype_prefix "1000G.EUR" \
      --gene_start 2 \
      --gene_end 3

ERROR: Failed to execute global statement: [Errno 2] No such file or directory: '/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/Testing/Data/GD462.hsq_succ.test.txt'



# Global parameter settings
This section outlines the parameters that can be set from the command interface.

In [7]:
[global]
# MAKE SURE FUSION.compute_weights.R IS IN YOUR PATH
# FILL IN THESE PATHS
# For Mac users, the Mac version of GCTA must be downloaded separately; the one that comes with the FUSION package will not work.
parameter: GCTA = "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/gcta_1.93.2beta_mac/gcta64"
parameter: PLINK = "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/plink_mac_20200616/plink"
parameter: GEMMA = "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/GEMMA"

# Requires the customized FUSION.compute_weights.mod.R script; otherwise the workflow will not work
parameter: compute_weight_rscp = "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/FUSION.compute_weights.mod.R"

Assoc_test_rscp="/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/FUSION.assoc_test.R"





# Path to the input data; must include the name of the file itself
# (phenotype data)
parameter: gene_exp_file = "/Users/haosun/Documents/WG_Reasearch_Assisstant/Fusion/install/fusion_twas-master/Testing/Data/GD462.hsq_succ.test.txt"
#(sumstats data)
parameter: sumstat_file = ""


# PATH TO WORKING DIRECTORY
parameter: wd = path

# PATH TO DIRECTORY CONTAINING genotype_file_directory DATA (FROM FUSION WEBSITE or https://data.broadinstitute.org/alkesgroup/FUSION/genotype_file_directory.tar.bz2)
parameter: genotype_file_directory = path
# THIS IS USED TO RESTRICT INPUT SNPS TO REFERENCE IDS ONLY

# GEUVADIS DATA WAS DOWNLOADED FROM https://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/files/analysis_results/

# PATH TO PREFIX FOR GEUVADIS GENOTYPES SPLIT BY CHROMOSOME
# SUBSAMPLE THESE TO THE genotype_file_directory SNPS FOR EFFICIENCY
parameter: genotype_prefix = path

# Specify the column in the gene expression file that contains the chromosome
parameter: chrom = 3
# If both the start and end of the region are given, their columns can be specified separately
parameter: gene_start = int
parameter: gene_end = int

# Specify the column in the gene expression file that contains the name of the gene
parameter: gene_name = 1

# Specify the scanning window around the gene position (default 50000, intended for the case where start = end)
parameter: window = 50000


# Get the gene information (first four columns of each row) from the gene expression file
data = list(set([tuple(x.strip().split()) for x in open(gene_exp_file).readlines()[1:] if x.strip()]))
geneinfo = [item[0:4] for item in data]
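
To make the shape of `geneinfo` concrete, here is a small self-contained sketch of the same parsing applied to an in-memory table. The header names and values are made up; only the convention that the first four columns describe the gene and the remaining columns hold samples is taken from the workflow. `read_geneinfo` is a hypothetical helper.

```python
import io

# Made-up table: four gene-description columns followed by one column per sample.
TABLE = """\
gene\tstart\tchr\tend\tS1\tS2
GENE_A\t1000\t1\t2000\t0.5\t-0.2
GENE_B\t5000\t2\t6000\t1.1\t0.3

GENE_A\t1000\t1\t2000\t0.5\t-0.2
"""


def read_geneinfo(handle):
    """Skip the header, drop blank and duplicate rows, keep the first four fields."""
    rows = {tuple(line.split()) for line in list(handle)[1:] if line.strip()}
    # Sorting is an addition for reproducible ordering; a bare set has no fixed order.
    return sorted(row[0:4] for row in rows)


if __name__ == "__main__":
    for info in read_geneinfo(io.StringIO(TABLE)):
        print(info)
```

Each entry is a tuple of strings, which is why the steps index it with expressions such as `_geneinfo[gene_name-1]`.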


# Actual pipeline
## Data preparation
This section prepares two inputs for the actual computation.
1. The gene expression phenotype file: a three-column table for each gene, in which the first two columns specify the family ID and within-family ID of each sample. In the current case, where all samples are unrelated, both columns are simply the sample ID. The third column is the gene expression value.

2. The PLINK trio of files (.bed, .bim, .fam) for each gene, containing only the SNPs in the region whose expression is recorded. In particular, the SNPs are filtered to the genomic region defined by position ± window.
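
The first of these inputs can be sketched in a few lines. `make_pheno_lines` is a hypothetical helper, not part of the workflow; it assumes, as the steps in this notebook do, that the first four columns of the expression table describe the gene and the remaining columns are samples. The example header and values are made up.

```python
def make_pheno_lines(header, row, n_info_cols=4):
    """Build PLINK phenotype lines (FID, IID, value) for one gene.

    header and row are the split header line and the split row of one gene.
    Samples are unrelated, so the sample ID is used for both FID and IID.
    """
    samples = header[n_info_cols:]
    values = row[n_info_cols:]
    if len(samples) != len(values):
        raise ValueError("header and row have different numbers of samples")
    return [f"{s}\t{s}\t{v}" for s, v in zip(samples, values)]


if __name__ == "__main__":
    header = "gene start chr end S1 S2 S3".split()
    row = "GENE_A 1000 1 2000 0.5 -0.2 1.3".split()
    print("\n".join(make_pheno_lines(header, row)))
```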

In [None]:
# Make the patient ID file
[STEP_0]
bash: expand = "$[ ]"
    cd $[wd]
    # Extract all the patient names (drop the first four columns of the header, one ID per line)
    head -1 $[gene_exp_file] | awk '{$1=$2=$3=$4=""; print substr($0,4)}' | fmt -1 > PRE_GEXPID



In [32]:
# Make folder structure for the pipeline
[STEP_1]
input: gene_exp_file, for_each = "geneinfo"
output: f'{wd}/{_input:nb}_per_gene/{_input:nb}.{_geneinfo[gene_name-1]}.txt',
        f'{wd}/tmp/{_input:nb}.{_geneinfo[gene_name-1]}.pheno'

bash: expand= "$[ ]", stderr = f'{_output[1]:n}.stderr', stdout = f'{_output[1]:n}.stdout'
    cd $[wd]
    echo $[_geneinfo[gene_name-1]]
    ## Extract all the patient names
    cd $[wd]
    echo $[_geneinfo[gene_name-1]]
    echo $[_input]
    echo $[_output[0]]
    echo "end of note"
    touch $[_output[0]]
    touch $[_output[1]]
    grep $[_geneinfo[gene_name-1]] $[_input] > $[_output[0]:n].txt
    cat $[_output[0]] | tr '\t' '\n' | tail -n+5 > $[_output[1]].tmp
    head -3 PRE_GEXPID
    paste PRE_GEXPID PRE_GEXPID $[_output[1]].tmp > $[_output[1]:n].pheno
    

In [2]:
[STEP_2]
input:  group_by = 2
        
output: f'{_input[1]:n}.bed',
        f'{_input[1]:n}.bim',
        f'{_input[1]:n}.fam'

bash: expand= "$[ ]", stderr = f'{_output[1]:n}.stderr', stdout = f'{_output[1]}.stdout'
    ##### Get the locus genotypes for all samples and set current gene expression as the phenotype
    echo $[_input[1]:n]
    echo $[_output[1]:n]
    
    $[PLINK] --bfile $[genotype_file_directory]/$[genotype_prefix].`cat $[_input[0]] | awk '{ print $$[chrom] }'` \
    --pheno $[_input[1]] \
    --make-bed \
    --out $[_output[1]:n] \
    --chr `cat $[_input[0]] | awk '{ print $$[chrom] }'` \
    --from-bp `cat $[_input[0]] | awk '{ print $$[gene_start] - $[window] }'` \
    --to-bp `cat $[_input[0]] | awk '{ print $$[gene_end] + $[window] }'` \
    --extract $[genotype_file_directory]/$[genotype_prefix].`cat $[_input[0]] | awk '{ print $$[chrom] }'`.bim \
    --keep $[_input[1]] \
    --allow-no-sex
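
For readers who find the nested backticks hard to follow, the same PLINK call can be assembled as an argument list. This is a sketch only: `plink_args` is a hypothetical helper, the paths in the example are placeholders, and the flags are exactly those used in the step above.

```python
def plink_args(plink, geno_dir, geno_prefix, chrom, start, end, window, pheno, out):
    """Return the PLINK command for one gene as a list of arguments."""
    bfile = f"{geno_dir}/{geno_prefix}.{chrom}"
    return [
        plink,
        "--bfile", bfile,
        "--pheno", pheno,
        "--make-bed",
        "--out", out,
        "--chr", str(chrom),
        "--from-bp", str(start - window),
        "--to-bp", str(end + window),
        "--extract", f"{bfile}.bim",
        "--keep", pheno,
        "--allow-no-sex",
    ]


if __name__ == "__main__":
    # Placeholder paths and coordinates, for illustration only.
    args = plink_args("plink", "/data/geno", "1000G.EUR", 1,
                      1_000_000, 1_020_000, 50_000, "tmp/g.pheno", "tmp/g")
    print(" ".join(args))
```

A list like this could be handed to `subprocess.run(args, check=True)`, which avoids shell quoting issues.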

In [7]:
#Actual weight computation analysis
[STEP_3]
input: group_by = 3
output: f'{wd}/WEIGHTS/{_input[0]:bn}.wgt.RDat'

bash: expand= "$[ ]",stderr = f'{_output[0]:nn}.stderr', stdout = f'{_output[0]:nn}.stdout'
    cd $[wd]
    Rscript $[compute_weight_rscp] \
    --bfile $[_input[1]:n] \
    --tmp $[_input[1]:n].tmp \
    --out ./WEIGHTS/$[_output[0]:bnn] \
    --verbose 0 \
    --save_hsq \
    --PATH_gcta $[GCTA] \
    --PATH_gemma $[GEMMA] \
    --models blup,lasso,top1,enet
    
    ## Create a dummy output file that will be deleted in the next step
    touch $[_output[0]]
    ## Append heritability output to hsq file
    cat ./WEIGHTS/$[_output[0]:bnn].hsq >> All_passed_gene.hsq
    echo "end of circle"

In [19]:
# Clean up redundant files
[STEP_4]
import os
input: group_by = 1

bash: expand= "$[ ]", active = (_index == 0)
    cd $[wd]
    echo -e "WGT\tID\tCHR\tP0\tP1" > $[_input[0]:nnnn].pos 

bash: expand= "$[ ]", active = (os.path.getsize(f'{_input[0]}') != 0)
    cd $[wd]
    #$[_input[0]:b]_info= `$[wd]/$[_input[0]:nnnn]_per_gene/$[_input[0]:bnn].txt`
    echo $[_input[0]:b] \
    `awk '{ print $$[gene_name] }' $[wd]/$[_input[0]:bnnnn]_per_gene/$[_input[0]:bnn].txt` \
    `awk '{ print $$[chrom] }' $[wd]/$[_input[0]:bnnnn]_per_gene/$[_input[0]:bnn].txt` \
    `awk '{ print $$[gene_start] - $[window] }' $[wd]/$[_input[0]:bnnnn]_per_gene/$[_input[0]:bnn].txt` \
    `awk '{ print $$[gene_end] + $[window] }' $[wd]/$[_input[0]:bnnnn]_per_gene/$[_input[0]:bnn].txt` >> $[_input[0]:nnnn].pos

bash: expand= "$[ ]", active = (os.path.getsize(f'{_input[0]}') == 0)
    cd $[wd]
    rm $[_input[0]]
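
The logic of this cleanup step can be summarised in a few lines of Python. The sketch is illustrative: `write_pos` and the tuple layout of `genes` are hypothetical, while the header columns and the rule of skipping empty weight files follow the step above. Unlike the `echo` call above, it separates the row fields with tabs.

```python
import os
import tempfile


def write_pos(pos_path, genes, window=50000):
    """Write a .pos table (WGT, ID, CHR, P0, P1) for genes with non-empty weight files.

    genes is an iterable of (weight_file, gene_id, chrom, start, end) tuples.
    Returns the number of gene rows written.
    """
    written = 0
    with open(pos_path, "w") as out:
        out.write("WGT\tID\tCHR\tP0\tP1\n")
        for wgt, gene_id, chrom, start, end in genes:
            # An empty file is assumed to be the dummy created by `touch`.
            if not os.path.exists(wgt) or os.path.getsize(wgt) == 0:
                continue
            fields = [os.path.basename(wgt), gene_id, chrom, start - window, end + window]
            out.write("\t".join(str(f) for f in fields) + "\n")
            written += 1
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        wgt = os.path.join(tmp, "demo.GENE_A.wgt.RDat")
        with open(wgt, "w") as fh:
            fh.write("placeholder weights")
        pos = os.path.join(tmp, "demo.pos")
        write_pos(pos, [(wgt, "GENE_A", 1, 1000, 2000)], window=100)
        with open(pos) as fh:
            print(fh.read())
```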

In [12]:
# Create directory for association test
#[STEP_5]
input: sumstat_file

output: f'{wd}/result'

bash: expand= "$[ ]"
    Rscript $Rscr2 \
    --sumstats $[_sumstat_file] \
    --weights $[wd]/WEIGHTS/$[gene_exp_file:bn].pos \
    --weights_dir $[wd]/WEIGHTS/ \
    --ref_ld_chr $[genotype_file_directory]/$[genotype_prefix] \
    --chr chrom_list \
    --out 

bash: 1+3: command not found

