# Training and testing for imputed expression
This notebook contains the code used to split the samples into a training set and a testing set, and then to use those sets to evaluate how well the imputed expression is predicted.

## Sample selection
The following code randomly selects the samples that form the training set. The training set consists of 80% (682) of the samples.

In [None]:
R: 
library(tidyverse)
# Set the random seed so the split is reproducible
set.seed(999)
# Load in the expression data
gene_exp = read_delim("/home/hs3163/Project/Rosmap/data/gene_exp/DLPFC/geneTpmResidualsAgeGenderAdj_rename.txt","\t")
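# Randomly choose the training set (column 1 holds gene_ID, so sample from columns 2 onward)
training_index = sample(2:ncol(gene_exp),ncol(gene_exp)*0.8)
training_samples = colnames(gene_exp)[training_index]
# all_of() makes it explicit that training_samples is an external character vector
gene_exp_training = gene_exp%>%dplyr::select(gene_ID,all_of(training_samples))
gene_exp_testing = gene_exp%>%dplyr::select(-all_of(training_samples))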

# Write the training and testing sets to disk
gene_exp_training%>%readr::write_delim("/home/hs3163/Project/Rosmap/data/gene_exp/DLPFC/Training_Testing/geneTpmResidualsAgeGenderAdj_rename_training.txt",delim = "\t")
gene_exp_testing%>%readr::write_delim("/home/hs3163/Project/Rosmap/data/gene_exp/DLPFC/Training_Testing/geneTpmResidualsAgeGenderAdj_rename_testing.txt",delim = "\t")
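
A quick sanity check on a split like this is that the training and testing samples do not overlap and together cover every sample. The sketch below illustrates this in base R on a small simulated table. The names `toy_exp`, `train_ids`, and `test_ids` are placeholders, not objects from the pipeline, and the split is taken over the sample columns only, which may differ slightly from the cell above.

```r
# Toy stand-in for the expression table: a gene_ID column plus 10 samples.
set.seed(999)
toy_exp <- data.frame(
  gene_ID = paste0("gene", 1:5),
  matrix(rnorm(50), nrow = 5, dimnames = list(NULL, paste0("S", 1:10)))
)

# Split the sample columns 80/20 (assumption: gene_ID is the only non-sample column).
sample_cols <- setdiff(colnames(toy_exp), "gene_ID")
n_train <- floor(length(sample_cols) * 0.8)
train_ids <- sample(sample_cols, n_train)
test_ids <- setdiff(sample_cols, train_ids)

toy_train <- toy_exp[, c("gene_ID", train_ids)]
toy_test <- toy_exp[, c("gene_ID", test_ids)]

# The two sets must be disjoint and must jointly cover all samples.
stopifnot(length(intersect(train_ids, test_ids)) == 0)
stopifnot(setequal(c(train_ids, test_ids), sample_cols))
```

The same two `stopifnot()` checks could be applied to the column names of the real training and testing tables before they are written out.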


## Gene selections
To test the imputed gene expression without additional complications, only genes that had already passed the heritability check and successfully produced weights were selected. The genes come from the test run on the ROSMAP data (the first 200 genes in the gene expression file) and from the gene list obtained from the Alz data set (143 genes that are in both tables).
For further information, see the following notebook:

{link to be added}

The genes are combined into one region_list with the following commands. All 89 genes are unique.

In [None]:
bash:
# Combining the genes
cd /home/hs3163/Project/traning_testing/data
awk '{ print $3 " " $4 " " $5 " " $2}' /home/hs3163/Project/test/1-30/WEIGHTS/*weights_list* > TT_region_list.txt
tail -n +2  /home/hs3163/Project/test/30-200/WEIGHTS/*weights_list* | awk '{ print $3 " " $4 " " $5 " " $2}' - >> TT_region_list.txt
tail -n +2  /home/hs3163/Project/Alz/WEIGHTS/*weights_list* | awk '{ print $3 " " $4 " " $5 " " $2}' - >> TT_region_list.txt

R:
# See if the genes are unique
region_list = readr::read_delim("/home/hs3163/Project/traning_testing/data/TT_region_list.txt"," ")
length(unique(region_list$ID))
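
If the count of unique IDs were smaller than the number of rows, it would help to know which IDs are repeated. A minimal base R sketch, using a made-up `ids` vector in place of `region_list$ID`:

```r
# Hypothetical gene IDs standing in for region_list$ID.
ids <- c("geneA", "geneB", "geneA", "geneC")

# Number of distinct IDs, as in the check above.
n_unique <- length(unique(ids))

# IDs that occur more than once (each reported once).
dup_ids <- unique(ids[duplicated(ids)])

n_unique
dup_ids
```

An empty `dup_ids` would confirm that every gene appears only once in the list.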

## Recomputing weight for the training set
The weights for each gene are recomputed, using only the samples in the training set, with the following command:

In [None]:
nohup sos run ~/GIT/neuro-twas/Workflow/twas_fusion.ipynb compute_wgt  \
  --molecular-pheno /home/hs3163/Project/Rosmap/data/gene_exp/DLPFC/Training_Testing/geneTpmResidualsAgeGenderAdj_rename_training.txt  \
  --wd /home/hs3163/Project/traning_testing \
  --genotype_list /home/hs3163/Project/Rosmap/data/Rosmap_wgs_genotype_list.txt \
  --region_list /home/hs3163/Project/traning_testing/data/TT_region_list.txt \
  --region_name 1 \
  --data_start 2 \
  --window 500000 \
  --container /home/hs3163/system_file/twas_latest.sif \
  --model bslmm blup lasso top1 enet \
  --job_size 1 \
  -J 6 -q csg -c ~/GIT/neuro-twas/template/csg.yml -s build &

The pipeline took 14 hours to run on the 89 genes. Of these, 18 genes did not pass the heritability check. It is expected that some genes newly fail the heritability check. I suspect the 18 genes that failed are those with borderline heritability.

# Imputed expression estimation and comparison.
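
One way to frame the comparison is to multiply the testing-set genotypes by the weights learned on the training set and correlate the result with the observed expression. The sketch below is an illustration on simulated data only. It assumes that imputed expression is the genotype matrix times a weight vector, and the names `X_test`, `w`, `observed`, and `imputed` are hypothetical rather than objects produced by the pipeline.

```r
set.seed(1)
n_test <- 100  # number of held-out samples (simulated)
p <- 20        # number of SNPs in the cis window (simulated)

# Simulated genotype dosages (0/1/2) for the held-out samples.
X_test <- matrix(rbinom(n_test * p, size = 2, prob = 0.3), nrow = n_test)

# Hypothetical weights, standing in for weights estimated on the training set.
w <- c(0.5, -0.3, rep(0, p - 2))

# Simulated observed expression: genetic component plus noise.
observed <- as.vector(X_test %*% w + rnorm(n_test, sd = 0.5))

# Imputed expression for the held-out samples.
imputed <- as.vector(X_test %*% w)

# Squared correlation between imputed and observed expression.
r2 <- cor(imputed, observed)^2
r2
```

With real data, `observed` would come from the testing expression table and `w` from the weights recomputed on the training set, so the squared correlation would measure out-of-sample prediction accuracy.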

In [3]:
suppressMessages(library('plink2R'))
read_plink()

In [5]:
%run
[susie_1]
geno_file = "/Users/haosun/Documents/WG_Reasearch_Assisstant/"
R: expand= "$[ ]", volumes = [f'{geno_file}:{geno_file}']
   library("susieR")
   library("plink2R")
    geno = read_plink("/Users/haosun/Documents/WG_Reasearch_Assisstant/Remote_Proj/Alz/Alz_AC_SNP/cache/geneTpmResidualsAgeGenderAdj_rename.ENSG00000256294")
    X = scale(geno$bed)
    # The phenotype is the 6th column of the .fam table
    Y = geno$fam[,6]
    fitted = susie(X, Y,
              L = 10,
              estimate_residual_variance = TRUE, 
              estimate_prior_variance = FALSE, 
              scaled_prior_variance = 0.2)
    Hsq = var(predict(fitted)) / var(Y)
    Hsq

HINT: Pulling docker image gaow/twas
Error in library("susieR") : there is no package called ‘susieR’
Execution halted
[91mERROR[0m: [91m[susie_1]: [0]: Executing script in docker returns an error (exitcode=1).
The script has been saved to /Users/haosun/Documents/WG_Reasearch_Assisstant/GIT/freshcopy/neuro-twas/Analysis/ROSMAP_TWAS/.sos/docker_run_24279.R. To reproduce the error please run:
[0m[32mdocker run --rm  -v /Users/haosun/Documents/WG_Reasearch_Assisstant:/Users/haosun/Documents/WG_Reasearch_Assisstant -v /Users/haosun/Documents/WG_Reasearch_Assisstant/GIT/freshcopy/neuro-twas/Analysis/ROSMAP_TWAS/.sos/docker_run_24279.R:/var/lib/sos/docker_run_24279.R    -t  -w=/Users/haosun/Documents/WG_Reasearch_Assisstant/GIT/freshcopy/neuro-twas/Analysis/ROSMAP_TWAS -u 501:20    gaow/twas Rscript /var/lib/sos/docker_run_24279.R[0m[91m[0m


In [7]:
docker run --rm  -v /Users/haosun/Documents/WG_Reasearch_Assisstant:/Users/haosun/Documents/WG_Reasearch_Assisstant -v /Users/haosun/Documents/WG_Reasearch_Assisstant/GIT/freshcopy/neuro-twas/Analysis/ROSMAP_TWAS/.sos/docker_run_24279.R:/var/lib/sos/docker_run_24279.R    -t  -w=/Users/haosun/Documents/WG_Reasearch_Assisstant/GIT/freshcopy/neuro-twas/Analysis/ROSMAP_TWAS -u 501:20    gaow/twas Rscript /var/lib/sos/docker_run_24279.R

Error in library("susieR") : there is no package called ‘susieR’
Execution halted
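
Both runs stop at `library("susieR")` because the package is not available inside the `gaow/twas` image. A small guard at the top of the script can report every missing package at once, before any data are loaded. A minimal sketch in base R, where `missing_packages` is a hypothetical helper name and not part of the pipeline:

```r
# Hypothetical helper: return the names of packages that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages the susie_1 step relies on.
needed <- c("susieR", "plink2R")
absent <- missing_packages(needed)
if (length(absent) > 0) {
  message("Missing packages: ", paste(absent, collapse = ", "))
}
```

Run inside the container, this would list which of the required packages still need to be installed in the image.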

