# ROSMAP Data Description
## Background
The data used are from the ROSMAP data set, a combination of two studies: the Religious Orders Study (ROS) and the Memory and Aging Project (MAP). Both studies recruit older individuals without known dementia and include (1) detailed cognitive, neuroimaging and other ante-mortem phenotyping and (2) an autopsy at the time of death that includes a structured neuropathologic examination. A detailed description of the study's background can be found in:

De Jager, P. L. et al. A multi-omic atlas of the human frontal cortex for aging and Alzheimer’s disease research. Sci. Data 5:180142 doi: 10.1038/sdata.2018.142 (2018).

## Data set
The dataset used in the analysis is located on the csg cluster at `/mnt/mfs/ctcn/datasets/rosmap/`. Two types of data are currently used: gene expression and whole genome sequencing (WGS). The path to each file is documented in the corresponding section.

### Whole Genome Sequencing
The WGS data is stored at `/mnt/mfs/ctcn/datasets/rosmap/wgs/ampad/variants/snvCombinedPlink/`.
There are 7176 samples sequenced over 9536310 SNPs. This information was obtained with the following commands on the cluster.

In [None]:
cd /mnt/mfs/ctcn/datasets/rosmap/wgs/ampad/variants/snvCombinedPlink
# Number of samples (one line per sample in each .fam file)
wc -l *.fam
# Number of SNPs (one line per variant in each .bim file)
wc -l *.bim

### Genotype list
The genotype list file is created with the following code. It provides the path of the genotype file for each chromosome so that the analysis workflow can locate them.

Due to the issue https://github.com/cumc/neuro-twas/issues/19, a symbolic link was created and used in the genotype_list file.

In [None]:
library(dplyr)
library(tibble)
# One row per chromosome (1-23), pointing at the corresponding PLINK .bed file
genotype_list = tibble(chr = 1:23) %>%
  mutate(dir = paste0("/mnt/mfs/ctcn/datasets/rosmap/wgs/ampad/variants/snvCombinedPlink/chr", chr, ".bed")) %>%
  select("#chr" = chr, dir)
# Write a tab-separated file with a header line
readr::write_tsv(genotype_list, "/home/hs3163/Project/Rosmap/data/Rosmap_wgs_genotype_list.txt")
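
As a quick sanity check on a genotype list of this shape, the sketch below rebuilds the same two-column layout in shell and reports any listed file that does not exist. It runs against a temporary toy directory: `geno_dir`, `list_file` and the empty placeholder `.bed` files are assumptions made for illustration, not paths from the cluster.

```shell
# Hypothetical toy directory standing in for the real genotype folder.
geno_dir=$(mktemp -d)
list_file="$geno_dir/genotype_list.txt"

# Create empty placeholder files for chr1..chr22 only, so chr23 is reported as missing.
for chr in $(seq 1 22); do
    touch "$geno_dir/chr${chr}.bed"
done

# Write the two-column, tab-separated list with the same header as the R code.
printf '#chr\tdir\n' > "$list_file"
for chr in $(seq 1 23); do
    printf '%s\t%s\n' "$chr" "$geno_dir/chr${chr}.bed" >> "$list_file"
done

# Report every listed path that does not exist (the header line is skipped).
missing=$(tail -n +2 "$list_file" | while IFS=$'\t' read -r chr path; do
    [ -e "$path" ] || echo "$chr"
done)
echo "missing chromosomes: ${missing:-none}"
```

Running a check like this before launching the workflow can catch a broken symbolic link or a mistyped path early.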

### Gene expression

The gene expression data for each tissue is stored in the following folders.

For dorsolateral prefrontal cortex (DLPFC), 18629 genes (filtered from 58302) from 1092 samples were sequenced.
`/mnt/mfs/ctcn/datasets/rosmap/rnaseq/dlpfcTissue/batch0-m5/values/`

For anterior commissure (AC), 19147 genes from 731 samples were sequenced.
`/mnt/mfs/ctcn/datasets/rosmap/rnaseq/acTissue/mbatch1-4/values/`

For posterior cingulate cortex (PCC), 19017 genes (filtered from 58302) from samples were sequenced
`/mnt/mfs/ctcn/datasets/rosmap/rnaseq/pccTissue/mbatch1-5/values/`

Each tissue directory contains the following folders:
* /values
    * /raw: raw expected counts and TPM data
    * /filtered: filtered data after removing outlier samples and lowly expressed genes
    * /residuals: residuals of TPM data after regressing out significant covariates associated with RNA-seq data
        * geneTpmResidualsAgeGenderUnadj.txt: adjusted for batch and technical covariates (AgeGenderUnadj)
        * geneTpmResidualsAgeGenderAdj.txt: adjusted for age, sex, batch, and technical covariates (AgeGenderAdj)
    * /scripts: R scripts to produce these results
For our analysis, we use the residuals that are adjusted for both the technical covariates and background variables such as age and sex.

The information above was obtained with the following commands on the cluster.

In [None]:
# Number of samples: count the fields in the header line
# (the count includes the gene ID column if the header labels it)
head -1 /mnt/mfs/ctcn/datasets/rosmap/rnaseq/dlpfcTissue/batch0-m5/values/residuals/geneTpmResidualsPlusBaselineAgeGenderAdj.txt | wc -w
head -1 /mnt/mfs/ctcn/datasets/rosmap/rnaseq/acTissue/mbatch1-4/values/residuals/geneTpmResidualsAgeGenderAdj.txt | wc -w
head -1 /mnt/mfs/ctcn/datasets/rosmap/rnaseq/pccTissue/mbatch1-5/values/residuals/geneTpmResidualsAgeGenderAdj.txt | wc -w
# Number of genes: count the lines in each file (the count includes the header line)
wc -l /mnt/mfs/ctcn/datasets/rosmap/rnaseq/dlpfcTissue/batch0-m5/values/residuals/geneTpmResidualsPlusBaselineAgeGenderAdj.txt
wc -l /mnt/mfs/ctcn/datasets/rosmap/rnaseq/acTissue/mbatch1-4/values/residuals/geneTpmResidualsAgeGenderAdj.txt
wc -l /mnt/mfs/ctcn/datasets/rosmap/rnaseq/pccTissue/mbatch1-5/values/residuals/geneTpmResidualsAgeGenderAdj.txt
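
To make the counting logic concrete, here is a minimal sketch on a toy matrix. It assumes the layout implied by the workflow options used in this document (one row per gene, one column per sample, gene identifier in the first column) and additionally assumes that the header line labels the gene column. The file contents and the names `expr_file`, `n_samples` and `n_genes` are made up for illustration.

```shell
# Hypothetical toy expression matrix: one header line, then one row per gene.
expr_file=$(mktemp)
printf 'gene_id\tS1\tS2\tS3\n'   >  "$expr_file"
printf 'geneA\t0.1\t-0.2\t0.3\n' >> "$expr_file"
printf 'geneB\t1.1\t0.4\t-0.9\n' >> "$expr_file"

# Samples = header fields minus the gene ID column (assumes the header labels it).
n_samples=$(head -1 "$expr_file" | awk -F'\t' '{ print NF - 1 }')

# Genes = total number of lines minus the header line.
n_genes=$(awk 'END { print NR - 1 }' "$expr_file")

echo "samples: $n_samples, genes: $n_genes"
```

If the header of the real files does not label the gene column, the subtraction for `n_samples` should be dropped.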


### Problematic SNPs
When the WGS data is used for TWAS FUSION or any other computation that requires the bed matrix to be standardized with `scale()` in R, three SNPs on chromosome 7 (`7:100549543_A_G`, `7:100614961_A_G` and `7:100615281_A_G`) have a value of 1 for every sample. These zero-variance columns produce NaN after `scale()` and cause the error illustrated in issue #23.
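
A column whose values are all identical has zero standard deviation, so standardizing it divides zero by zero. The sketch below shows one way such columns could be detected before scaling. It assumes the genotypes have been exported to a tab-separated text matrix with SNP IDs in the header and one row of dosages per sample, which is a hypothetical layout. The toy IDs `snpA`, `snpB` and `snpC` are placeholders, not variants from the data set.

```shell
# Hypothetical toy genotype matrix: header of SNP IDs, one row of dosages per sample.
geno_file=$(mktemp)
printf 'snpA\tsnpB\tsnpC\n' >  "$geno_file"
printf '0\t1\t2\n'          >> "$geno_file"
printf '1\t1\t0\n'          >> "$geno_file"
printf '2\t1\t1\n'          >> "$geno_file"

# A column is constant when every value equals the one in the first data row.
constant_snps=$(awk -F'\t' '
    NR == 1 {
        n = NF
        for (i = 1; i <= n; i++) id[i] = $i
        next
    }
    NR == 2 {
        for (i = 1; i <= n; i++) { first[i] = $i; same[i] = 1 }
        next
    }
    {
        for (i = 1; i <= n; i++) if ($i != first[i]) same[i] = 0
    }
    END {
        for (i = 1; i <= n; i++) if (same[i]) print id[i]
    }
' "$geno_file")

echo "constant SNPs: $constant_snps"
```

SNPs reported this way can be excluded from the matrix before it is scaled.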

### Data Preprocessing
Currently the sample names used in the gene expression file and the WGS file are not the same. An SoS workflow is used to:
* Match the sample names of the gene expression file with those of the genotype file.
* Create a region list file for all the regions contained in the gene expression file. This list allows us to flexibly control which regions are included in the analysis.

The index file used to match the sample names can be found here:
 `/mnt/mfs/ctcn/datasets/rosmap/wgs/ampad/qualityControl/sampleSheetAfterQc.csv`


In [None]:
sos run ~/GIT/neuro-twas/Workflow/SOS_ROSMAP_gene_exp_processing.ipynb \
    --ref "/home/hs3163/Project/Rosmap/data/patient_key_WGS.txt" \
    --name_col 2 \
    --real_name_col 1 \
    --gene_exp "/mnt/mfs/ctcn/datasets/rosmap/rnaseq/dlpfcTissue/batch0-m5/values/residuals/geneTpmResidualsAgeGenderAdj.txt" \
    --start_at 2 \
    --output "/home/hs3163/Project/Rosmap/data/gene_exp/DLPFC" \
    -J 6 -q csg -c ~/system_file/csg.yml

sos run ~/GIT/neuro-twas/Workflow/SOS_ROSMAP_gene_exp_processing.ipynb \
    --ref "/home/hs3163/Project/Rosmap/data/patient_key_WGS.txt" \
    --name_col 2 \
    --real_name_col 1 \
    --gene_exp "/mnt/mfs/ctcn/datasets/rosmap/rnaseq/pccTissue/mbatch1-5/values/residuals/geneTpmResidualsAgeGenderAdj.txt" \
    --start_at 2 \
    --output "/home/hs3163/Project/Rosmap/data/gene_exp/PCC" \
    -J 6 -q csg -c ~/system_file/csg.yml

sos run ~/GIT/neuro-twas/Workflow/SOS_ROSMAP_gene_exp_processing.ipynb \
    --ref "/home/hs3163/Project/Rosmap/data/patient_key_WGS.txt" \
    --name_col 2 \
    --real_name_col 1 \
    --gene_exp "/mnt/mfs/ctcn/datasets/rosmap/rnaseq/acTissue/mbatch1-4/values/residuals/geneTpmResidualsAgeGenderAdj.txt" \
    --start_at 2 \
    --output "/home/hs3163/Project/Rosmap/data/gene_exp/AC" &
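
The sample-name matching step can be pictured with the following sketch. It is not the workflow's actual implementation. It assumes a two-column key file in which column 1 holds the genotype sample ID and column 2 the expression sample ID (one reading of `--real_name_col 1` and `--name_col 2`), and it assumes that samples without a match are dropped. All file contents and IDs are hypothetical.

```shell
# Hypothetical key file: column 1 = genotype sample ID, column 2 = expression sample ID.
key_file=$(mktemp)
printf 'WGS_01\tRNA_A\n' >  "$key_file"
printf 'WGS_02\tRNA_B\n' >> "$key_file"

# Hypothetical expression matrix whose header uses the expression sample IDs.
expr_file=$(mktemp)
printf 'gene_id\tRNA_A\tRNA_B\tRNA_C\n' >  "$expr_file"
printf 'gene1\t0.5\t-0.1\t0.7\n'        >> "$expr_file"

# Keep the gene column plus every sample found in the key, renamed to its genotype ID.
renamed=$(awk -F'\t' -v OFS='\t' '
    # First file: build the lookup from expression ID to genotype ID.
    NR == FNR { map[$2] = $1; next }
    # Header of the second file: remember matched columns and rename them.
    FNR == 1 {
        k = 0
        for (i = 2; i <= NF; i++) if ($i in map) { keep[++k] = i; $i = map[$i] }
    }
    # Every line of the second file: print the gene column and the kept columns.
    {
        line = $1
        for (j = 1; j <= k; j++) line = line OFS $(keep[j])
        print line
    }
' "$key_file" "$expr_file")

printf '%s\n' "$renamed"
```

In this toy example `RNA_C` has no entry in the key file, so its column is removed from the output.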

After processing, the DLPFC data contains 852 samples and 17891 genes.

To estimate how many of the genes can pass the heritability check, the following commands were run.

In [None]:
R:
library(tidyverse)
region_list=read_delim("/home/hs3163/Project/Rosmap/data/gene_exp/DLPFC/geneTpmResidualsAgeGenderAdj_rename_region_list.txt","\t")
# Randomly select 2000 genes that have not been run yet (rows 1-200 are skipped)
index = sample(201:nrow(region_list),2000)
region_list_selected = region_list[index,]
region_list_selected%>%readr::write_delim("/home/hs3163/Project/est_hsq/est_hsq_region_list.txt",delim = "\t")

bash:
# To save time, compute weight without the bslmm module
nohup sos run ~/GIT/neuro-twas/Workflow/twas_fusion.ipynb compute_wgt  \
  --molecular-pheno /home/hs3163/Project/Rosmap/data/gene_exp/DLPFC/geneTpmResidualsAgeGenderAdj_rename.txt  \
  --wd /home/hs3163/Project/est_hsq \
  --genotype_list /home/hs3163/Project/Rosmap/data/Rosmap_wgs_genotype_list.txt \
  --region_list /home/hs3163/Project/est_hsq/est_hsq_region_list.txt \
  --region_name 1 \
  --data_start 2 \
  --window 500000 \
  --container /home/hs3163/system_file/twas_latest.sif \
  --model blup lasso top1 enet \
  --job_size 50 \
  -J 10 -q csg -c ~/GIT/neuro-twas/template/csg.yml -s build &