# BNFO62: Population Genetics, Day 2

**Author:** Tiffany Amariuta

#### Copy over the data

In [None]:
system("cp -r ~/public/popgen/InClass_Day2 ~/module-10-popgen/Day2/InClass_Day2")

#### Load required libraries
Check that these load without error. It is okay if plink2R gives a warning about Rcpp; the scripts will still run.

In [None]:
library(data.table)
library(plink2R) #I followed this: https://github.com/gabraham/plink2R/issues/1#issuecomment-1337177621
library(glmnet) #install.packages("glmnet")
#library(RcppEigen) #failed to install from source. 

#check that these are installed (needed for FUSION)
library("optparse")
library('plink2R')
library('glmnet')
library('methods')

To prepare this workshop, I have selected a couple of genes associated with Alzheimer's disease and Ulcerative Colitis. These genes will be the focus of our TWAS analysis.

Our first task will be to run FUSION (http://gusevlab.org/projects/fusion/) to build gene expression prediction models for our TWAS analysis. 

First, let's explore the gene expression and genotype reference files that I've provided. Please answer question 1 in the chunk below.  

In [None]:
gene1 <- "ENSG00000158864.12" #a gene that is implicated in Alzheimer's
gene2 <- "ENSG00000116704.7" #a gene that is implicated in Ulcerative Colitis

list.files()

In [None]:
fam_file <- fread("InClass_Day2/1000G.EUR.1.fam", header = FALSE) #people; PLINK .fam files have no header row
head(fam_file)

In [None]:
bim_file <- fread("InClass_Day2/1000G.EUR.1.bim", header = FALSE) #SNPs; PLINK .bim files have no header row
head(bim_file)

In [None]:
bed_file <- fread("InClass_Day2/1000G.EUR.1.bed", header = T) #genotypes; this will not work because the file is binary, so we use read_plink()
head(bed_file)

In [None]:
genotypes = read_plink("InClass_Day2/1000G.EUR.1",impute="avg") #notice how we don't provide the suffix; read_plink needs the entire trio to read
#type this and press tab after the $: head(genotypes$)

In [None]:
head(genotypes$fam)

In [None]:
head(genotypes$bim)

In [None]:
genotypes$bed[1:5,1:5]

In [None]:
dim(genotypes$bed) 

In [None]:
nrow(genotypes$fam)

In [None]:
nrow(genotypes$bim)
#therefore in bed file genotype matrix, people on the rows, SNPs on the columns

#### Question 1: What do the numbers 0, 1, and 2 represent? 

ANSWER 1: 

Let's take a look at the alleles in the genotype file. The minor allele frequency (MAF) is defined as the proportion of alleles seen in the population that are minor. Individuals with genotype = 2 have 2 minor alleles; individuals with genotype = 1 have 1 minor and 1 major allele; individuals with genotype = 0 have 2 major alleles.
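
As a minimal sketch of this definition, assuming a small made-up genotype matrix (`toy_geno` is hypothetical and not part of the workshop data), the frequency of the counted allele at each SNP is the column mean divided by 2, and the minor allele frequency is the smaller of that frequency and its complement:

```r
# Toy genotype matrix (hypothetical data): 4 people on the rows, 3 SNPs on the columns.
# matrix() fills column by column, so each group of 4 values below is one SNP.
toy_geno <- matrix(c(0, 1, 2, 1,
                     0, 0, 0, 1,
                     2, 2, 1, 2),
                   nrow = 4, ncol = 3,
                   dimnames = list(paste0("person", 1:4), paste0("snp", 1:3)))

# Each person carries 2 alleles per SNP, so allele frequency = mean genotype / 2.
allele_freq <- colMeans(toy_geno) / 2

# The minor allele is whichever allele is rarer, so the frequency never exceeds 0.5.
maf <- pmin(allele_freq, 1 - allele_freq)
print(maf)
```

The same two lines should apply to any matrix with people on the rows and SNPs on the columns, provided missing genotypes have already been imputed or removed.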

**Exercise 1**: How many rare variants are there (minor allele frequency < 1%)? How many common variants are there (minor allele frequency > 5%)? As a sanity check, a histogram of the MAF should show values ranging from 0 to 0.5.

In [None]:
#### YOUR CODE HERE ####

Let's examine the gene expression data. These are gene expression matrices in two tissues relevant to Alzheimer's and Ulcerative Colitis. Please answer question 2 in the chunk below.

In [None]:
blood_gene_expression <- fread("InClass_Day2/GTEx_matrix_blood.txt", header = T)
blood_gene_expression[1:5,1:10]

brain_gene_expression <- fread("InClass_Day2/GTEx_matrix_brain.txt", header = T)
brain_gene_expression[1:5,1:10]

#### Question 2: What is on the columns? What is on the rows?

ANSWER 2: 

*Note - I took a sample of genes for this exercise; ordinarily there will be ~20K genes in total per tissue.*

In [None]:
#### YOUR CODE HERE ####

Let's examine the genotype data for the people for whom gene expression was measured. I previously extracted the SNPs in a cis window (+/- 500 kb) around our genes of interest. Let's look at one of these genes: ENSG00000158864.12. Most genes will be expressed in multiple tissues, e.g. in both the blood and brain tissue that we are considering in this exercise. But not all individuals will have donated each type of tissue, so there will be different individuals in the brain versus blood datasets. Given this, please answer the question in the chunk below.

In [None]:
GTEx_genotypes = read_plink("InClass_Day2/ENSG00000158864.12_brain",impute="avg") 
GTEx_genotypes$bed[1:5,1:5]
head(GTEx_genotypes$fam)
head(GTEx_genotypes$bim)

In [None]:
# Question 2b: How many individuals are in common between the brain dataset and the blood dataset?

#### YOUR CODE HERE ####

In today's exercise, we will be checking for an association of each of these genes with Alzheimer's and Ulcerative Colitis. Since each of these genes is expressed in blood and brain tissue, we will be testing 8 associations (all combinations of gene 1 vs. 2, tissue 1 vs. 2, disease 1 vs. 2).
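
To make the count concrete, here is a small sketch that enumerates the combinations with `expand.grid()`. The disease labels `"Alzheimers"` and `"UC"` are shorthand chosen for this illustration, not names used by any file or script in the workshop.

```r
# Enumerate every gene x tissue x disease combination (2 x 2 x 2 = 8 tests).
combos <- expand.grid(
  gene    = c("ENSG00000158864.12", "ENSG00000116704.7"),
  tissue  = c("brain", "blood"),
  disease = c("Alzheimers", "UC"),   # illustrative labels only
  stringsAsFactors = FALSE
)

print(combos)
nrow(combos)  # number of gene-tissue-disease associations to test
```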

As we learned in lecture, it is very important to consider the relevant tissue context for a disease. If a gene has a role in Ulcerative Colitis, it is more likely to act via blood tissue than via brain tissue. Likewise, if a gene has a role in Alzheimer's, it is more likely to act via brain tissue than via blood tissue. We can test this hypothesis. Moreover, we can test the hypothesis that these genes have different eQTL models in different tissues.

For each gene, we will run FUSION using both blood and brain gene expression data. 

Because we are skipping heritability estimation with GCTA, we need to comment out lines 150-160 in FUSION.compute_weights.R; please do this now.

Let's first get familiar with running FUSION and supplying relevant arguments. Please answer Question 3 below. 


In [None]:
gene1 <- "ENSG00000158864.12" #a gene that is implicated in Alzheimer's
gene2 <- "ENSG00000116704.7" #a gene that is implicated in Ulcerative Colitis
tissue1 <- "brain"
tissue2 <- "blood"

gene <- gene1
tissue <- tissue1

#We can use the system() function to run unix/bash commands in R. 
#I have downloaded the fusion software to a directory that is in the same parent directory as our current working directory "InClass_Day2".

system(paste0("Rscript FUSION.compute_weights_classedit.R --bfile InClass_Day2/",gene,"_",tissue," --covar InClass_Day2/covar_",tissue,".txt --tmp InClass_Day2/tmp/tmp_",gene,
              " --out GeneExpressionModel_",gene,"_",tissue," --hsq_set 0.05 --models top1,lasso,enet --verbose 2 > log_",gene,"_",tissue,".txt"))

#### Question 3: How much variance of gene expression of ENSG00000158864.12 in brain tissue is explained by covariates? Hint: check the log file which will be created in our current working directory. 

ANSWER 3: 

Let's check out what FUSION did. 

In [None]:
logfiles <- list.files(pattern = "log_")
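y <- readLines(logfiles[1]) #the log is free text, so read it line by line rather than parsing it as a table
cat(y, sep = "\n")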

#The proportion of gene expression variance explained by covariates was computed (the covariates are then regressed out of the gene expression variable).

#Then PLINK was run to update the phenotype (gene expression) variable with residualized gene expression values. 

#Then, heritability estimation was skipped "### Skipping heritability estimate"

#Now gene expression models (3 types) are being fit via cross-validation. 
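
The covariate step described in the comments above can be sketched on simulated data. This is only an illustration of the general idea (regress expression on covariates, report R², keep the residuals); the variable names and the simulated covariates `sex` and `pc1` are assumptions and are not taken from the workshop files.

```r
set.seed(1)
n <- 200

# Hypothetical covariates and a simulated expression phenotype that depends on them.
covar <- data.frame(sex = rbinom(n, 1, 0.5), pc1 = rnorm(n))
expression <- 0.8 * covar$sex + 0.5 * covar$pc1 + rnorm(n)

# Regress expression on the covariates.
fit <- lm(expression ~ sex + pc1, data = covar)

# R-squared = proportion of expression variance explained by the covariates.
r2_covar <- summary(fit)$r.squared
print(r2_covar)

# The residuals are the "residualized" expression values used for model fitting.
resid_expression <- residuals(fit)
```

By construction, the residuals are uncorrelated with each covariate, so any remaining SNP signal cannot be explained by the covariates that were regressed out.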

When we ran the FUSION compute_weights script, we also generated an Rdat file. Let's check this out to see what it contains. Please answer the question below.  

In [None]:
load(paste0("GeneExpressionModel_",gene,"_",tissue,".wgt.RDat"))

GTEx_genotypes_brain = read_plink(paste0("InClass_Day2/",gene,"_",tissue),impute="avg")
nrow(GTEx_genotypes_brain$bim) #there are 349 SNPs in the cis window of this gene

#The objects loaded from the RDat file include:
wgt.matrix
cv.performance
hsq
hsq.pv

In [None]:
dim(wgt.matrix) #there are 349 rows (SNPs)
head(wgt.matrix) #we ran 3 different types of eQTL models here. 

#### Question 4. How many SNPs (features) were found to be predictive in the lasso model? In the enet model? 

In [None]:
#Answer 4. 

#### YOUR CODE HERE ####

Let's complete generating the eQTL models (weights) for the other combinations of genes and tissues, which are inputs to the TWAS analysis we will do next. 

In [None]:
#### YOUR CODE HERE ####

Now we have almost all the input files we need to run TWAS. The last step is to make a POS file that tells TWAS which genes we want to analyze and where their weight files exist. 

In [None]:
#POS file has this format: PANEL   WGT     ID      CHR     P0      P1      N
pos_file <- matrix(0,nrow = 4, ncol = 7) #2 tissues x 2 genes; initialize empty matrix

#PANEL is the tissue
#WGT is the path to the gene expression model (Rdat files)
#ID is the gene / tissue name
#For CHR, P0 (start), P1 (end): we obtain these for each gene from either gene expression file. 

ge_blood <- fread("InClass_Day2/GTEx_matrix_blood.txt", header = T)
ge_brain <- fread("InClass_Day2/GTEx_matrix_brain.txt", header = T)
colnames(ge_blood)[1] <- "chr"
colnames(ge_brain)[1] <- "chr"

row_count <- 1
for (i in 1:2){ #count gene
  gene <- get(paste0("gene",i))
  for (j in 1:2){ #count tissue
    tissue <- get(paste0("tissue",j))
    m <- match(gene,ge_blood$gene_id) #doesn't matter if ge_blood or ge_brain
    pos_file[row_count,] <- c(tissue,paste0("GeneExpressionModel_",gene,"_",tissue,".wgt.RDat"),paste0("InClass_Day2/",gene,":",tissue),
                              strsplit(ge_blood$chr[m], split = "chr")[[1]][2], ge_blood$start[m], ge_blood$end[m], ncol(get(paste0("ge_",tissue)))-4)
    row_count <- row_count + 1
  }
}

colnames(pos_file) <- c("PANEL","WGT","ID","CHR","P0","P1","N")
write.table(pos_file, file = "twas.pos", row.names=F, col.names = T, sep = "\t", quote = F)

Now we have all of our input files! Let's remind ourselves what the GWAS summary statistics look like and then we will run TWAS. 

In [None]:
gwas <- fread("InClass_Day2/PASS_Alzheimers_deRojas2021.sumstats", header = T, nrows = 10) #just a sample of rows
head(gwas)
 
#Next, we need to run a different FUSION script. We will execute it in bash, but launch it from R. Now that we have generated the POS file, we don't have to specify each gene individually. Note that if these genes were on different chromosomes, we would have to run a separate TWAS command per chromosome.
system("for sumstats in PASS_Alzheimers_deRojas2021.sumstats PASS_UC_deLange2017.sumstats
do 
Rscript FUSION.assoc_test.R --sumstats InClass_Day2/${sumstats} --weights twas.pos --weights_dir . --ref_ld_chr InClass_Day2/1000G.EUR. --chr 1 --out TWAS_${sumstats}.dat
done")

#Look at the log: no genes were skipped. Sometimes TWAS skips genes when there are no GWAS SNPs near the eQTL model SNPs.
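
For intuition about what the association test computes, here is a hedged sketch of the summary-statistic TWAS Z-score on made-up numbers: the weighted sum of GWAS Z-scores, divided by the square root of the variance of that sum under the reference LD. All inputs below (`w`, `z_gwas`, `ld`) are hypothetical, and the real script also handles steps such as allele matching and missing SNPs that are omitted here.

```r
# Hypothetical inputs for a gene with 3 cis-SNPs.
w      <- c(0.30, 0.00, -0.20)   # eQTL weights from the expression model
z_gwas <- c(2.5, 1.0, -3.0)      # GWAS Z-scores for the same SNPs, same allele orientation
ld     <- matrix(c(1.0, 0.2, 0.1,
                   0.2, 1.0, 0.3,
                   0.1, 0.3, 1.0), nrow = 3)  # SNP-SNP correlations from a reference panel

# Numerator: predicted-expression-weighted sum of GWAS Z-scores.
num <- sum(w * z_gwas)

# Denominator: standard deviation of that weighted sum, given LD between the SNPs.
den <- sqrt(as.numeric(t(w) %*% ld %*% w))

z_twas <- num / den
p_twas <- 2 * pnorm(-abs(z_twas))  # two-sided p-value
print(c(z = z_twas, p = p_twas))
```

Note that SNPs with zero weight (as in a sparse lasso model) drop out of both the numerator and the denominator.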

Let's check out the results of the TWAS. Which genes are significantly associated with UC vs Alzheimer's? 

In [None]:
#### YOUR CODE HERE ####