# BNFO62: Population Genetics, Day 1

**Author:** Tiffany Amariuta

#### Create a symbolic link to the data

In [None]:
system("ln -sfn ~/public/popgen/InClass_Day1 ~/module-10-popgen/Day1/InClass_Day1")

#### Load required libraries 
Check that both libraries load without error.

In [None]:
library(data.table)
library(susieR)

### Today we will be using SuSiE, a tool for fine-mapping analysis. 
Use the SuSiE tutorial website as a reference: https://stephenslab.github.io/susieR/articles/finemapping_summary_statistics.html

Our gene of interest is ENSG00000078403 (chromosome 10: 21,524,646-21,743,630).

**Exercise 1.**

Extract the breast cancer GWAS summary statistics for only the SNPs in the window around our gene of interest.
1. Use the 1000G.EUR.10.bim file to find the SNPs that lie within 500 kb upstream or downstream of the gene body.
2. Subset the summary statistics to these SNPs, save the result to the variable y, then write the table to a file as shown below.

In [None]:
y <- fread("InClass_Day1/UKB_460K.cancer_BREAST.sumstats.gz", header = T)

head(y)

snps <- fread("InClass_Day1/1000G.EUR.10.bim", header = F)
w <- which(snps$V4 > 21524646 - 500000 & snps$V4 < 21743630 + 500000) #SNPs within the gene body +/- 500 kb
ww <- which(y$SNP %in% snps$V2[w]) #422 SNPs
y <- y[ww,]

write.table(y, file = "BrCa_GWAS_ENSG00000078403_locus.txt", row.names = F, col.names = T, sep = "\t", quote = F)
system("gzip -f BrCa_GWAS_ENSG00000078403_locus.txt")
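
The window filter in Exercise 1 can be checked on a toy example. The sketch below uses a made-up .bim-style table and made-up summary statistics (the SNP names, positions, and Z values are all hypothetical). It assumes the usual PLINK .bim layout, in which column 2 is the SNP ID and column 4 is the base-pair position.

```r
# Toy .bim-style table: V2 = SNP ID, V4 = base-pair position (hypothetical values)
bim <- data.frame(
  V1 = 10,
  V2 = c("rsA", "rsB", "rsC", "rsD", "rsE"),
  V3 = 0,
  V4 = c(20900000, 21100000, 21600000, 22200000, 22300000)
)
# Toy summary statistics; rsF is absent from the .bim table
gwas <- data.frame(SNP = c("rsA", "rsB", "rsC", "rsD", "rsF"),
                   Z = c(0.3, -1.2, 4.5, 2.1, 0.8))

gene_start <- 21524646
gene_end   <- 21743630
flank      <- 500000

# SNPs strictly inside the gene body +/- 500 kb window
in_window <- bim$V4 > gene_start - flank & bim$V4 < gene_end + flank
# Keep only the summary statistics for SNPs in the window
locus <- gwas[gwas$SNP %in% bim$V2[in_window], ]
print(locus)
```

Only SNPs that are both inside the window and present in the summary statistics survive, which is why the number of rows kept can be smaller than the number of SNPs in the window.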

**Exercise 2.**

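Make a locus zoom plot using one of SuSiE's built-in functions. A locus zoom plot has genomic coordinate on the x-axis and -log10(p-value) on the y-axis. Note that susie_plot() orders variants by their index in the input rather than by base-pair position.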

In [None]:
sumstats <- fread("BrCa_GWAS_ENSG00000078403_locus.txt.gz", header = T)
susie_plot(sumstats$Z, y = "z")
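
For comparison, a locus zoom plot can also be drawn by hand, which makes both axes explicit. This is a minimal sketch with hypothetical positions and z-scores. It assumes two-sided p-values, and on the real data the positions would have to be looked up from the .bim file.

```r
# Hypothetical positions (bp) and z-scores for five SNPs
pos <- c(21100000, 21350000, 21600000, 21850000, 22100000)
z   <- c(0.5, -2.0, 5.2, -3.1, 0)

# Two-sided -log10(p), computed on the log scale so that large |z| does not underflow
neg_log10_p <- -(pnorm(-abs(z), log.p = TRUE) + log(2)) / log(10)

plot(pos / 1e6, neg_log10_p, pch = 16,
     xlab = "Position on chromosome 10 (Mb)", ylab = "-log10(p)")
```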

**Exercise 3.**

Run SuSiE to perform fine-mapping of this locus. You will need information about linkage disequilibrium between pairs of variants in order to do this. I have already computed the linkage disequilibrium between pairs of variants in this locus. 

Once you have run SuSiE, use the susie_plot() function on the output with y = "PIP" to visualize the credible set SNPs. 

How many credible set SNPs did you find? 

In [None]:
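#LD matrix precomputed by Tiffany using plink2R, with LD estimated from 1KG individuals for this locus
R <- fread("InClass_Day1/LDmatrix_ENSG00000078403_locus.txt.gz", header = F)
#Note: the susieR documentation recommends estimate_residual_variance = TRUE when R is in-sample LD;
#R here comes from a reference panel, so it is worth comparing against estimate_residual_variance = FALSE
fitted_rss = susie_rss(z = sumstats$Z, R = as.matrix(R), n = 459324, L = 10,
                        estimate_residual_variance = TRUE)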

susie_plot(fitted_rss, y="PIP")
summary(fitted_rss)$cs #there are 6 credible sets, each containing a single variant
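
To help interpret this output, the sketch below shows how PIPs and a 95% credible set relate to the per-effect posterior probabilities. It assumes the definitions from the SuSiE paper: the PIP of SNP j is 1 minus the product over effects of (1 - alpha[l, j]), and a credible set is the smallest set of SNPs whose probabilities sum to at least the coverage. The alpha matrix and the helper credible_set() are made up for illustration, and susieR additionally filters credible sets by purity, which is not shown here.

```r
# Hypothetical posterior probabilities for L = 2 single effects over 5 SNPs
# (each row sums to 1)
alpha <- rbind(c(0.97, 0.01, 0.01, 0.005, 0.005),
               c(0.02, 0.40, 0.38, 0.15, 0.05))

# PIP: probability that a SNP is selected by at least one effect
pip <- 1 - apply(1 - alpha, 2, prod)

# Hypothetical helper: smallest set of SNPs whose probabilities reach the coverage
credible_set <- function(a, coverage = 0.95) {
  ord <- order(a, decreasing = TRUE)
  n_keep <- which(cumsum(a[ord]) >= coverage - 1e-12)[1]
  sort(ord[seq_len(n_keep)])
}

cs <- lapply(seq_len(nrow(alpha)), function(l) credible_set(alpha[l, ]))
print(round(pip, 3))
print(cs)
```

A sharply peaked row gives a credible set with a single SNP, while a diffuse row needs several SNPs to reach 95% coverage.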

**Exercise 4.**

What are the genomic coordinates of the putative causal SNPs? 

In [None]:
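cs_idx <- as.integer(unlist(strsplit(as.character(summary(fitted_rss)$cs$variable), ","))) #also handles credible sets with several SNPs
pos <- snps$V4[match(sumstats$SNP[cs_idx], snps$V2)]
pos
#chr10: 21849769 21989245 21983960 21856475 21884471 22062729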
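
When a credible set contains more than one SNP, the variable column of the summary table holds a comma-separated string of indices rather than a single number. That is an assumption about how susieR formats the table, so inspect your own output. A sketch of the index-to-coordinate lookup on a made-up table, with hypothetical SNP IDs and positions:

```r
# Hypothetical stand-in for summary(fitted_rss)$cs
cs_table <- data.frame(cs = c(1, 2), variable = c("3", "1,4"))
# Hypothetical SNP IDs in fine-mapping order, and a .bim-style lookup table
snp_ids <- c("rsA", "rsB", "rsC", "rsD")
bim <- data.frame(V2 = c("rsD", "rsC", "rsB", "rsA"),
                  V4 = c(400, 300, 200, 100))

# Split each string into integer indices, one vector per credible set
idx <- lapply(strsplit(as.character(cs_table$variable), ","), as.integer)
# Map indices -> SNP IDs -> base-pair positions
cs_pos <- lapply(idx, function(i) bim$V4[match(snp_ids[i], bim$V2)])
names(cs_pos) <- paste0("CS", cs_table$cs)
print(cs_pos)
```

Using match() on SNP IDs rather than relying on row order keeps the lookup correct even when the two tables are sorted differently.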

**Exercise 5.**

What is L and how does changing L change the output of SuSiE, if at all? Test L = {1,2,3,4,5,6,7,8,9,10}.

In [None]:
for (i in 1:10){
  fitted_rss = susie_rss(z = sumstats$Z, R = as.matrix(R), n = 459324, L = i,
                        estimate_residual_variance = TRUE)
  print(paste("L =", i)) #label the output for each value of L
  print(summary(fitted_rss)$cs)
  susie_plot(fitted_rss, y="PIP", main = paste("L =", i))
}
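
One way to build intuition for L is a small simulation in which the truth is known. The sketch below is not part of the original exercise. It simulates independent SNPs with two causal variants (all values hypothetical), computes marginal z-scores from correlations, and counts the credible sets reported for several values of L. Because each credible set corresponds to one single effect, the count can never exceed L.

```r
library(susieR)

set.seed(1)
n <- 500   # individuals (hypothetical)
p <- 50    # SNPs (hypothetical)
X <- matrix(rnorm(n * p), n, p)
beta <- rep(0, p)
beta[c(5, 30)] <- 1                      # two simulated causal SNPs
y <- drop(X %*% beta + rnorm(n))

# Marginal z-scores from the correlation between each SNP and the phenotype
r <- drop(cor(X, y))
z <- r * sqrt((n - 2) / (1 - r^2))
R <- cor(X)                              # in-sample LD matrix

L_values <- c(1, 2, 5)
n_cs <- sapply(L_values, function(L) {
  fit <- susie_rss(z = z, R = R, n = n, L = L)
  length(fit$sets$cs)                    # number of credible sets reported
})
print(data.frame(L = L_values, n_cs = n_cs))
```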

**Exercise 6.**

Is it reasonable that there are this many causal variants in one locus? What else might be going on to explain this behavior? 


ANSWER: The causal variant may be absent from the dataset. If these 6 variants tag the missing causal variant equally well, SuSiE can do no better than report all 6 as candidate causal variants.

**Exercise 7.**

How does run time scale with the number of SNPs in a locus? 

In [None]:
#You may use the following pattern to time your code: 

start <- proc.time()
#run some code 
elapsed_time <- (proc.time() - start)[3] #third element is the elapsed (wall-clock) time in seconds


Rmat <- as.matrix(R) #convert once, outside the timed loop
snp_size <- seq(2, ncol(Rmat), 20)
elapsed_time <- numeric(length(snp_size)) #preallocate instead of growing the vector
for (k in seq_along(snp_size)){
  i <- snp_size[k]
  print(i)
  start <- proc.time()
  fitted_rss = susie_rss(z = sumstats$Z[1:i], R = Rmat[1:i,1:i], n = 459324, L = 10,
                        estimate_residual_variance = TRUE)
  elapsed_time[k] <- (proc.time() - start)[3]
}
plot(snp_size, elapsed_time, xlab = "Number of SNPs", ylab = "Elapsed time (s)")
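
An equivalent way to time code is system.time(), which wraps the proc.time() bookkeeping. The sketch below uses a hypothetical helper, time_sizes(), and times a base-R matrix inversion as a stand-in for susie_rss() so that it runs without the course data. The matrix sizes are arbitrary.

```r
# Hypothetical helper: time f(size) for each size and return elapsed seconds
time_sizes <- function(sizes, f) {
  elapsed <- numeric(length(sizes))      # preallocate instead of growing with c()
  for (k in seq_along(sizes)) {
    elapsed[k] <- system.time(f(sizes[k]))[["elapsed"]]
  }
  data.frame(size = sizes, elapsed = elapsed)
}

set.seed(1)
# Stand-in workload: build and invert a random p x p cross-product matrix
invert_random <- function(p) {
  X <- matrix(rnorm(2 * p * p), 2 * p, p)
  solve(crossprod(X))
}

timings <- time_sizes(c(50, 100, 200, 400), invert_random)
print(timings)
plot(timings$size, timings$elapsed, type = "b",
     xlab = "Matrix dimension", ylab = "Elapsed time (s)")
```

To use it with the exercise, the function passed as f would instead run susie_rss() on the first i SNPs. Small problems can finish faster than the timer resolution, so an elapsed time of 0 is possible.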