---
### **Data Bootcamp for Genomic Prediction in Plant Breeding** ###
### **University of Minnesota Plant Breeding Center** ###
#### **June 20 - 22, 2022** ####
---

### **Practical 2: Genomic Prediction using Bayesian Models in BGLR** ###

<br />
<br />

#### **Source Scripts and Load Data**


In [6]:
WorkDir <- getwd()
setwd(WorkDir)

##Source in functions to be used
source("R_Functions/GS_Pipeline_Jan_2022_FnsApp.R")
source("R_Functions/bootcamp_functions.R")
gc()



Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



   *****       ***   vcfR   ***       *****
   This is vcfR 1.12.0 
     browseVignettes('vcfR') # Documentation
     citation('vcfR') # Citation
   *****       *****      *****       *****



Attaching package: 'bWGR'


The following objects are masked from 'package:NAM':

    CNT, GAU, GRM, IMP, KMUP, KMUP2, SPC, SPM, emBA, emBB, emBC, emBL,
    emCV, emDE, emEN, emGWA, emML, emML2, emRR, markov, mkr, mkr2X,
    mrr, mrr2X, mrrFast, wgr



Attaching package: 'emoa'


The following object is masked from 'package:dplyr':

    coalesce



Attaching package: 'MASS'


The following object is masked from 'package:dplyr':

    select



Attaching package: 'sommer'


The following objects are masked from 'package:rrBLUP':

    A.mat, GWAS


Welcome to rTASSEL (version 0.9.26)
 • Conside

           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  5804502 310.0    8802824 470.2  6426122 343.2
Vcells 10019363  76.5   15504737 118.3 12851603  98.1


#### **Read Genotype File using vcfR** ####

In [7]:

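##Load in genotype data. Use package vcfR to read in and work with the VCF file.
infileVCF <- "Data/SoyNAM_Geno.vcf"
## Optional raw tabular view of the VCF body (read.table skips the '#' header lines); not used below.
genotypes_VCF <- read.table(infileVCF)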
vcf <- read.vcfR(infileVCF, verbose = FALSE)
vcf

***** Object of Class vcfR *****
5189 samples
20 CHROMs
4,292 variants
Object size: 171.1 Mb
25.41 percent missing data
*****        *****         *****


#### **Convert VCF file format to numerical matrix format.**
#### Final genotype matrix is geno_num

In [6]:
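## Extract genotype calls as character strings (markers in rows, samples in columns)
gt <- extract.gt(vcf, element = "GT", as.numeric = FALSE)
fix_T <- as_tibble(getFIX(vcf))

## Recode calls as alternate-allele dosage on a 0 / 0.5 / 1 scale.
## Both homozygous classes are recoded before the heterozygote pattern,
## and "/" (unphased) and "|" (phased) separators are both handled.
gt2a <- apply(gt, 2, function(x) gsub("1[/|]1", "1", x))
gt2b <- gsub("0[/|]0", "0", gt2a)
gt2c <- gsub("[10][/|][10]", "0.5", gt2b)
gt2d <- gsub("\\.[/|]\\.", "NA", gt2c)

## Convert to numeric and transpose so that samples are rows and markers are columns
gt2d_num <- apply(gt2d, 2, as.numeric)
rownames(gt2d_num) <- rownames(gt2d)
geno_num <- t(gt2d_num)
dim(geno_num)
rm(list=grep("gt2",ls(),value=TRUE))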
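The recoding step can be checked on a small example before running it on the full VCF. The sketch below is only an illustration: `gt_toy` and the helper `recode_gt()` are hypothetical names, not part of vcfR, and the toy calls are made up. It maps each GT string to the same 0 / 0.5 / 1 scale used for `geno_num`, treating phased and unphased calls alike.

```r
# Toy genotype calls: markers in rows, samples in columns (all names are made up).
gt_toy <- matrix(
  c("0/0", "0|1", "1/1",
    "1|1", "./.", "1/0"),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("m1", "m2"), c("s1", "s2", "s3"))
)

# Hypothetical helper: map GT strings to alternate-allele dosage (0, 0.5, 1).
# Anything that is not a recognised diploid call (e.g. "./.") stays NA.
recode_gt <- function(gt) {
  out <- matrix(NA_real_, nrow = nrow(gt), ncol = ncol(gt),
                dimnames = dimnames(gt))
  out[grepl("^0[/|]0$", gt)] <- 0
  out[grepl("^(0[/|]1|1[/|]0)$", gt)] <- 0.5
  out[grepl("^1[/|]1$", gt)] <- 1
  out
}

# Transpose so that samples are rows and markers are columns.
geno_toy <- t(recode_gt(gt_toy))
print(geno_toy)
```

Matching whole strings with `^...$` avoids the order dependence of chained `gsub()` calls, at the cost of one pass per genotype class.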


#### **Filter Genotypic Data**

In [4]:
##Filter markers on % missing
miss <- function(x){length(which(is.na(x)))}
mrkNA <- (apply(geno_num, MARGIN=2, FUN=miss))/dim(geno_num)[1]
ndx <- which(mrkNA > 0.2)

if (length(ndx)>0) geno_num2 <- geno_num[, -ndx] else geno_num2 <- geno_num

##Filter individuals on % missing
indNA <- (apply(geno_num2, MARGIN=1, FUN=miss))/dim(geno_num2)[2]
ndx2 <- which(indNA > 0.5)

 if(length(ndx2)>0) geno_num3 <- geno_num2[-ndx2, ] else geno_num3 <- geno_num2


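##Filter markers based on MAF
## With 0 / 0.5 / 1 coding, the column mean is the alternate-allele frequency
maf <- apply(geno_num3, MARGIN=2, FUN=mean, na.rm=T)
ndx3 <- which(maf<0.05 | maf>0.95) 

## Subset geno_num3 (not geno_num2) so the individual filter above is kept
if (length(ndx3)>0) geno_num4 <- geno_num3[, -ndx3] else geno_num4 <- geno_num3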
  
dim(geno_num4)
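The three filters above can also be wrapped in one function so that each step always operates on the output of the previous one. This is a minimal sketch, assuming genotypes coded 0 / 0.5 / 1 with samples in rows; `filter_geno()` and the toy matrix are hypothetical, and the default thresholds simply mirror the values used in the cell above.

```r
# Hypothetical helper: sequential marker-missingness, individual-missingness
# and allele-frequency filters on a samples x markers matrix coded 0 / 0.5 / 1.
filter_geno <- function(geno, max_mrk_na = 0.2, max_ind_na = 0.5, min_maf = 0.05) {
  # 1. Drop markers whose proportion of missing calls exceeds max_mrk_na.
  keep_mrk <- colMeans(is.na(geno)) <= max_mrk_na
  geno <- geno[, keep_mrk, drop = FALSE]

  # 2. Drop individuals whose proportion of missing calls exceeds max_ind_na.
  keep_ind <- rowMeans(is.na(geno)) <= max_ind_na
  geno <- geno[keep_ind, , drop = FALSE]

  # 3. Drop markers whose allele frequency falls outside [min_maf, 1 - min_maf].
  freq <- colMeans(geno, na.rm = TRUE)
  keep_maf <- !is.na(freq) & freq >= min_maf & freq <= 1 - min_maf
  geno[, keep_maf, drop = FALSE]
}

# Toy data (made up): m2 has too many missing calls, m3 is monomorphic,
# and individual i7 is missing at every remaining marker.
geno_toy <- cbind(
  m1 = c(0, 0.5, 1, 1, 0, 0.5, NA),
  m2 = c(NA, NA, 1, 0, 0, 1, 1),
  m3 = c(0, 0, 0, 0, 0, 0, NA),
  m4 = c(1, 0, 0.5, 1, 0, 1, NA)
)
rownames(geno_toy) <- paste0("i", 1:7)

filtered <- filter_geno(geno_toy)
dim(filtered)
```

Because every step reassigns `geno`, there is no chance of subsetting an earlier intermediate object by mistake.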

#### **Import Phenotypic Data and Merge Geno-Pheno Data**

In [5]:

pheno <- read.csv("Data/SoyNAM_Pheno.csv")

geno_num4_x <- cbind(rownames(geno_num4),geno_num4)
colnames(geno_num4_x)[1]<- "strain"

### Harmonize strain names (strip "-" and ".") so they have the same format in pheno and geno 
pheno[,1] <- gsub("[-.]","",pheno[,1])
geno_num4_x[,1] <- gsub("[-.]","",geno_num4_x[,1])

## Merge Geno and Pheno Data
Data <- merge(geno_num4_x,pheno,by="strain",all=TRUE)

## Remove records with missing yield values 

YldNA_Indices <- which(is.na(Data$yield))
if(length(YldNA_Indices) >0){Data_Sub <- Data[-YldNA_Indices,]}else{Data_Sub <- Data}


genoStrain <- unique(as.character(geno_num4_x[,"strain"]))

genoStrainIndices <- which(Data_Sub[,"strain"] %in% genoStrain)
length(genoStrainIndices)
genoIndices <- grep("ss",colnames(geno_num4_x))
initGenoIndx <- genoIndices[1]
finalGenoIndx <- genoIndices[length(genoIndices)]
phenoIndices <- c(1,c((finalGenoIndx+1):ncol(Data_Sub)))

pheno_sub <- Data_Sub[genoStrainIndices,phenoIndices]
geno_num4b <- Data_Sub[genoStrainIndices,c(1,genoIndices)]


uniqueStrainIndices<- which(!duplicated(geno_num4b[,"strain"]))

if(length(uniqueStrainIndices)>0) {geno_num5 <- geno_num4b[uniqueStrainIndices,]}else{geno_num5 <- geno_num4b}

dim(geno_num5)

rm(geno_num4b)
rm(geno_num4)
rm(geno_num3)
rm(geno_num2)

### set 'yield' colname to 'Yield_blup'

yldCol <- which(colnames(pheno_sub) %in% "yield")
colnames(pheno_sub)[yldCol] <- "Yield_blup" 



#### **Subset Environments** 

In [6]:
### Select the first 3 environs (in table order) that have more than 5100 records (lines evaluated)  

env_sub <-  names(which(table(pheno_sub[,"environ"])>5100)[1:3])

env_sub_indices <- which(pheno_sub[,"environ"] %in% env_sub)

## Subset the phenotype table to the selected environs 
DT <- pheno_sub[env_sub_indices,]

DT$environ <- as.factor(DT$environ)

dim(DT)
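If the aim is strictly the three environments with the most records, sorting the counts avoids relying on a fixed threshold. The sketch below uses a made-up phenotype table (`pheno_toy` and its environment names are hypothetical); only the column name `environ` is taken from the cell above.

```r
# Toy phenotype table (made up): one row per evaluated line and environment.
pheno_toy <- data.frame(
  strain  = paste0("L", 1:12),
  environ = c(rep("IA_2012", 5), rep("IL_2011", 2),
              rep("NE_2012", 4), "KS_2013")
)

# Count records per environment, sort from largest to smallest, keep the top 3.
env_counts <- sort(table(pheno_toy$environ), decreasing = TRUE)
env_top3 <- names(env_counts)[1:3]

# Subset the table to those environments.
DT_toy <- pheno_toy[pheno_toy$environ %in% env_top3, ]
DT_toy$environ <- as.factor(DT_toy$environ)
table(DT_toy$environ)
```

With ties at the cutoff, which environment is kept depends on the sort order, so it is worth inspecting `env_counts` before subsetting.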

#### **Impute Genotype Table** ####

In [7]:
#### Impute the genotype table with markov(). Both 'NAM' and 'bWGR' export markov(); given the load order shown above, the 'bWGR' version masks the 'NAM' one.

geno_imp <- markov(apply(geno_num5[,-1],2,as.numeric))
rownames(geno_imp) <- geno_num5[,"strain"]
dim(geno_imp)

In [None]:
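# Reduce the number of RILs in the dataset to shorten computation time for the demonstration (we don't want to spend all of our time watching our computer work!)
# NOTE: pheno2 is not created in the cells above. It must hold one row per RIL, in the same row order as geno_imp, and include the Seedsize column used below.

set.seed(2022)  # arbitrary seed, so the random subset is reproducible
ssNdx <- sample.int(n=dim(pheno2)[1], size=min(5000, dim(pheno2)[1]))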
geno_imp_sub <- geno_imp[ssNdx, ]
pheno2_sub <- pheno2[ssNdx, ]
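The subsampling cell relies on `pheno2`, which the cells shown here never build. One possible construction, offered only as a sketch, is to average each line's phenotype across environments and then reorder the result to match the genotype matrix. Everything below is an assumption for illustration: the toy objects are made up, and the original `pheno2` may have been derived differently (for example from BLUPs rather than raw means).

```r
# Toy long-format phenotypes (made up): one row per line x environment.
DT_toy <- data.frame(
  strain   = c("A1", "A1", "A2", "A2", "B1"),
  environ  = c("E1", "E2", "E1", "E2", "E1"),
  Seedsize = c(10, 12, 14, NA, 9)
)

# Toy genotype matrix whose row order differs from the phenotype table.
geno_toy <- matrix(0, nrow = 3, ncol = 2,
                   dimnames = list(c("B1", "A1", "A2"), c("m1", "m2")))

# Mean phenotype per line; the formula interface drops rows with missing values.
line_means <- aggregate(Seedsize ~ strain, data = DT_toy, FUN = mean)

# Reorder so that row i of the phenotype table matches row i of the genotypes.
pheno2_toy <- line_means[match(rownames(geno_toy), line_means$strain), ]
rownames(pheno2_toy) <- NULL

# Guard against silent misalignment before any model fitting.
stopifnot(identical(as.character(pheno2_toy$strain), rownames(geno_toy)))
pheno2_toy
```

The final `stopifnot()` is the important part: any row-indexed subsetting of genotypes and phenotypes with a shared index is only valid if the two tables are in the same order.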


In [None]:
### BGLR model fitting ----
# Use the BGLR package to fit various types of models. BRR = Bayesian ridge regression, BL = Bayes LASSO, BayesA, BayesB, BayesC 


In [8]:
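# Hold out one family to perform a validation analysis
# Use line coding to identify RILs by family: stripping the last three characters of a line name leaves the family code.
fam <- gsub("...$", "", rownames(geno_imp_sub))
# Strain names had "-" and "." stripped before the merge, so compare against the cleaned family code (DS11-64 becomes DS1164).
ndxFam <- which(gsub("[-.]", "", fam) == "DS1164")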

pheno2_sub_trn <- pheno2_sub

pheno2_sub_trn$Seedsize[ndxFam] <- NA

# A.mat() expects marker scores coded {-1, 0, 1}, so rescale the 0 / 0.5 / 1 dosages first
G <- A.mat(2 * geno_imp_sub - 1)

ETA <- list(list(K=NULL, X=geno_imp_sub, model='BayesB', probIn=.10))

model_bglr <- BGLR(y=pheno2_sub_trn$Seedsize, ETA=ETA, burnIn=500, nIter=2000, verbose=FALSE)
gebv_bglr <- model_bglr$yHat

# Extract marker effect estimates from the model object. Try different models, storing each set of effects under its own name (e.g., "bhat_brr" for BRR), and plot them against one another on a scatter plot.

bhat <- model_bglr$ETA[[1]]$b

# Plot the squared marker effects against marker index
plot(bhat^2, ylab='Estimated squared marker effect', type='o')


##Correlate observed phenotypes of the RILs left out of the analysis with their predictions
cor(pheno2_sub$Seedsize[ndxFam],  gebv_bglr[ndxFam])
plot(pheno2_sub$Seedsize[ndxFam],  gebv_bglr[ndxFam])

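# Fit a two-component model using BGLR to treat some large-effect QTL as fixed effects, and the remaining markers as random effects through a genomic relationship kernel. QTL here were previously declared significant using a GWAS analysis. The marker (column) indices of the QTL were 1926, 829, 683, 678.
qtl <- c(1926, 829, 683, 678)

# probIn is only used by BayesB/BayesC, and RKHS works from the kernel K, so K is built from the non-QTL markers (rescaled to the {-1, 0, 1} coding A.mat() expects).
G_bg <- A.mat(2 * geno_imp_sub[, -qtl] - 1)
ETA_mk <- list(list(X=geno_imp_sub[, qtl], model='FIXED'), list(K=G_bg, model='RKHS'))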

model_bglr_mk <- BGLR(y=pheno2_sub_trn$Seedsize, ETA=ETA_mk, burnIn=500, nIter=2000, verbose=FALSE)

gebv_bglr_mk <- model_bglr_mk$yHat

cor(pheno2_sub$Seedsize[ndxFam],  gebv_bglr_mk[ndxFam])
plot(pheno2_sub$Seedsize[ndxFam],  gebv_bglr_mk[ndxFam])
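The validation scheme used in this cell (mask one family's phenotypes, fit on the rest, correlate predictions with the held-out observations) can be tried without any MCMC run. The sketch below is not the BGLR analysis: it swaps in a closed-form ridge regression on simulated data purely to show the masking logic. The family codes, line names, marker effects and the shrinkage value `lambda` are all made up.

```r
set.seed(1)

# Simulated line names: family code plus a three-digit line suffix, e.g. "F01007".
fams <- c("F01", "F02", "F03")
n_per_fam <- 30
ids <- paste0(rep(fams, each = n_per_fam), sprintf("%03d", seq_len(n_per_fam)))
n <- length(ids)
m <- 50

# Simulated genotypes (0 / 0.5 / 1) and phenotypes.
X <- matrix(sample(c(0, 0.5, 1), n * m, replace = TRUE, prob = c(0.45, 0.10, 0.45)),
            nrow = n, ncol = m, dimnames = list(ids, NULL))
beta <- rnorm(m, sd = 0.5)
y <- as.vector(X %*% beta + rnorm(n))

# Recover family codes by stripping the three-character suffix, then mask one family.
fam <- sub("...$", "", ids)
test <- which(fam == "F02")
y_trn <- y
y_trn[test] <- NA

# Ridge regression fitted on the unmasked lines only (stand-in for BGLR).
trn <- !is.na(y_trn)
Xc <- scale(X, center = colMeans(X[trn, ]), scale = FALSE)
mu <- mean(y_trn[trn])
lambda <- 10  # arbitrary shrinkage value for illustration
b <- solve(crossprod(Xc[trn, ]) + diag(lambda, m),
           crossprod(Xc[trn, ], y_trn[trn] - mu))

# Predict every line, then score only the held-out family.
y_hat <- as.vector(mu + Xc %*% b)
pred_ability <- cor(y[test], y_hat[test])
pred_ability
```

Centering the markers with training-set means only, as done here, keeps information from the held-out family out of the fitted model.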







#### **Discuss other ways to model these scenarios and refine these models**