This notebook contains the instructions for reproducing the results presented in "*Environmental and genealogical signals on DNA methylation in a widespread apomictic dandelion lineage*" by V.N. Ibañez, M. van Antro, C. Peña Ponton, S. Ivanovic, C.A.M. Wagemaker, F. Gawehns, K.J.F. Verhoeven.

## Load data and set R environment

In this section, we download the input files, set up the working directory, and prepare the R environment needed to run the scripts.

In [None]:
#@title Load files
%load_ext rpy2.ipython
# Caution: this deletes every non-hidden file and folder in the current working directory.
!rm -r *
!mkdir results rawData annotation scripts plots tmp

!wget -c -O scripts/commonFunctions.R https://raw.githubusercontent.com/VeronicaNoe/epiTree/main/Rscripts/commonFunctions.R
!wget -c -O rawData/AseI-NsiI_Design_withPlotInfos.txt https://raw.githubusercontent.com/VeronicaNoe/epiTree/main/data4r/AseI-NsiI_Design_withPlotInfos.txt
!wget -c -O rawData/Csp6I-NsiI_Design_withPlotInfos.txt https://raw.githubusercontent.com/VeronicaNoe/epiTree/main/data4r/Csp6I-NsiI_Design_withPlotInfos.txt

!wget -c -O rawData/AseI-NsiI_methylation.filtered https://raw.githubusercontent.com/VeronicaNoe/epiTree/main/data4r/AseI-NsiI_petite.methylation.filtered
!wget -c -O rawData/Csp6I-NsiI_methylation.filtered https://raw.githubusercontent.com/VeronicaNoe/epiTree/main/data4r/Csp6I-NsiI_petite.methylation.filtered



!wget -c -O annotation/Csp6I-NsiI_mergedAnnot.csv https://raw.githubusercontent.com/VeronicaNoe/epiTree/main/data4r/Csp6I-NsiI_mergedAnnot.csv
!wget -c -O annotation/AseI-NsiI_mergedAnnot.csv https://raw.githubusercontent.com/VeronicaNoe/epiTree/main/data4r/AseI-NsiI_mergedAnnot.csv
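
Because `wget -O` creates the output file even when a download fails, it can help to confirm that every input exists and is not empty before moving on. The following is a minimal sketch for a plain Python cell. The helper name `missing_or_empty` and the `EXPECTED_FILES` list are illustrative and not part of the original notebook; the paths simply mirror the `wget` commands above.

```python
from pathlib import Path

# Relative paths written by the wget commands above.
EXPECTED_FILES = [
    "scripts/commonFunctions.R",
    "rawData/AseI-NsiI_Design_withPlotInfos.txt",
    "rawData/Csp6I-NsiI_Design_withPlotInfos.txt",
    "rawData/AseI-NsiI_methylation.filtered",
    "rawData/Csp6I-NsiI_methylation.filtered",
    "annotation/AseI-NsiI_mergedAnnot.csv",
    "annotation/Csp6I-NsiI_mergedAnnot.csv",
]


def missing_or_empty(base_dir, relative_paths=EXPECTED_FILES):
    """Return the relative paths that are absent or have size zero (hypothetical helper)."""
    base = Path(base_dir)
    problems = []
    for rel in relative_paths:
        path = base / rel
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(rel)
    return problems


if __name__ == "__main__":
    bad = missing_or_empty(".")
    print("All input files present." if not bad else f"Check these downloads: {bad}")
```

If the list is not empty, re-running the corresponding `wget` line is usually enough.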


In [3]:
%%R
#@title Set R environment
rm(list=ls())
wd<-getwd()
baseDir <- gsub("/results", "", wd)
scriptDir <- file.path(baseDir, "scripts")


In [4]:
%%R
#@title Install R packages
install.packages(c("data.table","reshape2","dplyr","plyr", "BiocManager"),quiet=TRUE)





In [5]:
%%R
#@title Install DSS from Bioconductor (takes a long time)
# update = FALSE and ask = FALSE skip the interactive "Update all/some/none? [a/s/n]" prompt,
# which is equivalent to answering "n".
BiocManager::install("DSS", update = FALSE, ask = FALSE, quiet = TRUE)


In [6]:
%%R
#@title Load packages silently
## load packages silently
suppressPackageStartupMessages({
  library("data.table") # file reading
  library("DSS")
  library("plyr")
  library("dplyr")
  library("reshape2")
  #library("tidyr")
  #library("tidyverse")
  source(file.path(scriptDir, "commonFunctions.R"))
})

# Analyzing data step-by-step

In this section, we walk through the code chunk by chunk for a single dataset: *AseI-NsiI*.


## Load and explore data

In [7]:
%%R
#@title Load and explore the AseI-NsiI data
RE<-"AseI-NsiI"
designTable <- file.path(paste0(baseDir, "/rawData/",RE, "_Design_withPlotInfos.txt"))
infileName <- file.path(paste0(baseDir,"/rawData/",RE,"_methylation.filtered"))
annotationFile <- file.path(paste0(baseDir, "/annotation/",RE, "_mergedAnnot.csv"))
## Load data
sampleTab <- f.read.sampleTable(designTable) # see commonFunctions.R
Data <- f.load.methylation.bed(infileName) # see commonFunctions.R
Trt<-c()
Ac<-c()
feature <- c("gene", "transposon", "repeat", "nothing")
ctxt <- c("CHH", "CG", "CHG")
dataFeatCtxt<-c()
dfDMC.treat<-NULL
dfDMC.acc<-NULL


===  2022 Sep 17 03:13:14 PM === Removing 0 samples due to the sampleRemovalInfo column 


In [17]:
%%R
#@title Calculate DMCs
for (i in seq_along(feature)){
  # Annotation for the current feature; its row names are matched against Data$chr
  subAnno <- f.load.merged.annotation(annotationFile, feature[i]) # see commonFunctions.R
  if (feature[i] != "all"){
    # Keep only the cytosines located in regions annotated with this feature
    toKeep <- gsub("chr", "", rownames(subAnno))
    before <- nrow(Data)
    commonChr <- sort(intersect(toKeep, Data$chr))
    myData <- Data[as.character(Data$chr) %in% as.character(commonChr),]
    afterFeature <- nrow(myData)
    cat(paste("Subsetting by feature: ",feature[i], nrow(myData), "rows \n"))
  }
  for (j in seq_along(ctxt)){
    if (ctxt[j] != "all") {
      # Keep only the cytosines in the current sequence context
      contextFilter <- ctxt[j]
      myD <- subset(myData, context == contextFilter)
    }
    # Sample names are the column prefixes before "_total"
    allSamples <- gsub("_total$", "", grep("_total$", colnames(myD), value = TRUE))
    sampleTab <- sampleTab[allSamples,]
    myD$chr <- as.numeric(myD$chr)
    # DSS expects one data frame per sample with columns chr, pos, N (total reads) and X (methylated reads)
    forDSS <- list()
    for (curSample in allSamples) {
      tempTab <- data.frame(
        chr = myD$chr,
        pos = myD$pos,
        N = myD[[paste0(curSample, "_total")]],
        X = myD[[paste0(curSample, "_methylated")]],
        stringsAsFactors = FALSE
        )
      forDSS[[curSample]] <- tempTab
    }
    myBS <- makeBSseqData(forDSS, names(forDSS))
    # Fit one model with both factors, then test each term separately
    myFit <- DMLfit.multiFactor(myBS, sampleTab, formula=~Treat+Acc)
    cat(paste(ctxt[j], nrow(myD), "rows \n"))
    testRes.Treat <- DMLtest.multiFactor(myFit, term="Treat")
    testRes.Acc <- DMLtest.multiFactor(myFit, term="Acc")
    # One result file per factor, context and feature
    write.csv(testRes.Treat, paste0(baseDir,"/tmp/",RE,"_","Treat_",ctxt[j],"_", feature[i], "_DMC_analysis.csv"),row.names = FALSE)
    write.csv(testRes.Acc, paste0(baseDir,"/tmp/",RE,"_","Acc_",ctxt[j],"_", feature[i],"_DMC_analysis.csv"),row.names = FALSE)
  }
}

Subsetting by feature:  gene 3712 rows 
Fitting DML model for CpG site: CHH 2615 rows 
Fitting DML model for CpG site: CG 610 rows 
Fitting DML model for CpG site: CHG 487 rows 
Subsetting by feature:  transposon 3312 rows 
Fitting DML model for CpG site: CHH 2290 rows 
Fitting DML model for CpG site: CG 604 rows 
Fitting DML model for CpG site: CHG 418 rows 
Subsetting by feature:  repeat 1870 rows 
Fitting DML model for CpG site: CHH 1406 rows 
Fitting DML model for CpG site: CG 266 rows 
Fitting DML model for CpG site: CHG 198 rows 
Subsetting by feature:  nothing 12525 rows 
Fitting DML model for CpG site: CHH 8828 rows 
Fitting DML model for CpG site: CG 2231 rows 
Fitting DML model for CpG site: CHG 1466 rows 
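
Each result file written above has an `fdrs` column, which the next section thresholds at 0.05. As background, here is a small Python sketch of the Benjamini-Hochberg adjustment. It assumes that `fdrs` holds Benjamini-Hochberg adjusted p-values, which is an assumption about DSS that should be checked against its documentation. The function name `benjamini_hochberg` is illustrative and not part of the notebook.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda k: pvals[k])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never increase
    # as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    print([round(x, 4) for x in benjamini_hochberg([0.01, 0.04, 0.03, 0.005])])
    # [0.02, 0.04, 0.04, 0.02]
```

Under this assumption, a cytosine with `fdrs <= 0.05` belongs to a set in which the expected proportion of false positives is at most 5%.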


## Summarize DMC results



In [18]:
%%R
#@title Summarize DMCs per result file
#### summary table
inFiles <- list.files(path=paste0(baseDir,"/tmp"), pattern = "_DMC_analysis.csv$")
minFDR <- 0.05 # FDR threshold used to call a cytosine differentially methylated
outdf <- matrix(NA, nrow=length(inFiles), ncol = 7)
colnames(outdf) <- c("RE","Factor","Context","Feature","# DMC","Total_Cs","Region")
for (i in seq_along(inFiles)){
  input <- read.csv(paste0(baseDir,"/tmp/",inFiles[i]),header=TRUE, stringsAsFactors = FALSE, sep=",")
  # File names follow <RE>_<factor>_<context>_<feature>_DMC_analysis.csv
  fileNames <- strsplit(inFiles[i], "_" )
  subDF <- dplyr::filter(input, fdrs <= minFDR)
  outdf[i,1] <- fileNames[[1]][1] # which RE
  outdf[i,2] <- fileNames[[1]][2] # which factor
  outdf[i,3] <- fileNames[[1]][3] # which context
  outdf[i,4] <- fileNames[[1]][4] # which feature
  outdf[i,5] <- nrow(subDF) # number of DMCs
  outdf[i,6] <- nrow(input) # number of cytosines tested
  # Count the DMCs in each region ("chr")
  uniReg <- unique(input$chr)
  temp <- matrix(NA, nrow=length(uniReg), ncol = 2)
  colnames(temp) <- c("chr","ocurrences")
  for (j in seq_along(uniReg)){
    temp[j,1] <- uniReg[j]
    temp[j,2] <- sum(subDF$chr==uniReg[j])
  }
  temp <- data.frame(temp)
  temp <- temp[order(-temp$ocurrences),]
  # Flag and save the regions that hold at least five DMCs
  temp <- dplyr::filter(temp, ocurrences>=5)
  if (nrow(temp) != 0){
    outdf[i,7] <- "yes"
    write.table(temp,paste0(inFiles[i], "_DMC_region.csv"),row.names = FALSE, sep="\t", col.names=T, quote=FALSE)
  } else {
    outdf[i,7] <- "no"
  }
}
write.table(outdf,paste0("00_DMC_summary.csv"),row.names = FALSE, sep="\t", col.names=T, quote=FALSE)
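
The summary logic above reduces to three numbers per result file: how many cytosines pass the FDR threshold, how many were tested, and whether any region holds at least five significant cytosines. The Python sketch below restates that logic on plain dictionaries to make the rule explicit. The function `summarize_dmcs` and its parameter names are hypothetical, and the thresholds are copied from the R cell.

```python
from collections import Counter


def _as_float(value):
    """Parse a number; values such as R's "NA" become NaN and never pass a threshold."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return float("nan")


def summarize_dmcs(rows, max_fdr=0.05, min_hits=5):
    """Summarize one result table given as dicts with the keys "chr" and "fdrs"."""
    rows = list(rows)
    significant = [r for r in rows if _as_float(r["fdrs"]) <= max_fdr]
    hits_per_region = Counter(r["chr"] for r in significant)
    # Regions with at least `min_hits` significant cytosines
    dense = {region: n for region, n in hits_per_region.items() if n >= min_hits}
    return {
        "n_dmc": len(significant),
        "total_cs": len(rows),
        "region": "yes" if dense else "no",
        "dense_regions": dense,
    }
```

Rows read with `csv.DictReader` from one of the `_DMC_analysis.csv` files can be passed in directly, since string values are converted inside the function.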



In [None]:
%%R
#@title Combine all DMC results into one table
#### unique table
inFiles <- list.files(path=paste0(baseDir,"/tmp"), pattern = "_DMC_analysis.csv$")
toSave<-c()
for (i in 1:length(inFiles)){
  input<-read.csv(paste0(baseDir,"/tmp/",inFiles[i]), header=TRUE, stringsAsFactors = FALSE, sep=",")
  fileNames<-strsplit(inFiles[i], "_" )
  input$RE<-rep(fileNames[[1]][1], times=nrow(input))
  input$factor<-rep(fileNames[[1]][2], times=nrow(input))
  input$context<-rep(fileNames[[1]][3], times=nrow(input))
  input$feature<-rep(fileNames[[1]][4], times=nrow(input))
  toSave<-rbind(toSave, input)
}
write.table(toSave, "00_DMC_table.csv",row.names = FALSE, sep="\t", col.names=T, quote=FALSE)
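
Both tables rely on the result file names to recover the restriction enzyme pair, factor, context and feature, by splitting on underscores. The Python sketch below makes that naming convention explicit and checks that a name round-trips. All function and constant names are illustrative; the factor, context and feature values are taken from the cells above.

```python
from itertools import product

SUFFIX = "_DMC_analysis.csv"
FACTORS = ("Treat", "Acc")
CONTEXTS = ("CHH", "CG", "CHG")
FEATURES = ("gene", "transposon", "repeat", "nothing")
FIELD_NAMES = ("RE", "factor", "context", "feature")


def dmc_file_name(re_pair, factor, context, feature):
    """Build a result file name following the convention used in the R cells."""
    return f"{re_pair}_{factor}_{context}_{feature}{SUFFIX}"


def parse_dmc_file_name(name):
    """Split a result file name back into its four fields."""
    if not name.endswith(SUFFIX):
        raise ValueError(f"unexpected file name: {name}")
    fields = name[: -len(SUFFIX)].split("_")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected 4 underscore-separated fields in: {name}")
    return dict(zip(FIELD_NAMES, fields))


def expected_dmc_files(re_pair):
    """List every result file expected for one restriction enzyme pair."""
    return [
        dmc_file_name(re_pair, factor, context, feature)
        for feature, context, factor in product(FEATURES, CONTEXTS, FACTORS)
    ]
```

The split works because the enzyme pair names use a hyphen (`AseI-NsiI`) rather than an underscore; a field containing an underscore would shift the positions and mislabel the rows.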

# DMC

This section runs the previous steps for both datasets: *AseI-NsiI* and *Csp6I-NsiI*.

In [19]:
%%R
#@title Characterize both datasets: AseI-NsiI and Csp6I-NsiI
## process both datasets
RE<-c("AseI-NsiI", "Csp6I-NsiI")
for (r in 1:length(RE)){
  designTable <- file.path(paste0(baseDir, "/rawData/",RE[r], "_Design_withPlotInfos.txt"))
  infileName <- file.path(paste0(baseDir,"/rawData/",RE[r],"_methylation.filtered"))
  annotationFile <- file.path(paste0(baseDir, "/annotation/",RE[r], "_mergedAnnot.csv"))
  sampleTab <- f.read.sampleTable(designTable) # see commonFunctions.R
  Data <- f.load.methylation.bed(infileName) # see commonFunctions.R
  Trt<-c()
  Ac<-c()
  feature <- c("gene", "transposon", "repeat", "nothing")
  ctxt <- c("CHH", "CG", "CHG")
  dataFeatCtxt<-c()
  dfDMC.treat<-NULL
  dfDMC.acc<-NULL
  for (i in 1:length(feature)){
    subAnno <- f.load.merged.annotation(annotationFile, feature[i]) # see commonFunctions.R
    if (feature[i] != "all") {
      toKeep <- gsub("chr", "", rownames(subAnno))
      before <- nrow(Data)
      commonChr <- sort(intersect(toKeep, Data$chr))
      myData <- Data[as.character(Data$chr) %in% as.character(commonChr),]
      afterFeature <- nrow(myData)
      cat(paste("Subsetting by feature: ",feature[i], dim(myData)[1], "rows \n"))
      }
    for (j in 1:length(ctxt)){
      if (ctxt[j] != "all") {
        contextFilter <- ctxt[j]
        myD <- subset(myData, context == contextFilter)
        }
      allSamples <- gsub("_total$", "", grep("_total$", colnames(myD), value = TRUE))
      sampleTab <- sampleTab[allSamples,]
      myD$chr <- as.numeric(myD$chr)
      forDSS <- list()
      for (curSample in allSamples) {
        tempTab <- data.frame(
          chr = myD$chr,
          pos = myD$pos,
          N = myD[[paste0(curSample, "_total")]],
          X = myD[[paste0(curSample, "_methylated")]],
          stringsAsFactors = FALSE
          )
        forDSS[[curSample]] <- tempTab
        }
      myBS <- makeBSseqData(forDSS, names(forDSS))
      myFit <- DMLfit.multiFactor(myBS, sampleTab, formula=~Treat+Acc)
      cat(paste(ctxt[j],dim(myD)[1], "rows \n"))
      testRes.Treat <- DMLtest.multiFactor(myFit, term="Treat")
      testRes.Acc <- DMLtest.multiFactor(myFit, term="Acc")
      write.csv(testRes.Treat, paste0(baseDir,"/tmp/",RE[r],"_","Treat_",ctxt[j],"_", feature[i], "_DMC_analysis.csv"),row.names = FALSE)
      write.csv(testRes.Acc, paste0(baseDir,"/tmp/",RE[r],"_","Acc_",ctxt[j],"_", feature[i],"_DMC_analysis.csv"),row.names = FALSE)
    }
  }
}
#### summary table
inFiles <- list.files(path=paste0(baseDir,"/tmp"), pattern = "_DMC_analysis.csv$")
toSave<-c()
outdf<-matrix(NA, nrow=length(inFiles), ncol = 7)
  for (i in 1:length(inFiles)){
    minFDR<-0.05
    input<-read.csv(paste0(baseDir,"/tmp/",inFiles[i]),header=TRUE, stringsAsFactors = FALSE, sep=",")
    fileNames<-strsplit(inFiles[i], "_" )
    subDF<-dplyr::filter(input, fdrs <= minFDR)
    colnames(outdf)<-c("RE","Factor","Context","Feature","# DMC","Total_Cs","Region")
    outdf[i,1]<-fileNames[[1]][1] #which RE
    outdf[i,2]<-fileNames[[1]][2] #which factor
    outdf[i,3]<-fileNames[[1]][3] #which context
    outdf[i,4]<-fileNames[[1]][4] #which feature
    outdf[i,5]<-nrow(subDF)
    outdf[i,6]<-nrow(input)
    uniReg<-unique(input$chr)
    temp<-matrix(NA, nrow=length(uniReg), ncol = 2)  
    for(j in 1:length(uniReg)){
      hits<-sum(subDF$chr==uniReg[j])
      colnames(temp) <- c("chr","ocurrences")
      temp[j,1]<-uniReg[j]
      temp[j,2]<-hits
    }
    temp<-data.frame(temp)
    temp <- temp[order(-temp$ocurrences),]
    temp<-dplyr::filter(temp, ocurrences>=5)
    if (dim(temp)[1]!= 0){
      outdf[i,7]<-"yes"
      write.table(temp,paste0(inFiles[i], "_DMC_region.csv"),row.names = FALSE, sep="\t", col.names=T, quote=FALSE)
    } else {
      outdf[i,7]<-"no"
    }
  }
write.table(outdf,paste0("00_DMC_summary.csv"),row.names = FALSE, sep="\t", col.names=T, quote=FALSE)

#### unique table
inFiles <- list.files(path=paste0(baseDir,"/tmp"), pattern = "_DMC_analysis.csv$")
toSave<-c()
for (i in 1:length(inFiles)){
  input<-read.csv(paste0(baseDir,"/tmp/",inFiles[i]), header=TRUE, stringsAsFactors = FALSE, sep=",")
  fileNames<-strsplit(inFiles[i], "_" )
  input$RE<-rep(fileNames[[1]][1], times=nrow(input))
  input$factor<-rep(fileNames[[1]][2], times=nrow(input))
  input$context<-rep(fileNames[[1]][3], times=nrow(input))
  input$feature<-rep(fileNames[[1]][4], times=nrow(input))
  toSave<-rbind(toSave, input)
}
write.table(toSave, "00_DMC_table.csv",row.names = FALSE, sep="\t", col.names=T, quote=FALSE)


===  2022 Sep 17 03:36:19 PM === Removing 0 samples due to the sampleRemovalInfo column 
Subsetting by feature:  gene 3712 rows 
Fitting DML model for CpG site: CHH 2615 rows 
Fitting DML model for CpG site: CG 610 rows 
Fitting DML model for CpG site: CHG 487 rows 
Subsetting by feature:  transposon 3312 rows 
Fitting DML model for CpG site: CHH 2290 rows 
Fitting DML model for CpG site: CG 604 rows 
Fitting DML model for CpG site: CHG 418 rows 
Subsetting by feature:  repeat 1870 rows 
Fitting DML model for CpG site: CHH 1406 rows 
Fitting DML model for CpG site: CG 266 rows 
Fitting DML model for CpG site: CHG 198 rows 
Subsetting by feature:  nothing 12525 rows 
Fitting DML model for CpG site: CHH 8828 rows 
Fitting DML model for CpG site: CG 2231 rows 
Fitting DML model for CpG site: CHG 1466 rows 
===  2022 Sep 17 03:36:43 PM === Removing 0 samples due to the sampleRemovalInfo column 
Subsetting by feature:  gene 2248 rows 
Fitting DML model for CpG site: CHH 1536 rows 
Fitting D