Process raw Infinium 850k methylation array data using the meffil R package: https://github.com/perishky/meffil/wiki/
- this notebook performs QC and removes low-quality samples
- Date: 27.10.25

### Setup

In [1]:
R.version

               _                           
platform       x86_64-conda-linux-gnu      
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          5.1                         
year           2025                        
month          06                          
day            13                          
svn rev        88306                       
language       R                           
version.string R version 4.5.1 (2025-06-13)
nickname       Great Square Root           

In [2]:
## load libraries
library(stringr)
library(data.table) 
library(vroom)
library(ggplot2)
library(tidyr)
library(limma)
library(meffil)
library(readxl)
library(dplyr)


Loading required package: illuminaio

Loading required package: MASS

Loading required package: lmtest

Loading required package: zoo


Attaching package: ‘zoo’


The following objects are masked from ‘package:data.table’:

    yearmon, yearqtr


The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric


Loading required package: sandwich

Loading required package: sva

Loading required package: mgcv

Loading required package: nlme

This is mgcv 1.9-3. For overview type 'help("mgcv-package")'.

Loading required package: genefilter


Attaching package: ‘genefilter’


The following object is masked from ‘package:MASS’:

    area


The following object is masked from ‘package:vroom’:

    spec


Loading required package: BiocParallel

Loading required package: plyr

Loading required package: reshape2


Attaching package: ‘reshape2’


The following object is masked from ‘package:tidyr’:

    smiths


The following objects are masked from ‘package:data.table’:

  

In [5]:
# set wd
setwd('/exports/cmvm/eddie/smgphs/groups/Quantgen/Users/vasilis/PHD/EBB_methylation/')

In [6]:
# set # of cores
library(parallel)
cores = detectCores()
cores
options(mc.cores=cores)
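On a shared cluster, `detectCores()` reports the cores of the whole node rather than those allocated to the job, and it can return `NA` on some platforms. The sketch below is one hedged way to pick a core count. It assumes the scheduler exports an `NSLOTS` environment variable (Grid Engine does); if it is absent, the code falls back to `detectCores()` and then to a single core.

```r
library(parallel)

# Assumption: the job scheduler exports NSLOTS; if not, this is NA.
slots    <- suppressWarnings(as.integer(Sys.getenv("NSLOTS", unset = NA)))
detected <- detectCores()   # may be NA on some platforms

cores <- if (!is.na(slots) && slots >= 1L) {
    slots                   # cores granted to this job
} else if (!is.na(detected)) {
    detected                # all cores visible on the machine
} else {
    1L                      # safe fallback
}
options(mc.cores = cores)
cores
```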

In [7]:
# generate sample sheet
samplesheet <- meffil.create.samplesheet('BrainSamples/data//idats_140716', recursive=TRUE)
# update sex and Sample name
batch <- fread('metadata/pheno.txt')

upd <- 
fread('BrainSamples/data/idats_140716/Samples_Table_140716.csv') %>% 
    mutate(Sample_Name = paste0(`Sentrix Barcode`, "_", `Array`)) %>%
    dplyr::select(c('Sample ID', 'Sample_Name')) %>%
    dplyr::rename('Sample_Name2' = 'Sample ID')

samplesheet <-
inner_join(samplesheet, upd, by = 'Sample_Name') %>% 
    dplyr::mutate(Sample_Name = Sample_Name2) %>%
    dplyr::select(-c(Sample_Name2)) %>%
    dplyr::select(-c(Sex)) %>%
    left_join(., batch, by = c('Sample_Name'='sample.ID')) %>%
    dplyr::rename('Sex' = 'sex')
samplesheet %>% head
samplesheet %>% dim

| | Sample_Name | Slide | sentrix_row | sentrix_col | Basename | Sex | batch | tissue.region | sample.wait.time |
|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` |
| 1 | SD001/11B | 200514040135 | 1 | 1 | BrainSamples/data//idats_140716/200514040135_R01C01 | M | 2056G | Cortex | 5 |
| 2 | SD033/10 | 200514040135 | 2 | 1 | BrainSamples/data//idats_140716/200514040135_R02C01 | M | 2056G | Cortex | 0 |
| 3 | SD024/08 | 200514040135 | 3 | 1 | BrainSamples/data//idats_140716/200514040135_R03C01 | M | 2056G | Cortex | 5 |
| 4 | SD039/08 | 200514040135 | 4 | 1 | BrainSamples/data//idats_140716/200514040135_R04C01 | M | 2056G | Cortex | 5 |
| 5 | SD043/06 | 200514040135 | 5 | 1 | BrainSamples/data//idats_140716/200514040135_R05C01 | F | 2056G | Cortex | 5 |
| 6 | SD034/09B | 200514040135 | 6 | 1 | BrainSamples/data//idats_140716/200514040135_R06C01 | M | 2056G | Cortex | 0 |
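The sample sheet above is built with an inner join followed by a left join, so arrays missing from the lookup table are silently dropped and samples missing from the phenotype file get `NA` values. A minimal sketch of a pre-join check, using made-up toy tables and base R only (the `_toy` objects are hypothetical stand-ins, not the real data):

```r
# Toy stand-ins for the real tables (all values are made up for illustration).
samplesheet_toy <- data.frame(
    Sample_Name = c("100_R01C01", "100_R02C01", "100_R03C01"),
    Slide       = c("100", "100", "100"),
    stringsAsFactors = FALSE
)
upd_toy <- data.frame(
    Sample_Name  = c("100_R01C01", "100_R02C01"),
    Sample_Name2 = c("S1", "S2"),
    stringsAsFactors = FALSE
)

# Arrays with no entry in the lookup table would be dropped by an inner join.
unmatched <- setdiff(samplesheet_toy$Sample_Name, upd_toy$Sample_Name)
unmatched

# Duplicated keys in the lookup table would duplicate rows after the join.
dup_keys <- upd_toy$Sample_Name[duplicated(upd_toy$Sample_Name)]
dup_keys

# Base-R equivalent of the inner join, to compare row counts before and after.
merged <- merge(samplesheet_toy, upd_toy, by = "Sample_Name")
nrow(merged)
```

Comparing `nrow()` before and after each join, as done with `dim` in the cell above, serves the same purpose on the real tables.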


### Quality Control

Next, perform background correction, dye-bias correction, sex prediction, and cell count estimation. The `meffil.qc` function processes the idat files and returns a `qc.object` for each sample. It needs a cell type reference suited to the tissue; `meffil.list.cell.type.references()` lists the available ones, which include whole blood and cord blood references as well as the brain reference used here.
- Set `cell.type.reference` to 'guintivano dlpfc' (brain tissue) 
- see: https://github.com/perishky/meffil/wiki/Sample-QC

In [8]:
meffil.list.cell.type.references()
qc.objects <- meffil.qc(samplesheet, cell.type.reference='guintivano dlpfc', verbose=FALSE)
save(qc.objects,file="meffil_data/qc.objects.Robj")
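A short sketch of how `save()` and `load()` behave, since this notebook relies on them for checkpoints. It uses a toy list and a temporary file; `qc.toy` is a hypothetical name chosen for illustration.

```r
# Toy object standing in for qc.objects; the file lives in a temporary directory.
qc.toy <- list(s1 = list(n = 1), s2 = list(n = 2))
path <- file.path(tempdir(), "qc.toy.Robj")
save(qc.toy, file = path)

# load() recreates the object under its saved name and returns that name.
rm(qc.toy)
restored <- load(path)
restored
length(get(restored))
```

Because `load()` restores objects under their saved names, loading any of the `qc.summary` files written later in this notebook would overwrite an existing `qc.summary`. `saveRDS()` and `readRDS()` are a base-R alternative that lets the caller choose the name on reading.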

In [9]:
names(qc.objects) %>% head
length(qc.objects)

### Remove samples with low genotype concordance 

- If genotype data are available for the same individuals as the methylation profiles, you can check for ID mismatches. The methylation arrays carry a small set of SNP probes (65 on the 450k array) whose genotypes can be called from the methylation data and compared with genotypes measured on genotyping arrays.
- see `meffil_00_match_samples.ipynb` for details, where the genotype concordance was calculated 
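To make the idea of per-sample genotype concordance concrete, here is a minimal sketch on made-up data. It takes the simplest possible definition, the fraction of SNPs with identical 0/1/2 calls, and does not reproduce how meffil calls genotypes from SNP probe intensities. The matrices `reference` and `called` are hypothetical.

```r
# Toy genotype matrices (SNPs in rows, samples in columns), coded 0/1/2.
# "reference" stands in for genotyping-array calls, "called" for genotypes
# inferred from the methylation array's SNP probes. Values are made up.
reference <- matrix(c(0, 1, 2, 1,
                      2, 2, 0, 1,
                      1, 0, 1, 2),
                    nrow = 4,
                    dimnames = list(paste0("snp", 1:4), c("A", "B", "C")))
called <- reference
called[, "C"] <- c(1, 2, 0, 0)   # sample C disagrees at 3 of 4 SNPs

# Per-sample concordance: fraction of SNPs with identical calls.
concordance <- colMeans(called == reference, na.rm = TRUE)
concordance

# Samples below the 0.75 cutoff used in this notebook.
bad <- names(concordance)[concordance < 0.75]
bad
```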

In [11]:
### Load meffil objects
# load('meffil_data/qc.objects.Robj')
# load('meffil_data/qcsummary.Robj')

In [10]:
### remove low concordance samples < 0.75
genconc <- fread('meffil_data/sampleID.mapping.genconc.txt')
bad.conc.samples <- genconc %>% filter( gen.concordance < 0.75) %>% pull(sample.ID)
cat(length(bad.conc.samples), 'samples removed due to low genotype concordance:', bad.conc.samples, '\n')

qc.objects.conc <- meffil.remove.samples(qc.objects, bad.conc.samples)
cat(length(qc.objects.conc), 'samples remaining')

10 samples removed due to low genotype concordance: SD010/09 SD023/11B SD023/13 SD030/09 SD032/08 SD036/10 SD036/13 SD036/14B SD038/08 SD042/13 
126 samples remaining

In [12]:
featureset <- qc.objects[[2]]$featureset
featureset
#writeLines(meffil.snp.names(featureset), con="snp-names.txt")

In [None]:
# genotypes extracted using:
# >plink2 --pfile genotypingdata/plink_files/pgen/imputed_allchr_newIDs --extract snp-names-newID.txt --recode A --out meffil_data/genotypes-imp-newIDs

In [13]:
## load genotypes (imputed)
genotypes0 <- meffil.extract.genotypes("meffil_data/genotypes-imp-newIDs.raw")
genotypes0 <- genotypes0[,match(names(qc.objects.conc), colnames(genotypes0))]
genotypes_df <- 
    as.data.frame(genotypes0) %>%
    tibble::rownames_to_column("gen.id")
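## fix SNP names: map the genotype row IDs back to the rs IDs used by meffil
## (the row IDs appear to be sanitised R names: "X" prefix for IDs starting with a digit, "." instead of ":")
rsids <-
fread('snp-names-pvar-table.txt') %>% 
    dplyr::select(snp, gen.id) %>% 
    mutate(
        gen.id = ifelse(str_detect(gen.id, "^X:"), gen.id, paste0("X", gen.id)),
        gen.id = gsub(":", ".", gen.id)
          )
genotypes_df2 <- left_join(genotypes_df, rsids, by = 'gen.id') %>% dplyr::select(-c(gen.id))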
genotypes     <- as.matrix(genotypes_df2[,!(names(genotypes_df2) %in% 'snp')])
rownames(genotypes) <- genotypes_df2$snp  
genotypes %>% head

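| | SD001/11B | SD033/10 | SD024/08 | SD039/08 | SD043/06 | SD034/09B | SD025/13 | SD027/11 | SD004/06 | SD025/09 | ⋯ | SD024/14B | SD008/09 | SD032/09 | SD022/08B | SD048/12 | SD055/12 | SD036/12 | SD033/08 | SD025/08 | SD031/09 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rs3936238 | 0 | 0 | 1 | 1 | 1 | 2 | 1 | 0 | 0 | 1 | ⋯ | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| rs877309 | 1 | 1 | 2 | 0 | 0 | 2 | 1 | 2 | 1 | 0 | ⋯ | 1 | 1 | 2 | 0 | 1 | 2 | 1 | 0 | 1 | 1 |
| rs213028 | 0 | 1 | 2 | 2 | 2 | 1 | 1 | 2 | 2 | 2 | ⋯ | 2 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 2 | 1 |
| rs11249206 | 1 | 1 | 0 | 1 | 0 | 0 | 2 | 1 | 1 | 1 | ⋯ | 2 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |
| rs654498 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ⋯ | 1 | 0 | 1 | 2 | 2 | 2 | 2 | 0 | 1 | 0 |
| rs715359 | 1 | 1 | 1 | 0 | 2 | 0 | 1 | 2 | 2 | 1 | ⋯ | 2 | 2 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 2 |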
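The SNP-name fix in the cell above rewrites variant IDs so that they match the row names of the extracted genotype matrix. The sketch below applies the same two-step rule to made-up IDs in base R and shows that, for these IDs, it gives the same result as `make.names()`. The assumption that the real IDs have a `chr:pos:ref:alt` shape is mine and is not confirmed by the notebook.

```r
# Toy variant IDs in chr:pos:ref:alt form (made up for illustration).
ids <- c("1:12345:A:G", "22:6789:C:T", "X:4321:G:A")

# Same rule as the notebook cell, in base R: prefix "X" unless the ID is on
# chromosome X, then turn every ":" into ".".
prefixed  <- ifelse(grepl("^X:", ids), ids, paste0("X", ids))
converted <- gsub(":", ".", prefixed, fixed = TRUE)
converted

# make.names() applies the same sanitising to these IDs.
same_as_make_names <- identical(converted, make.names(ids))
same_as_make_names
```

The two approaches would differ for IDs that start with another letter (for example `Y:` or `MT:`): the rule above would still prepend `X`, while `make.names()` would not.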


### Generate QC report

In [14]:
qc.parameters <- 
meffil.qc.parameters(
    meth.unmeth.outlier.sd                = 5,
    beadnum.samples.threshold             = 0.1,
    detectionp.samples.threshold          = 0.1,
    detectionp.cpgs.threshold             = 0.1,
    beadnum.cpgs.threshold                = 0.1,
    sex.outlier.sd                        = 5,
    snp.concordance.threshold             = 0.95,
    sample.genotype.concordance.threshold = 0.75
)
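As a rough illustration of how a proportion threshold such as `detectionp.samples.threshold = 0.1` can be read, the sketch below flags samples in which more than 10% of probes are undetected. It runs on a made-up matrix; the per-probe cutoff of 0.01 is an assumption, and the code is not meffil's implementation.

```r
# Toy detection p-value matrix: 20 probes (rows) x 3 samples (columns); values made up.
detp <- matrix(0.001, nrow = 20, ncol = 3,
               dimnames = list(paste0("cg", 1:20), c("s1", "s2", "s3")))
detp[1, "s2"]   <- 0.5     # one undetected probe in s2  (1/20 = 5%)
detp[1:4, "s3"] <- 0.5     # four undetected probes in s3 (4/20 = 20%)

p_cutoff         <- 0.01   # assumed per-probe detection cutoff
sample_threshold <- 0.1    # mirrors detectionp.samples.threshold

# Fraction of undetected probes per sample; flag samples above the threshold.
frac_undetected <- colMeans(detp > p_cutoff)
frac_undetected
flagged <- names(frac_undetected)[frac_undetected > sample_threshold]
flagged
```

The same pattern with `rowMeans()` would give the per-CpG analogue, the fraction of samples in which a probe is undetected.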

In [15]:
## full dataset
qc.summary <- meffil.qc.summary(
	qc.objects,
	parameters = qc.parameters,
	genotypes=genotypes
)

save(qc.summary, file="meffil_data/qcsummary.Robj")

[meffil.qc.summary] Tue Nov  4 09:54:03 2025 Sex summary TRUE 
[meffil.qc.summary] Tue Nov  4 09:54:03 2025 Meth vs unmeth summary 


“The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the meffil package.
  Please report the issue to the authors.”


[meffil.qc.summary] Tue Nov  4 09:54:03 2025 Control means summary 
[meffil.qc.summary] Tue Nov  4 09:54:03 2025 Sample detection summary 
[meffil.qc.summary] Tue Nov  4 09:54:20 2025 CpG detection summary 
[meffil.qc.summary] Tue Nov  4 09:54:30 2025 Sample bead numbers summary 
[meffil.qc.summary] Tue Nov  4 09:54:47 2025 CpG bead numbers summary 
[meffil.qc.summary] Tue Nov  4 09:54:49 2025 Cell count summary 


“The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
ℹ Please use the `fun` argument instead.
ℹ The deprecated feature was likely used in the meffil package.
  Please report the issue to the authors.”


[meffil.qc.summary] Tue Nov  4 09:54:49 2025 Genotype concordance 


In [16]:
meffil.qc.report(qc.summary, output.file="meffil_data/qc-report.html")

[meffil.qc.report] Tue Nov  4 09:55:22 2025 Writing report as html file to meffil_data/qc-report.html 




processing file: /exports/cmvm/eddie/smgphs/groups/Quantgen/Users/vasilis/R-packages/meffil/reports/qc-report.rmd



1/38                   
2/38 [unnamed-chunk-1] 
3/38                   
4/38 [unnamed-chunk-2] 
5/38                   
6/38 [unnamed-chunk-3] 
7/38                   
8/38 [unnamed-chunk-4] 
9/38                   
10/38 [unnamed-chunk-5] 
11/38                   
12/38 [unnamed-chunk-6] 
13/38                   
14/38 [unnamed-chunk-7] 
15/38                   
16/38 [unnamed-chunk-8] 
17/38                   
18/38 [unnamed-chunk-9] 
19/38                   
20/38 [unnamed-chunk-10]
21/38                   
22/38 [unnamed-chunk-11]
23/38                   
24/38 [unnamed-chunk-12]
25/38                   
26/38 [unnamed-chunk-13]
27/38                   
28/38 [unnamed-chunk-14]
29/38                   
30/38 [unnamed-chunk-15]




processing file: /exports/cmvm/eddie/smgphs/groups/Quantgen/Users/vasilis/R-packages/meffil/reports/cell-counts.rmd



1/4                   
2/4 [unnamed-chunk-35]
3/4                   
4/4 [unnamed-chunk-36]
31/38                   
32/38 [unnamed-chunk-16]
33/38                   
34/38 [unnamed-chunk-17]
35/38                   
36/38 [unnamed-chunk-18]




processing file: /exports/cmvm/eddie/smgphs/groups/Quantgen/Users/vasilis/R-packages/meffil/reports/genotype-concordance.rmd



1/9                   
2/9 [unnamed-chunk-42]
3/9                   
4/9 [unnamed-chunk-43]
5/9                   
6/9 [unnamed-chunk-44]
7/9                   
8/9 [unnamed-chunk-45]
9/9                   
37/38                   
38/38 [unnamed-chunk-19]


output file: /exports/cmvm/eddie/smgphs/groups/Quantgen/Users/vasilis/PHD/EBB_methylation/meffil_data/qc-report.md




In [17]:
## high gen.conc dataset
qc.summary <- meffil.qc.summary(
	qc.objects.conc,
	parameters = qc.parameters,
	genotypes=genotypes
)

save(qc.summary, file="meffil_data/qcsummary.highconc.Robj")

[meffil.qc.summary] Tue Nov  4 09:55:54 2025 Sex summary TRUE 
[meffil.qc.summary] Tue Nov  4 09:55:54 2025 Meth vs unmeth summary 
[meffil.qc.summary] Tue Nov  4 09:55:54 2025 Control means summary 
[meffil.qc.summary] Tue Nov  4 09:55:54 2025 Sample detection summary 
[meffil.qc.summary] Tue Nov  4 09:56:10 2025 CpG detection summary 
[meffil.qc.summary] Tue Nov  4 09:56:10 2025 Sample bead numbers summary 
[meffil.qc.summary] Tue Nov  4 09:56:25 2025 CpG bead numbers summary 
[meffil.qc.summary] Tue Nov  4 09:56:27 2025 Cell count summary 
[meffil.qc.summary] Tue Nov  4 09:56:27 2025 Genotype concordance 


In [18]:
meffil.qc.report(qc.summary, output.file="meffil_data/qc-report.highconc.html")

[meffil.qc.report] Tue Nov  4 09:56:40 2025 Writing report as html file to meffil_data/qc-report.highconc.html 






output file: /exports/cmvm/eddie/smgphs/groups/Quantgen/Users/vasilis/PHD/EBB_methylation/meffil_data/qc-report.highconc.md




### Remove bad samples

In [19]:
### qc parameters
qc.parameters <- 
meffil.qc.parameters(
    meth.unmeth.outlier.sd                = 5,
    beadnum.samples.threshold             = 0.1,
    detectionp.samples.threshold          = 0.1,
    detectionp.cpgs.threshold             = 0.1,
    beadnum.cpgs.threshold                = 0.1,
    sex.outlier.sd                        = 5,
    snp.concordance.threshold             = 0.9,   # note: 0.95 in the parameters used for the reports above
    sample.genotype.concordance.threshold = 0.75
)

In [20]:
### Load meffil objects
# load('meffil_data/qc.objects.Robj')
# load('meffil_data/qcsummary.Robj')

In [21]:
## check
outlier <- qc.summary$bad.samples
table(outlier$issue)
index <- outlier$issue %in% c("Control probe (dye.bias)", 
                              "Methylated vs Unmethylated",
                              "X-Y ratio outlier",
                              "Low bead numbers",
                              "Detection p-value",
                              "Sex mismatch",
                              "Genotype mismatch",
                              "Control probe (bisulfite1)",
                              "Control probe (bisulfite2)")

outlier <- outlier[index,]


Control probe (spec1.ratio) 
                          1 
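The cell above keeps only flags whose issue is on an explicit removal list. The only issue reported here, `Control probe (spec1.ratio)`, is not on that list, so no sample is removed for it. A minimal sketch of the same selection on a made-up table (sample names and the shortened `remove_for` list are hypothetical):

```r
# Toy version of a bad-samples table; sample names are made up.
bad_toy <- data.frame(
    sample.name = c("S1", "S2", "S3"),
    issue = c("Control probe (spec1.ratio)", "Sex mismatch", "Detection p-value"),
    stringsAsFactors = FALSE
)

# Issues considered serious enough to drop a sample (shortened list).
remove_for <- c("Sex mismatch", "Detection p-value", "Genotype mismatch")

# Samples to drop, and flags that are tolerated.
to_remove  <- unique(bad_toy$sample.name[bad_toy$issue %in% remove_for])
kept_flags <- bad_toy[!bad_toy$issue %in% remove_for, ]
to_remove
kept_flags
```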

In [22]:
cat('# of samples before QC:', length(qc.objects), '\n')
cat('# of samples after genotype concordance QC:', length(qc.objects.conc), '\n')
qc.objects <- meffil.remove.samples(qc.objects.conc, outlier$sample.name)
cat('# of samples after full QC:', length(qc.objects))
save(qc.objects,file="meffil_data/qc.objects.clean.Robj")

# of samples before QC: 136 
# of samples after genotype concordance QC: 126 
# of samples after full QC: 126

In [23]:
### Rerun QC summary on clean dataset
qc.summary <- meffil.qc.summary(qc.objects, parameters=qc.parameters, genotypes = genotypes)
save(qc.summary, file = "meffil_data/qcsummary.clean.Robj")

[meffil.qc.summary] Tue Nov  4 09:57:45 2025 Sex summary TRUE 
[meffil.qc.summary] Tue Nov  4 09:57:45 2025 Meth vs unmeth summary 
[meffil.qc.summary] Tue Nov  4 09:57:45 2025 Control means summary 
[meffil.qc.summary] Tue Nov  4 09:57:45 2025 Sample detection summary 
[meffil.qc.summary] Tue Nov  4 09:58:00 2025 CpG detection summary 
[meffil.qc.summary] Tue Nov  4 09:58:01 2025 Sample bead numbers summary 
[meffil.qc.summary] Tue Nov  4 09:58:17 2025 CpG bead numbers summary 
[meffil.qc.summary] Tue Nov  4 09:58:18 2025 Cell count summary 
[meffil.qc.summary] Tue Nov  4 09:58:18 2025 Genotype concordance 


In [24]:
## new qc report
meffil.qc.report(qc.summary, output.file="meffil_data/qc-report.clean.html")

[meffil.qc.report] Tue Nov  4 09:58:31 2025 Writing report as html file to meffil_data/qc-report.clean.html 






output file: /exports/cmvm/eddie/smgphs/groups/Quantgen/Users/vasilis/PHD/EBB_methylation/meffil_data/qc-report.clean.md




In [25]:
sessionInfo()

R version 4.5.1 (2025-06-13)
Platform: x86_64-conda-linux-gnu
Running under: Rocky Linux 9.5 (Blue Onyx)

Matrix products: default
BLAS/LAPACK: /exports/cmvm/eddie/smgphs/groups/Quantgen/Users/vasilis/PHD/jupiter-setup/envs/jpt/lib/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dplyr_1.1.4           readxl_1.4.5          meffil_1.5.1          preprocessCore_1.70.0 SmartSVA_0.1.3        RSpectra_0.16-2       isva_1.9              JADE_2.0-4           
 [9] 