In [1]:
library(SnapATAC)
library(Seurat)
library(Signac)
library(genomation)
library(GenomicRanges)
library(parallel)
library(foreach)

Loading required package: Matrix

Loading required package: rhdf5

“no DISPLAY variable so Tk is not available”
Loading required package: SeuratObject

Loading required package: sp

‘SeuratObject’ was built under R 4.3.2 but the current version is
4.3.3; it is recomended that you reinstall ‘SeuratObject’ as the ABI
for R may have changed

‘SeuratObject’ was built with package ‘Matrix’ 1.6.3 but the current
version is 1.6.5; it is recomended that you reinstall ‘SeuratObject’ as
the ABI for ‘Matrix’ may have changed


Attaching package: ‘SeuratObject’


The following object is masked from ‘package:base’:

    intersect


Loading required package: grid

“replacing previous import ‘Biostrings::pattern’ by ‘grid::pattern’ when loading ‘genomation’”
Loading required package: stats4

Loading required package: BiocGenerics


Attaching package: ‘BiocGenerics’


The following object is masked from ‘package:SeuratObject’:

    intersect


The following objects are masked from ‘package:stats’:

  

Specify file paths

In [2]:
gene_gtf_path = "/maps/projects/ralab/data/genome/hg38/gencode.v43.chr_patch_hapl_scaff.annotation.gtf"
abc_genes_path = "/maps/projects/ralab_nnfc-AUDIT/people/lpm537/software/scE2G_pipeline/241203/scE2G/ENCODE_rE2G/ABC/reference/hg38/CollapsedGeneBounds.hg38.TSS500bp.bed"
path.matrix.atac_count = "/maps/projects/ralab_nnfc-AUDIT/people/lpm537/software/scE2G_pipeline/240508/sc-E2G/test/results_K562_Xu/K562/Kendall/atac_matrix.csv.gz"
path.matrix.rna_count = "/maps/projects/ralab_nnfc-AUDIT/people/lpm537/project/E2G/analysis/E2G_240503/data/K562_Xu/1.prepare_data/1.seurat_pipeline.240507/rna_count_matrix.csv.gz"
dir.output = "/maps/projects/ralab_nnfc-AUDIT/people/lpm537/project/E2G/analysis/E2G_240503/data/K562_Xu/3.Genome_wide_prediction/SnapATAC.default/SnapATAC.250212/"

In [3]:
n.cores = 8

Import ATAC matrix

In [4]:
matrix.atac_count = read.csv(path.matrix.atac_count,
                             row.names = 1,
                             check.names = F)
matrix.atac_count = Matrix(as.matrix(matrix.atac_count), sparse = TRUE)
matrix.atac = BinarizeCounts(matrix.atac_count)
rm(matrix.atac_count)
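Binarization here means that any nonzero count becomes 1, so each entry only records whether a peak was accessible in a cell. A minimal sketch of that idea on a toy sparse matrix, using only the Matrix package (the toy values and the names counts and binary are illustrative and not part of the analysis):

```r
library(Matrix)

# Toy peak-by-cell count matrix (2 peaks x 3 cells); values are made up.
counts <- Matrix(c(0, 2, 0, 5, 1, 0), nrow = 2, sparse = TRUE)

# In a dgCMatrix only nonzero entries are stored in the @x slot,
# so overwriting them with 1 binarizes the matrix without densifying it.
binary <- counts
binary@x <- rep(1, length(binary@x))

print(binary)
```

Working on the @x slot keeps the matrix sparse, which matters for a peak matrix of this size.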

Import RNA matrix

In [5]:
matrix.rna_count = read.csv(path.matrix.rna_count,
                            row.names = 1,
                            check.names = F)
matrix.rna_count = Matrix(as.matrix(matrix.rna_count), sparse = TRUE)
matrix.rna_count = matrix.rna_count[,colnames(matrix.atac)]
matrix.rna_count = matrix.rna_count[rowSums(matrix.rna_count) > 0,]
matrix.rna = NormalizeData(matrix.rna_count)
rm(matrix.rna_count)
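Assuming NormalizeData is run with its defaults (LogNormalize with scale.factor = 10000), each cell's counts are divided by that cell's total, multiplied by 10,000 and passed through log1p. A minimal sketch of the same arithmetic on a toy matrix (the values and the names counts, lib.size and norm are made up for illustration):

```r
library(Matrix)

# Toy gene-by-cell counts (3 genes x 2 cells); values are made up.
counts <- Matrix(c(1, 0, 3, 2, 2, 0), nrow = 3, sparse = TRUE,
                 dimnames = list(c("g1", "g2", "g3"), c("cellA", "cellB")))

# Divide each cell (column) by its total count, scale to 10,000, then log1p.
lib.size <- Matrix::colSums(counts)
norm <- log1p(t(t(counts) / lib.size) * 1e4)

print(round(as.matrix(norm), 3))
```

Zero counts stay zero after this transform, so the normalized matrix remains sparse.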

Map gene names

In [6]:
extract_attributes <- function(gtf_attributes, att_of_interest){
  att <- unlist(strsplit(gtf_attributes, " "))
  if(att_of_interest %in% att){
    return(gsub("\"|;","", att[which(att %in% att_of_interest)+1]))
  } else {
    return(NA)}
}
map_gene_names <- function(rna_matrix, gene_gtf_path, abc_genes_path){
    library(dplyr)
    library(data.table)

    # Gene-level GTF records, reduced to gene name and unversioned Ensembl ID
    gene_ref <- fread(gene_gtf_path, header = FALSE, sep = "\t") %>%
        setNames(c("chr","source","type","start","end","score","strand","phase","attributes")) %>%
        dplyr::filter(type == "gene")
    gene_ref$gene_ref_name <- unlist(lapply(gene_ref$attributes, extract_attributes, "gene_name"))
    gene_ref$Ensembl_ID <- unlist(lapply(gene_ref$attributes, extract_attributes, "gene_id"))
    gene_ref <- dplyr::select(gene_ref, gene_ref_name, Ensembl_ID) %>%
        mutate(Ensembl_ID = sub("\\.\\d+$", "", Ensembl_ID)) %>% # strip the version suffix
        distinct()

    # ABC gene list joined to the GTF gene names via Ensembl ID
    abc_genes <- fread(abc_genes_path, col.names = c("chr", "start", "end", "name", "score", "strand", "Ensembl_ID", "gene_type")) %>%
        dplyr::select(name, Ensembl_ID) %>%
        dplyr::rename(abc_name = name) %>%
        left_join(gene_ref, by = "Ensembl_ID") %>%
        group_by(Ensembl_ID) %>% # keep only Ensembl IDs that occur exactly once after the join
        dplyr::filter(n() == 1) %>%
        ungroup()

    # Lookup table: GTF gene name -> ABC gene name
    gene_key <- abc_genes$abc_name
    names(gene_key) <- abc_genes$gene_ref_name

    # remove genes not in our gene universe
    row_sub <- intersect(rownames(rna_matrix), names(gene_key)) # gene ref names
    rna_matrix_filt <- rna_matrix[row_sub,] # still gene ref names
    rownames(rna_matrix_filt) <- gene_key[row_sub] # converted to abc names

    return(rna_matrix_filt)
}
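The attribute parsing above splits each GTF attribute string on spaces and takes the token after the key. A vectorized alternative, sketched here with a regular expression in base R, may be faster on a full annotation. The helper name extract_attribute_regex and the toy attribute string are hypothetical and not part of the pipeline:

```r
# Hypothetical helper: pull the quoted value that follows `key` in GTF attributes.
extract_attribute_regex <- function(attributes, key) {
  pattern <- paste0('(?:^|; ?)', key, ' "([^"]*)"')
  hits <- regmatches(attributes, regexec(pattern, attributes, perl = TRUE))
  # Each hit is c(full match, captured value); no match gives character(0).
  vapply(hits, function(h) if (length(h) >= 2) h[2] else NA_character_, character(1))
}

# Made-up attribute string in GENCODE-like layout.
attr.toy <- 'gene_id "ENSG00000000001.5"; gene_type "protein_coding"; gene_name "GENE1"; level 2;'

gene.name <- extract_attribute_regex(attr.toy, "gene_name")
gene.id   <- sub("\\.\\d+$", "", extract_attribute_regex(attr.toy, "gene_id"))
missing   <- extract_attribute_regex(attr.toy, "transcript_id")

print(c(gene.name, gene.id, missing))
```

Because the helper accepts a whole character vector, it can be applied to the attributes column directly without lapply.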

In [7]:
matrix.rna.rename = map_gene_names(matrix.rna,gene_gtf_path, abc_genes_path)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:GenomicRanges’:

    intersect, setdiff, union


The following object is masked from ‘package:GenomeInfoDb’:

    intersect


The following objects are masked from ‘package:IRanges’:

    collapse, desc, intersect, setdiff, slice, union


The following objects are masked from ‘package:S4Vectors’:

    first, intersect, rename, setdiff, setequal, union


The following objects are masked from ‘package:BiocGenerics’:

    combine, intersect, setdiff, union


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Attaching package: ‘data.table’


The following objects are masked from ‘package:dplyr’:

    between, first, last


The following object is masked from ‘package:GenomicRanges’:

    shift


The following object is masked from ‘package:IRanges’:

    shift


The following objects are masked

Create SnapATAC object

In [8]:
df.peaks = as.data.frame(do.call(rbind,strsplit(rownames(matrix.atac),"-")))
colnames(df.peaks) = c("chr","start","end")
bed.peaks = makeGRangesFromDataFrame(df.peaks)
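The peak names are assumed to follow the chr-start-end pattern. A minimal base R sketch of the same split, with the coordinates converted to integers explicitly (the three peak names are taken from the results printed further down; the object names parts and df.toy are illustrative):

```r
# Peak names in "chr-start-end" form.
peak.names <- c("chr1-115493-115961",
                "chr1-135076-135269",
                "chrX-156030021-156030865")

# fixed = TRUE treats "-" literally; this assumes no hyphen inside the chromosome name.
parts <- do.call(rbind, strsplit(peak.names, "-", fixed = TRUE))

df.toy <- data.frame(chr   = parts[, 1],
                     start = as.integer(parts[, 2]),
                     end   = as.integer(parts[, 3]),
                     stringsAsFactors = FALSE)
print(df.toy)
```

Converting start and end up front makes it easy to check for malformed names (they would show up as NA) before building the GRanges object.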

In [9]:
x.sp = createSnapFromPmat(mat = t(matrix.atac), 
                          barcodes = colnames(matrix.atac), 
                          peaks = bed.peaks)

In [10]:
x.sp@gmat = t(matrix.rna.rename)

In [11]:
x.sp

number of barcodes: 7821
number of bins: 0
number of genes: 18235
number of peaks: 157600
number of motifs: 0

In [12]:
bed.tss.500 = readGeneric("/maps/projects/ralab_nnfc-AUDIT/people/lpm537/software/scE2G_pipeline/240508/sc-E2G/ENCODE_rE2G/ABC/reference/hg38/CollapsedGeneBounds.hg38.TSS500bp.bed",
                          header = T,
                          keep.all.metadata = T)
bed.tss.500

GRanges object with 20666 ranges and 4 metadata columns:
          seqnames            ranges strand |        name     score
             <Rle>         <IRanges>  <Rle> | <character> <integer>
      [1]     chr1       35831-36331      * |     FAM138A         0
      [2]     chr1       35831-36331      * |     FAM138F         0
      [3]     chr1       68840-69340      * |       OR4F5         0
      [4]     chr1     817120-817620      * |      FAM87B         0
      [5]     chr1     827272-827772      * |   LINC00115         0
      ...      ...               ...    ... .         ...       ...
  [20662]     chrY 21386110-21386610      * |       PRORY         0
  [20663]     chrY 21594416-21594916      * |      TTTY13         0
  [20664]     chrY 22298626-22299126      * |       TTTY5         0
  [20665]     chrY 23198842-23199342      * |        DAZ1         0
  [20666]     chrY 23219206-23219706      * |        DAZ2         0
               Ensembl_ID      gene_type
              <cha

In [13]:
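# Note: width = 500000 gives a 500 kb window centered on the TSS (about ±250 kb),
# despite the "1Mb" in the variable name; width = 1000000 would give ±500 kb.
# Windows near chromosome starts can extend past position 1 (negative starts in the output).
bed.tss.1Mb = resize(bed.tss.500, width=500000, fix="center")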
bed.tss.1Mb

GRanges object with 20666 ranges and 4 metadata columns:
          seqnames            ranges strand |        name     score
             <Rle>         <IRanges>  <Rle> | <character> <integer>
      [1]     chr1    -213919-286080      * |     FAM138A         0
      [2]     chr1    -213919-286080      * |     FAM138F         0
      [3]     chr1    -180910-319089      * |       OR4F5         0
      [4]     chr1    567370-1067369      * |      FAM87B         0
      [5]     chr1    577522-1077521      * |   LINC00115         0
      ...      ...               ...    ... .         ...       ...
  [20662]     chrY 21136360-21636359      * |       PRORY         0
  [20663]     chrY 21344666-21844665      * |      TTTY13         0
  [20664]     chrY 22048876-22548875      * |       TTTY5         0
  [20665]     chrY 22949092-23449091      * |        DAZ1         0
  [20666]     chrY 22969456-23469455      * |        DAZ2         0
               Ensembl_ID      gene_type
              <cha
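To make the window size concrete, the center-anchored resize can be reproduced with plain integer arithmetic. This is a sketch of how the new start appears to be computed (it reproduces the OR4F5 row printed above); resize_center is a hypothetical helper, not a GenomicRanges function:

```r
# Hypothetical helper mimicking a center-anchored resize on 1-based closed intervals.
resize_center <- function(start, end, width) {
  old.width <- end - start + 1L
  new.start <- start + (old.width - width) %/% 2L
  data.frame(start = new.start,
             end = new.start + width - 1L,
             clipped.start = pmax(new.start, 1L)) # start trimmed to the chromosome
}

# OR4F5 TSS interval from the table above: chr1:68840-69340
win500 <- resize_center(68840L, 69340L, 500000L)   # 500 kb total, about TSS +/- 250 kb
win1mb <- resize_center(68840L, 69340L, 1000000L)  # 1 Mb total, about TSS +/- 500 kb

print(rbind(win500, win1mb))
```

The first row matches the printed OR4F5 window; the second shows what a ±500 kb window would look like for the same gene.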

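The original predictGenePeakPair function fails when only one peak lies within the gene's TSS flanking window: subsetting a single column with [, idy] returns a vector instead of a matrix.
The modified version below adds drop = FALSE to both subsetting steps so that data.use stays a matrix.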
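The effect of drop can be seen on a toy sparse matrix (values and object names are made up for illustration):

```r
library(Matrix)

# Toy cell-by-peak matrix: 3 cells x 2 peaks.
m <- Matrix(c(0, 1, 1, 0, 1, 0), nrow = 3, sparse = TRUE)

idy <- 1L                        # only one peak overlaps the window
dropped <- m[, idy]              # default drop = TRUE: plain numeric vector
kept    <- m[, idy, drop = FALSE] # stays a 3 x 1 matrix

print(dim(dropped)) # NULL
print(dim(kept))
```

With the matrix shape preserved, column-wise operations such as Matrix::colSums and ncol keep working for the single-peak case.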

In [14]:
predictGenePeakPair.modified = function (obj, input.mat = c("pmat", "bmat"), gene.name = NULL, 
    gene.loci = NULL, do.par = FALSE, num.cores = 1) 
{
    cat("Epoch: checking input parameters ... \n", file = stderr())
    if (missing(obj)) {
        stop("obj is missing")
    }
    else {
        if (!is(obj, "snap")) {
            stop("obj is not a snap obj")
        }
    }
    gene.mat = obj@gmat
    if ((x = nrow(gene.mat)) == 0L) {
        stop("cell by gene matrix is empty!")
    }
    if (!(gene.name %in% colnames(gene.mat))) {
        stop("gene.name does not exist in the cell by gene matrix")
    }
    else {
        gene.val = gene.mat[, gene.name]
    }
    input.mat = match.arg(input.mat)
    if (input.mat == "bmat") {
        data.use = obj@bmat
        peak.use = obj@feature
    }
    else if (input.mat == "pmat") {
        data.use = obj@pmat
        peak.use = obj@peak
    }
    else {
        stop("input.mat does not exist in obj")
    }
    if ((x = max(data.use)) > 1L) {
        stop("input matrix is not binarized, run 'makeBinary' first!")
    }
    if ((x = length(peak.use)) == 0L) {
        stop("peak is empty!")
    }
    if ((x = nrow(data.use)) == 0L) {
        stop("input matrix is empty!")
    }
    ncell = nrow(data.use)
    if (!is(gene.loci, "GRanges")) {
        stop("gene.loci is not genomic range object")
    }
    else {
        if (any(width(gene.loci) == 0L)) {
            stop("length of gene flanking window is 0")
        }
    }
    idy = queryHits(findOverlaps(peak.use, gene.loci))
    if ((x = length(idy)) == 0L) {
        stop("no peaks are found within the gene flanking window")
    }
    else {
        # data.use = data.use[, idy]
        # add ", drop = F" to make sure that data.use is a matrix
        data.use = data.use[, idy, drop = F]
        peak.use = peak.use[idy]
        idy = which(Matrix::colSums(data.use) > 0)
        if ((x = length(idy)) == 0L) {
            stop("no peaks are found within the gene flanking window")
        }
        # data.use = data.use[, idy]
        # add ", drop = F" to make sure that data.use is a matrix
        data.use = data.use[, idy, drop = F]
        peak.use = peak.use[idy]
    }
    cat("Epoch: performing statistical test ... \n", file = stderr())
    if (do.par) {
        if (num.cores > 1) {
            if (num.cores == 1) {
                num.cores = 1
            }
            else if (num.cores > detectCores()) {
                num.cores <- detectCores() - 1
                warning(paste0("num.cores set greater than number of available cores(", 
                  parallel::detectCores(), "). Setting num.cores to ", 
                  num.cores, "."))
            }
        }
        else if (num.cores != 1) {
            num.cores <- 1
        }
        cl <- makeCluster(num.cores, type = "SOCK")
        # registerDoSNOW and llply come from doSNOW and plyr, which library(SnapATAC)
        # may not attach, so they are called by namespace here
        doSNOW::registerDoSNOW(cl)
        clusterEvalQ(cl, library(stats))
        peaks.id = seq(ncol(data.use))
        clusterExport(cl, c("data.use", "gene.val"), envir = environment())
        opts <- list(preschedule = TRUE)
        clusterSetRNGStream(cl, 10)
        models <- suppressWarnings(plyr::llply(.data = peaks.id, .fun = function(t) summary(stats::glm(y ~ 
            x, data = data.frame(y = data.use[, t], x = gene.val), 
            family = binomial(link = "logit")))[["coefficients"]]["x", 
            ], .parallel = TRUE, .paropts = list(.options.snow = opts), 
            .inform = FALSE))
        models <- do.call(rbind, models)
        stopCluster(cl)
    }
    else {
        peaks.id = seq(ncol(data.use))
        models = lapply(peaks.id, function(t) {
            summary(stats::glm(y ~ x, data = data.frame(y = data.use[, 
                t], x = gene.val), family = binomial(link = "logit")))[["coefficients"]]["x", 
                ]
        })
        models <- do.call(rbind, models)
    }
    models[models[, "z value"] < 0, "Pr(>|z|)"] = 1
    peak.use$beta = models[, "Estimate"]
    peak.use$zvalue = models[, "z value"]
    peak.use$stde = models[, "Std. Error"]
    peak.use$Pval = models[, "Pr(>|z|)"]
    peak.use$logPval = -log10(peak.use$Pval)
    cat("Epoch: Done ... \n", file = stderr())
    return(peak.use)
}
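At its core the function fits one logistic regression per peak, with binarized accessibility as the response and gene expression as the predictor, and then treats negative associations as non-significant. A minimal sketch of a single fit on simulated data (all values are simulated; gene.val.toy and peak.toy are illustrative names):

```r
set.seed(1)
n <- 200

# Simulated expression and a peak whose accessibility tends to rise with expression.
gene.val.toy <- rnorm(n, mean = 1)
peak.toy <- rbinom(n, 1, plogis(-1 + 1.5 * gene.val.toy))

fit <- summary(stats::glm(y ~ x,
                          data = data.frame(y = peak.toy, x = gene.val.toy),
                          family = binomial(link = "logit")))

# Row "x" holds Estimate, Std. Error, z value and Pr(>|z|) for the expression term.
coefs <- fit[["coefficients"]]["x", ]

# Same convention as above: a negative z value is reported with P = 1.
pval <- if (coefs[["z value"]] < 0) 1 else coefs[["Pr(>|z|)"]]

print(coefs)
print(pval)
```

The function repeats this fit for every peak in the window and stacks the coefficient rows with rbind.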

Run SnapATAC prediction

In [15]:
gene.use = rownames(matrix.rna.rename[rowSums(matrix.rna.rename) > 0,])
gene.use = gene.use[gene.use %in% bed.tss.1Mb$name]
length(gene.use)

In [16]:
my.cluster <- parallel::makeCluster(
  n.cores,
  type = "PSOCK"
)
doParallel::registerDoParallel(cl = my.cluster)

pairs.E2G =  
  foreach (gene.tmp = gene.use,
           .combine = 'c',
           .packages = c("SnapATAC",
                         "genomation",
                         "GenomicRanges")) %dopar% {
                           bed.tss.1Mb.tmp = bed.tss.1Mb[bed.tss.1Mb$name == gene.tmp]
                           if(length(findOverlaps(bed.peaks, bed.tss.1Mb.tmp)) == 0) {
                             pairs = NULL
                           } else {
                             
                             pairs = predictGenePeakPair.modified(
                               x.sp, 
                               input.mat="pmat",
                               gene.name=gene.tmp, 
                               gene.loci=bed.tss.1Mb.tmp,
                               do.par=FALSE
                             )
                             pairs$TargetGene = gene.tmp
                           }
                           pairs
                           
                         }

parallel::stopCluster(cl = my.cluster)
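predictGenePeakPair.modified calls stop() when none of the overlapping peaks is accessible in any cell, and an error in one gene would then abort the whole loop. One possible safeguard, sketched here in base R with a toy stand-in for the prediction function (safe_predict and toy_predict are hypothetical names), is to catch the error per gene and return NULL:

```r
# Hypothetical wrapper: run `fun` for one gene, return NULL on error.
safe_predict <- function(gene, fun) {
  tryCatch(fun(gene),
           error = function(e) {
             message("skipping ", gene, ": ", conditionMessage(e))
             NULL
           })
}

# Toy stand-in for the real per-gene prediction; fails for one gene on purpose.
toy_predict <- function(gene) {
  if (gene == "geneB") stop("no peaks are found within the gene flanking window")
  data.frame(TargetGene = gene, beta = 0, stringsAsFactors = FALSE)
}

res <- lapply(c("geneA", "geneB", "geneC"), safe_predict, fun = toy_predict)
res <- do.call(rbind, Filter(Negate(is.null), res))
print(res)
```

Within foreach itself, the .errorhandling = "remove" argument offers a similar effect by dropping failed iterations from the combined result.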

In [17]:
pairs.E2G

GRanges object with 1041014 ranges and 6 metadata columns:
            seqnames              ranges strand |       beta     zvalue
               <Rle>           <IRanges>  <Rle> |  <numeric>  <numeric>
        [1]     chr1       115493-115961      * |  1.0665967  0.7975380
        [2]     chr1       135076-135269      * |  1.9248347  1.0107396
        [3]     chr1       136330-137277      * |  0.7032392  0.6309758
        [4]     chr1       137692-138124      * |  0.0661497  0.0278072
        [5]     chr1       138272-139644      * | -0.5042218 -0.3458275
        ...      ...                 ...    ... .        ...        ...
  [1041010]     chrX 156009722-156010137      * | -74.075252  -0.021592
  [1041011]     chrX 156016191-156016823      * |   1.005953   0.693722
  [1041012]     chrX 156019811-156020191      * |   2.068836   1.204477
  [1041013]     chrX 156025028-156025591      * |  -1.559864  -0.397004
  [1041014]     chrX 156030021-156030865      * |   0.701188   0.743172
     

In [18]:
pairs.E2G.res = pairs.E2G

Save results

In [19]:
dir.create(dir.output,recursive = T)

In [20]:
saveRDS(pairs.E2G.res,
        paste(dir.output,"pairs.E2G.res.rds",sep = "/"))
pairs.E2G.res

GRanges object with 1041014 ranges and 6 metadata columns:
            seqnames              ranges strand |       beta     zvalue
               <Rle>           <IRanges>  <Rle> |  <numeric>  <numeric>
        [1]     chr1       115493-115961      * |  1.0665967  0.7975380
        [2]     chr1       135076-135269      * |  1.9248347  1.0107396
        [3]     chr1       136330-137277      * |  0.7032392  0.6309758
        [4]     chr1       137692-138124      * |  0.0661497  0.0278072
        [5]     chr1       138272-139644      * | -0.5042218 -0.3458275
        ...      ...                 ...    ... .        ...        ...
  [1041010]     chrX 156009722-156010137      * | -74.075252  -0.021592
  [1041011]     chrX 156016191-156016823      * |   1.005953   0.693722
  [1041012]     chrX 156019811-156020191      * |   2.068836   1.204477
  [1041013]     chrX 156025028-156025591      * |  -1.559864  -0.397004
  [1041014]     chrX 156030021-156030865      * |   0.701188   0.743172
     

In [21]:
df.output = as.data.frame(pairs.E2G.res,row.names = NULL)
colnames(df.output)[1] = "chr"
df.output[,"CellType"] = "K562"
df.output = df.output[,c("chr",
                         "start",
                         "end",
                         "TargetGene",
                         "CellType",
                         "beta",
                         "zvalue",
                         "stde",
                         "Pval",
                         "logPval")]
data.table::fwrite(df.output,
                   file = paste(dir.output,"pairs.E2G.res.tsv.gz",sep = "/"),
                   row.names = F,
                   quote = F,
                   sep = "\t")
df.output

chr     start     end  TargetGene  CellType          beta       zvalue         stde        Pval      logPval
<fct>   <int>   <int>  <chr>       <chr>            <dbl>        <dbl>        <dbl>       <dbl>        <dbl>
chr1   115493  115961  OR4F5       K562        1.06659666   0.79753800     1.337362  0.42513865  0.371469416
chr1   135076  135269  OR4F5       K562        1.92483470   1.01073960     1.904382  0.31214108  0.505649076
chr1   136330  137277  OR4F5       K562        0.70323918   0.63097575     1.114526  0.52805638  0.277319707
chr1   137692  138124  OR4F5       K562        0.06614971   0.02780723     2.378867  0.97781590  0.009742906
chr1   138272  139644  OR4F5       K562       -0.50422184  -0.34582746     1.458016  1.00000000  0.000000000
chr1    15894   16509  OR4F5       K562      -70.65753028  -0.02599705  2717.905418  1.00000000  0.000000000
chr1    17271   17713  OR4F5       K562      -72.44242573  -0.02666247  2717.018805  1.00000000  0.000000000
chr1   180712  181916  OR4F5       K562        0.60861363   0.38847602     1.566670  0.69766380  0.156353811
chr1   183179  184664  OR4F5       K562       -1.95632595  -0.66503496     2.941689  1.00000000  0.000000000
chr1   186287  187021  OR4F5       K562      -71.59753613  -0.02634713  2717.470118  1.00000000  0.000000000
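Rows with a beta around -70 and a standard error in the thousands are consistent with (quasi-)complete separation in the logistic fit, for example when a peak is accessible only in cells with zero expression of the gene. This is one plausible reading and has not been checked on this dataset. A toy sketch of the pattern (all values made up):

```r
# Toy data: the peak is accessible (y = 1) only in cells with zero expression.
x <- c(rep(0, 100), rep(1, 100))             # gene expression (made up)
y <- c(rep(1, 10), rep(0, 90), rep(0, 100))  # peak accessibility (made up)

# glm warns about fitted probabilities of 0 or 1 here; the warning is suppressed,
# as it is inside the prediction function.
fit <- suppressWarnings(
  glm(y ~ x, data = data.frame(y = y, x = x), family = binomial(link = "logit"))
)
coefs <- summary(fit)[["coefficients"]]["x", ]
print(coefs)
```

In such fits the estimate drifts toward minus infinity while the Wald z value stays near zero, so these pairs end up with Pval = 1 under the sign rule used above.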


In [22]:
sessionInfo()

R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.10 (Ootpa)

Matrix products: default
BLAS/LAPACK: /maps/projects/ralab/people/lpm537/software/anaconda3/envs/Notebook_E2G_240505/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Copenhagen
tzcode source: system (glibc)

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] data.table_1.15.2    dplyr_1.1.4          foreach_1.5.2       
 [4] GenomicRanges_1.54.1 GenomeInfoDb_1.38.1  IRanges