Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in CountPeaks() #35

Closed
fciamponi opened this issue Jul 31, 2021 · 5 comments
Closed

Error in CountPeaks() #35

fciamponi opened this issue Jul 31, 2021 · 5 comments

Comments

@fciamponi
Copy link

fciamponi commented Jul 31, 2021

Hello Sierra team,

I was hoping to get some help regarding an issue (see below) that I'm having when running Sierra CountPeaks with a specific sample (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR13043563 and SRR13043565). The pipeline works fine with both the vignette dataset and with other samples from the same experiment/patient (ex. SRR13043564).

I've tried some of the common suggestions found in other issues (ex. checking my gtf annotations, barcode file, bam file index and etc...) but so far nothing worked.

I'm using cellranger's annotation as a reference (refdata-gex-GRCh38-2020-A) and mapped my samples with STAR aligner (since the original experiment is MARS-seq) and followed STARsolo guidelines to make it as close to cellranger as possible.

Thanks in advance for any help you can provide.

Best regards,
Felipe

Error in {: task 32 failed - "1 elements in value to replace 0 elements"
Traceback:

  1. CountPeaks(peak.sites.file = peak.merge.output.file, gtf.file = reference.file,
    . bamfile = bam.file, whitelist.file = whitelist.bc.file, output.dir = outfile,
    . countUMI = TRUE, ncores = 16)
  2. foreach::foreach(each.chr = chr.names, .combine = "rbind", .packages = c("magrittr")) %dopar%
    . {
    . mat.per.chr <- c()
    . message("Processing chr: ", each.chr)
    . for (strand in c(1, -1)) {
    . message(" and strand ", strand)
    . isMinusStrand <- if (strand == 1)
    . FALSE
    . else TRUE
    . peak.sites.chr <- dplyr::filter(peak.sites, Chr ==
    . each.chr & Strand == strand) %>% dplyr::select(Gene,
    . Chr, Fit.start, Fit.end, Strand)
    . peak.sites.chr$Fit.start <- as.integer(peak.sites.chr$Fit.start)
    . peak.sites.chr$Fit.end <- as.integer(peak.sites.chr$Fit.end)
    . peak.sites.chr <- dplyr::filter(peak.sites.chr, Fit.start <
    . Fit.end)
    . if (nrow(peak.sites.chr) == 0) {
    . next
    . }
    . isMinusStrand <- if (strand == 1)
    . FALSE
    . else TRUE
    . which <- GenomicRanges::GRanges(seqnames = each.chr,
    . ranges = IRanges::IRanges(1, max(peak.sites.chr$Fit.end)))
    . param <- Rsamtools::ScanBamParam(tag = c(CBtag, UMItag),
    . which = which, flag = Rsamtools::scanBamFlag(isMinusStrand = isMinusStrand))
    . aln <- GenomicAlignments::readGAlignments(bamfile,
    . param = param)
    . nobarcodes <- which(unlist(is.na(GenomicRanges::mcols(aln)[CBtag])))
    . noUMI <- which(unlist(is.na(GenomicRanges::mcols(aln)[UMItag])))
    . to.remove <- dplyr::union(nobarcodes, noUMI)
    . if (length(to.remove) > 0) {
    . aln <- aln[-to.remove]
    . }
    . whitelist.pos <- which(unlist(GenomicRanges::mcols(aln)[CBtag]) %in%
    . whitelist.bc)
    . aln <- aln[whitelist.pos]
    . if (countUMI) {
    . GenomicRanges::mcols(aln)$CB_UB <- paste0(unlist(GenomicRanges::mcols(aln)[CBtag]),
    . "_", unlist(GenomicRanges::mcols(aln)[UMItag]))
    . uniqUMIs <- which(!duplicated(GenomicRanges::mcols(aln)$CB_UB))
    . aln <- aln[uniqUMIs]
    . }
    . aln <- GenomicRanges::split(aln, unlist(GenomicRanges::mcols(aln)[CBtag]))
    . polyA.GR <- GenomicRanges::GRanges(seqnames = peak.sites.chr$Chr,
    . IRanges::IRanges(start = peak.sites.chr$Fit.start,
    . end = as.integer(peak.sites.chr$Fit.end)))
    . n.polyA <- length(polyA.GR)
    . barcodes.gene <- names(aln)
    . res <- sapply(barcodes.gene, function(x) GenomicRanges::countOverlaps(polyA.GR,
    . aln[[x]]))
    . res.mat <- matrix(0L, nrow = n.polyA, ncol = n.bcs)
    . res.mat[, match(barcodes.gene, whitelist.bc)] <- res
    . mat.per.strand <- Matrix::Matrix(res.mat, sparse = TRUE)
    . polyA.ids <- paste0(peak.sites.chr$Gene, ":", peak.sites.chr$Chr,
    . ":", peak.sites.chr$Fit.start, "-", peak.sites.chr$Fit.end,
    . ":", peak.sites.chr$Strand)
    . rownames(mat.per.strand) <- polyA.ids
    . if (is.null(mat.per.chr)) {
    . mat.per.chr <- mat.per.strand
    . }
    . else {
    . mat.per.chr <- rbind(mat.per.chr, mat.per.strand)
    . }
    . }
    . return(mat.per.chr)
    . }
  3. e$fun(obj, substitute(ex), parent.frame(), e$data)
@fciamponi fciamponi changed the title Error in PeakCount() Error in CountPeaks() Jul 31, 2021
@rj-patrick
Copy link
Contributor

Hi @fciamponi,

Thanks for the information. The easiest thing would be if we could reproduce the error on our end. If you were to take a subset of the bam does it still produce the error? If so, would you please send it (along with the peak whitelist files) and we can work out what's going on.

Cheers,
Ralph

@fciamponi
Copy link
Author

fciamponi commented Aug 7, 2021

Thanks for the quick reply @reprobate.

Yes, I reproduced the error with a small subset of the original bam file (with just 10% of total reads). I'm including the sample here, along with the barcodes and merged peaks information.

However, I also noticed that the same error occured when I create randomized subsets for previsously working .bam files (such as the case with SRR13043564). I then proceeded to create a subset that contains 20% of the reads from each cell (instead of drawing 10~50% of reads randomly from the original bam file) but the problem persisted.

I've uploaded the files to a google drive folder (https://drive.google.com/drive/folders/1IhiZJ_86R2MeEgdT_M6Jaxnz_qssAHbK?usp=sharing). In this case, SRR13043564 runs without issues but SRR13043563 gives the error I mentioned in the opening comment.

In both cases, my command line was (just switching replacing 63-to-64 when necessary):

dir.create('./peakCountsTest/',recursive = TRUE)

whitelist.bc.file <- ./'CellBarcodes.COHEN2021.unique.txt'
reference.file <- './genes.gtf'
bam.file <- '../SRR13043564.bam'
sample <- 'SRR13043564'

outfile <- paste0('./peakCountsTest/',sample,'.counts')

peak.merge.output.file <- "./mergedPeaks.txt"

CountPeaks(peak.sites.file = peak.merge.output.file,
gtf.file = reference.file,
bamfile = bam.file,
whitelist.file = whitelist.bc.file,
output.dir = outfile,
countUMI = TRUE,
ncores = 16)

Best,
Felipe

PS. Sorry for taking so long to reply, it was a very busy week.

@rj-patrick
Copy link
Contributor

Hi Felipe,

Thanks for the data, it was very helpful for debugging. The issue was traced to an evidently rare situation where there was a single peak called in a chromosome, but with no barcodes in the whitelist. I've added a check for this situation which should fix it - I'm now able to run CountPeaks on your example data without error. Let me know if you have any more issues.

Cheers,
Ralph

@fciamponi
Copy link
Author

Thanks, it worked perfectly now!

Sorry for the inconvenience.

Best,
Felipe

@rj-patrick
Copy link
Contributor

No worries, thanks for helping us find that bug!

Cheers,
Ralph

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants