kmerize with hg19 complains of "long vectors not supported yet" #8

Open
dimkonstanto opened this issue Jun 1, 2018 · 3 comments

@dimkonstanto

Greetings,

I am running the mappable function for the hg19 genome:

library(kmap)
library(BiocParallel)

# either
mappable.regions<-mappable("hg19",kmer=50,BPPARAM=MulticoreParam(workers = 1))

# or
mappable.regions<-mappable("hg19",kmer=50)

But eventually I get the following error:

INFO [2018-06-01 15:09:12] Removing non-standard DNA bases
INFO [2018-06-01 15:26:38] Chopping into 50-mers
Error in .Call2("valid_Ranges", x_start, x_end, x_width, PACKAGE = "IRanges") :
long vectors not supported yet: memory.c:3451

Any idea what might have caused this error?

@omsai omsai added the bug label Jun 1, 2018
@omsai
Contributor

omsai commented Jun 1, 2018

Yeah, I need to architecturally overhaul how the storage is managed to work around R's integer limit. This might be similar to #6

At the moment, hg19 can't be kmerized because R caps the length of ordinary vectors at 2^31 - 1 (about 2.147e+9) elements (see https://stackoverflow.com/a/21142236). As the Stack Overflow answer explains, when you access vectors using array notation, the indexes are integers under the covers, but hg19 is > 3 billion base pairs:

> suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg19))
> message(format(2^31-1, big.mark = ","))
2,147,483,647
> message(format(sum(as.numeric(elementNROWS(BSgenome.Hsapiens.UCSC.hg19))), big.mark = ","))
3,137,161,264
> sum(elementNROWS(BSgenome.Hsapiens.UCSC.hg19))
[1] NA
Warning message:
In sum(elementNROWS(BSgenome.Hsapiens.UCSC.hg19)) :
  integer overflow - use sum(as.numeric(.))
>
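
The same limit can be reproduced without loading any genome package. The following is a minimal base-R sketch, not kmap code: the variable names are made up for illustration, and the base-pair total is simply the figure printed in the session above.

```r
# Largest value an R integer can hold: 2^31 - 1.
int_max <- .Machine$integer.max

# Total hg19 base pairs as reported in the session above, stored as a
# double because it does not fit in an integer.
hg19_bp <- 3137161264

# Integer arithmetic past the limit yields NA (with a warning) ...
overflowed <- suppressWarnings(int_max + 1L)

# ... whereas double arithmetic represents the next value exactly.
as_double <- as.numeric(int_max) + 1

message("integer max:   ", format(int_max, big.mark = ","))
message("hg19 exceeds:  ", hg19_bp > int_max)
message("int overflow:  ", overflowed)
message("double result: ", format(as_double, big.mark = ","))
```

This is why `sum()` over the integer sequence lengths returns `NA`, while wrapping them in `as.numeric()` first gives the expected total.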

Thanks for the feedback!

@omsai omsai changed the title Memory error? Trying to kmerize hg19 complains of "long vectors not supported yet" Jun 1, 2018
@omsai omsai changed the title Trying to kmerize hg19 complains of "long vectors not supported yet" kmerize with hg19 complains of "long vectors not supported yet" Jun 1, 2018
@omsai
Contributor

omsai commented Jun 4, 2018

Actually, the architectural bug about long vectors was fixed upstream. I did a bit more searching for "long vectors bioconductor" and found that this issue was fixed in R-3.4 devel:
https://support.bioconductor.org/p/101439/
The R 3.4 workaround of running useDevel(), as Aaron Lun suggests in the link above, no longer works for me, probably because the newer Bioconductor devel was released for R 3.5:

> version$version.string
[1] "R version 3.4.2 (2017-09-28)"
> BiocInstaller::useDevel()
Error: 'devel' version not available

This leaves upgrading to R-3.5:

# In R 3.5:
source("https://bioconductor.org/biocLite.R")
biocLite(c("devtools", "BSgenome.Hsapiens.UCSC.hg19"))
devtools::install_github("coregenomics/kmap", repos = BiocInstaller::biocinstallRepos())
library(kmap)
library(BiocParallel)
mappable_regions <- mappable("hg19", kmer = 50, BPPARAM = SerialParam())

I haven't tested this yet because my lab machine keeps running out of memory even with a single core. I'm in the process of installing R 3.5 and trying kmap on our university cluster, and will let you know.

omsai added a commit that referenced this issue Jun 11, 2018
kmerize():

- slidingWindows() in R 3.5.0 / Bioconductor 3.7 natively produces
  GRangesList from GRanges input and no longer requires higher level
  BPPARAM parallelization.

gr_masked():

- Coercion to RangesList no longer supported or necessary; can
  directly coerce to IRangesList.
- Drop esoteric ir2gr() function in lieu of lower memory footprint
  expand_rle().
@omsai
Contributor

omsai commented Jun 12, 2018

The fault persists with R 3.5 when running kmap on the university cluster:

INFO [2018-06-11 23:24:38] Removing non-standard DNA bases
INFO [2018-06-11 23:33:29] Chopping into 50-mers
Error in .Call2("Ranges_validate", x_start, x_end, x_width, PACKAGE = "IRanges") : 
  long vectors not supported yet: memory.c:3486
Calls: mappable ... anyStrings -> isTRUE -> validityMethod -> valid.func -> .Call2

Will post to the bioc-devel mailing list.
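
Until there is an upstream answer, one possible direction for the storage overhaul mentioned earlier in this thread is to process sequences in batches whose k-mer window counts each stay below the integer limit. The following is only a rough base-R sketch under that assumption: `kmer_window_counts()` and `plan_batches()` are hypothetical helpers that do not exist in kmap, the lengths are made-up numbers chosen so the total exceeds 2^31 - 1, and whether the IRanges validity check passes on each batch is untested.

```r
# Number of k-mer windows per sequence, computed in double precision so
# that totals cannot overflow R's 32-bit integers.
kmer_window_counts <- function(seq_lengths, k) {
  pmax(as.numeric(seq_lengths) - k + 1, 0)
}

# Greedily assign sequences, in order, to batches so that no batch needs
# more than `limit` windows (hypothetical helper, not part of kmap).
plan_batches <- function(seq_lengths, k, limit = .Machine$integer.max) {
  counts <- kmer_window_counts(seq_lengths, k)
  if (any(counts > limit)) {
    stop("a single sequence exceeds the limit and would need splitting")
  }
  batch <- integer(length(counts))
  current <- 1L
  running <- 0
  for (i in seq_along(counts)) {
    # Start a new batch when adding this sequence would cross the limit.
    if (running + counts[i] > limit) {
      current <- current + 1L
      running <- 0
    }
    batch[i] <- current
    running <- running + counts[i]
  }
  split(names(seq_lengths), batch)
}

# Made-up lengths, chosen only so that the total exceeds 2^31 - 1.
toy_lengths <- c(seqA = 1.5e9, seqB = 1.2e9, seqC = 0.5e9)
batches <- plan_batches(toy_lengths, k = 50)
print(batches)
```

With these toy lengths the first sequence ends up in its own batch and the other two share a second one. Real chromosomes are far shorter than the limit, so the same idea would group many of them per batch.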
