Issue with Unite function and large dataset #63
do you get the same error with destrand=TRUE?
On Wed, Mar 15, 2017 at 8:34 PM, Andrew Johnston wrote:
Hi Akalin,
I'm getting the following error when uniting my samples with the following code:
meth<-unite(object = familyObj, destrand=TRUE, by=.EACHI, save.db = F)
## Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 216308346 rows; more than 60136491 = nrow(x)+nrow(i).
Check for duplicate key values in i each of which join to the same
group in x over and over again. If that’s ok, try by=.EACHI to run
j for each group to avoid the large allocation. If you are sure you
wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please
search for this error message in the FAQ, Wiki, Stack Overflow and
datatable-help for advice.
I've checked to make sure that none of my samples have duplicate chr and coordinates using the following one-liner:
$ cut -f 2,3 ${dir}BSTag-Sample-${SampleName}-reprep_CpG.methylKit | sort | uniq -cd
I've also tried replicating the problem with only 2-3 samples, but I received no error. I then tried to subset my 17 samples to just the first 1,000,000 lines; this produced no error. I tried again with the first 5 million, and also received no error. I tried 10 million, and I got the same error (albeit with fewer duplicates mentioned). I wanted to shrink the file sizes down, so I subset the files to only rows > 5,000,000 and less than 10,000,000; this returned the error:
## Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 11620518 rows; more than 7736026 = nrow(x)+nrow(i).
Check for duplicate key values in i each of which join to the same
group in x over and over again. If that’s ok, try by=.EACHI to run
j for each group to avoid the large allocation. If you are sure you
wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please...
I tried to subset the files even further, such as 5,000,000 < rows <
7,500,000 and 7,500,000 < rows < 10,000,000, but this did not replicate the
error. So unfortunately, I'm making available some large files which
replicate the error.
The following is the code that I used (using the most recent github
install of methylKit):
suppressPackageStartupMessages({
  library(methylKit)
})
library(RColorBrewer)
library(gplots)
library(knitr)
file.list=list("BS-TAG-Sample-77_CpG_percentFix_5000000.methylKit",
"BS-TAG-Sample-78_CpG_percentFix_5000000.methylKit",
"BS-TAG-Sample-79_CpG_percentFix_5000000.methylKit",
"BS-TAG-Sample-80_CpG_percentFix_5000000.methylKit",
"BS-TAG-Sample-81_CpG_percentFix_5000000.methylKit",
"BS-TAG-Sample-82_CpG_percentFix_5000000.methylKit",
"BS-TAG-Sample-83_CpG_percentFix_5000000.methylKit",
"BSTag-Sample-84-reprep_CpG_percentFix_5000000.methylKit",
"BS-TAG-Sample-85_CpG_percentFix_5000000.methylKit",
"BS-TAG-Sample-86_CpG_percentFix_5000000.methylKit",
"BS-TAG-Sample-87_CpG_percentFix_5000000.methylKit",
"BS-TAG-Sample-88_CpG_percentFix_5000000.methylKit",
"BSTag-Sample-89-reprep_CpG_percentFix_5000000.methylKit",
"BS-TAG-Sample-90_CpG_percentFix_5000000.methylKit",
"BS-TAG-Sample-91_CpG_percentFix_5000000.methylKit",
"BS-TAG-Sample-92_CpG_percentFix_5000000.methylKit",
"BS-TAG-Sample-93_CpG_percentFix_5000000.methylKit")
familyObj=methRead(file.list, sample.id=list("gm77", "gm78", "gm79", "gm80",
"gm81", "gm82", "gm83", "gm84", "gm85", "gm86", "gm87", "gm88",
"gm89", "gm90", "gm91","gm92", "gm93"), assembly="hg38_decoy_EBV",
treatment=rep(0,17), context="CpG", mincov = 3)
meth_debug<-unite(object = familyObj, destrand=TRUE, by=.EACHI, save.db = F)
# I've also tried
meth_debug<-unite(object = familyObj, destrand=TRUE)
I've placed the files in the following Google Drive folder (sorry about the size of the files):
https://drive.google.com/drive/folders/0B1bAk0VjZmbVMURzOVJrWXVraUE?usp=sharing
Thank you for your help,
Andrew
I ran every unite command with destrand=TRUE. Would you like me to try destrand=FALSE? Or am I missing something here?
Yes, could you do that? Also make sure the files are tab separated, otherwise the uniqueness check you are running will not work. If you don't get an error with destrand=FALSE, the duplication might be introduced internally and could be a bug. If you still get the error with destrand=FALSE, setting by=.EACHI might help, but that has to be set by us inside the function; you probably can't pass it as an argument.
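For anyone hitting the same message: it is raised by data.table's join machinery rather than by methylKit itself. A minimal sketch that reproduces the same class of error with made-up toy tables (an illustration of the flags the message mentions, not methylKit's internal code):

library(data.table)

# Toy tables keyed on the same columns, with duplicated keys on both sides,
# so the join result would exceed nrow(x) + nrow(i).
x <- data.table(chr = "chr1", pos = rep(1L, 3), val = 1:3, key = c("chr", "pos"))
i <- data.table(chr = "chr1", pos = rep(1L, 3), key = c("chr", "pos"))

try(x[i])                      # "Join results in 9 rows; more than 6 = nrow(x)+nrow(i) ..."
x[i, allow.cartesian = TRUE]   # accept the blown-up result
x[i, .N, by = .EACHI]          # run j per matching group instead of materializing the join

# In unite() the join is built internally, which is why allow.cartesian / by=.EACHI
# would have to be set inside methylKit rather than passed through by the caller.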
Sorry for the delay, the HPC was shut down for maintenance for a few days. The files are all tab separated. I am still getting the error when destrand=F. I haven't tried setting by=.EACHI, since you said that it will have to be set by you in the function. If you think it'll help, of course I'll try it with destrand=F; I tried with destrand=T, and it did not work.
Trying with destrand=F will help us understand where the bug might be coming from, and whether it is on our side or in data.table. If you could try it and let me know, that would be great.
Best,
Altuna
Hey Altuna, Sorry for being lazy haha. I did run it as:
Hey Altuna, any other suggestions for how I can help to debug? Thanks,
If this is a data.table error and subsetting the data to smaller chunks works, you can try to use methylDB objects; these are essentially tabix files. During the unite function, the objects are processed chromosome by chromosome. If you get the error there, then there might really be a duplicate, but this is unlikely. Using methylDB objects might just work for you.
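For later readers, a minimal sketch of what reading the same data as methylDB (tabix-backed) objects might look like; the file names, sample ids and dbdir path are placeholders taken loosely from the example above, not a tested command:

library(methylKit)

# With dbtype = "tabix", each sample is stored as a bgzipped, tabix-indexed file
# under dbdir instead of being held fully in memory; unite() then works on these
# methylRawListDB objects chromosome by chromosome.
file.list <- list("BS-TAG-Sample-77_CpG_percentFix_5000000.methylKit",
                  "BS-TAG-Sample-78_CpG_percentFix_5000000.methylKit")   # placeholder subset

familyObjDB <- methRead(file.list,
                        sample.id = list("gm77", "gm78"),
                        assembly  = "hg38_decoy_EBV",
                        treatment = c(0, 0),
                        context   = "CpG",
                        mincov    = 3,
                        dbtype    = "tabix",
                        dbdir     = "methylDB")

methDB <- unite(familyObjDB, destrand = TRUE)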
Thanks so much for the help. Of course it was something seemingly innocuous: it turns out there were commented-out lines preceding each chromosome; removing them made it work!
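The exact cleanup step wasn't posted; as a hedged sketch, one way to drop such commented-out lines from a plain methylKit text input in R could look like this (the file names are placeholders):

# Drop every line starting with "#" from a plain per-sample methylKit text file.
infile  <- "BSTag-Sample-84-reprep_CpG.methylKit"        # placeholder name
outfile <- "BSTag-Sample-84-reprep_CpG.clean.methylKit"  # placeholder name

lines <- readLines(infile)
writeLines(lines[!startsWith(lines, "#")], outfile)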
@AJEinstein, I am having the same problem! Can you please describe how you removed each of the commented-out lines? Will the .bgz and .bgz.tbi files still correspond correctly with the .txt file if I remove these lines? And were you referring to lines that begin with a #? (I pasted my example below.)
#Date:2022-10-12 22:46:33
Hi @krof7d, the tabix header you are referring to (your example) was introduced in a later version of methylKit.
Thank you, I see. Well, I did try to delete the first 9 commented-out lines from every .txt file anyway, and I still received the same error message (the one beginning "Reading file."). I have tried the unite function every which way (with min.per.group and without it, and with destrand and without it as well). See below for all of my details. I am using sorted and indexed BAM files generated by the Bismark aligner; the data come from whole-genome enzymatic methylation sequencing. I have the 3 basic scripts I have run on all 170 of my files, the 3rd script being the one where I cannot get the unite function to run.
Script 1 (fragment as pasted):
file.list <- list(
library(methylKit)
assembly="hg38",

Script 2 (fragment as pasted):
#!/bin/env Rscript
file.list <- list( etc...)
myobjDB = methRead( etc...),
etc... ),
context = "CpG",

Script 3 (the one where unite fails):
#!/bin/env Rscript
library(methylKit)
file.list <- list(
myobj = methRead(
length(myobj)
meth=unite(myobj, allow.cartesian=TRUE, destrand=FALSE)
myDiff=calculateDiffMeth(meth, mc.cores=2)
myDiff10p.hyper=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hyper")
myDiff10p.hypo=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hypo")
myDiff10p=getMethylDiff(myDiff,difference=10,qvalue=0.05)

----------------------------------------------------
Again, here is the error message I get:
Hi @krof7d,
In your third script, where you want to unite the samples, you are supposed to read in the files with the ".txt.bgz" extension. I quickly checked with some samples and, unfortunately, we do not break on loading the uncompressed tabix files; even uniting the samples worked in my case.
Thus, to fix your unite script, please use the files with the ".txt.bgz" extension.
@alexg9010 Okay, I tried using the .txt.bgz files and this error message was returned:
Error in (function (classes, fdef, mtable) :
Plus, when I just look at the length of my object, it says zero:
Received list of locations.
So there is definitely something wrong with the .txt.bgz files? Should I just try rerunning everything from the beginning? But processBismarkAln takes a really long time, so I do not have the time to rerun that one from the beginning again.
Hi @krof7d,
This is weird, could you please show the code that you used? Also, could you please check that the tabix files actually still exist?
Best,
In your call to methRead, did you include the argument dbtype = "tabix"?
Okay, sorry about that, so I have now included dbtype = "tabix" into the same exact code from above, and this was my error message:
checking wether tabix file already exists:
merging tabix files ...
So then I tried the same exact code from above (with min.per.group this time, plus I also added destrand=TRUE because I did not realize I still had it set to FALSE). So here is what the code looked like:
#!/bin/env Rscript
#setwd("/data/methylseq_bam.sort.bai/processbismarkaln/methylDB")
library(methylKit)
file.list <- list(
myobj = methRead(
length(myobj)
#meth=unite(myobj, allow.cartesian=TRUE, destrand=FALSE)
myDiff=calculateDiffMeth(meth, mc.cores=2)
myDiff10p.hyper=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hyper")
myDiff10p.hypo=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hypo")
myDiff10p=getMethylDiff(myDiff,difference=10,qvalue=0.05)
And here was the error message I received:
checking wether tabix file already exists:
merging tabix files ...
Here was the output of length(myobj) (so the myobj seemed to be successful). And it also looked like a methylBase file was starting to be made -- see pic pasted below. Here is what the methylBase file looked like in Notepad after I unzipped it:
I also tried running the same exact code from above in this same comment with destrand = FALSE. That job quickly did not work, and here was my error message:
checking wether tabix file already exists:
merging tabix files ...
The error messages seem to indicate that there are no CpGs overlapping between any two samples, but this is quite unexpected with this large number of samples. In general, this could be due to low coverage: in the methRead function the default minimum coverage (mincov) is 10 reads per base, so all bases with lower coverage are omitted. Setting this to a lower value should increase the number of bases. How many CpGs are roughly covered per sample, and how does the coverage distribution of your samples look when using the function getCoverageStats?
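A short sketch of the kind of check being suggested, assuming myobj is the methylRawList (or methylRawListDB) read in above; the index 1 simply picks the first sample:

library(methylKit)

# Per-sample coverage and methylation summaries (repeat or lapply over samples).
getCoverageStats(myobj[[1]], plot = FALSE, both.strands = FALSE)
getMethylationStats(myobj[[1]], plot = FALSE, both.strands = FALSE)

# Rough number of covered CpGs per sample at the current mincov threshold
# (for DB-backed objects, getData() loads that sample's data into memory).
sapply(myobj, function(x) nrow(getData(x)))

# If coverage is the bottleneck, re-read with a lower minimum coverage, e.g. mincov = 3,
# and/or relax unite() so a base only needs to be covered in part of each group:
# meth <- unite(myobj, min.per.group = 1L)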
@alexg9010 Here was the error message that was returned:
Once I get this part working again, I will definitely check on the methylation stats and get back to you. Thank you so much for your help. And here was the code I used (the same as we finally had working in the last comments):
file.list <- list(
myobj = methRead(
Hi @krof7d,
Sorry for the belated reply. The error occurs when trying to index the tabix file and indicates that the corresponding file is not compressed using bgzip.
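A hedged sketch of the kind of re-compression being suggested, using Rsamtools; the file name is a placeholder and this assumes the plain tab-separated text file is still present:

library(Rsamtools)

# Re-compress the plain text file with bgzip (block gzip), not plain gzip.
txt <- "MRS03.txt"                                        # placeholder path
bgz <- bgzip(txt, dest = paste0(txt, ".bgz"), overwrite = TRUE)
bgz                                                       # path to the new .txt.bgz file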
My files are bgzipped. I don't get whether you were saying 1) you knew my files were bgzipped already, and that I had to uncompress and then compress them again anyway to get it to work, OR 2) you thought my files were not bgzipped but instead compressed a different way? Remember, the same exact methRead command/code that I had run before, and that had worked on the same exact .txt.bgz files, is not working anymore. Nothing changed with my .txt.bgz files. Are you sure there is not something wrong with the methylKit code now? For example, I noticed that you raised a ticket, #272, to verify that the file is tabix -- are you sure something was not messed up with the methylKit code in doing so? I did perform re-bgzipping from the .txt files anyway (if this is what you meant -- I couldn't find an uncompress option using bgzip). Here was how I did it:
library(Rsamtools)
Then, when I use the newly bgzipped files, I am still getting the same exact error / having the same exact problem as in my last comment (the code is exactly the same as in my last comment). So therefore, I cannot even read in files anymore to get a myobj, to check the coverage stats and figure out why the unite function doesn't work for me (re: your comment from 7 days ago). Thanks,
Hi @krof7d,
The error message you posted before indicates something wrong with the compression of the file. My first assumption was that it got only gzipped, or not compressed at all, as a side effect of the unzipping and checking with Notepad. That is why I asked you to re-compress; however, it might be a different issue. Could you please paste here the output of the following command:
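The command itself was posted as a formatted block that did not survive here. As a generic alternative, not the command referenced above, one way to check from R whether a file is BGZF-compressed rather than plain-gzipped is to inspect its first bytes; this sketch follows the BGZF header layout described in the SAM specification:

# BGZF files are gzip files whose first block carries an extra "BC" subfield:
# bytes 1-2 are the gzip magic (1f 8b) and bytes 13-14 are "B" "C".
is_bgzf <- function(path) {
  magic <- readBin(path, what = "raw", n = 16)
  length(magic) >= 14 &&
    identical(magic[1:2], as.raw(c(0x1f, 0x8b))) &&
    identical(magic[13:14], charToRaw("BC"))
}

is_bgzf("MRS03.txt.bgz")   # placeholder path; TRUE for bgzip output, FALSE for plain gzip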
@alexg9010 Never mind, the methRead command is working again for me! I think it was due to me switching between RStudio on my laptop and my work's cluster network when I actually need to submit the job, but then not changing the file location correctly in my file.list command:
I will continue to work on checking the methylation stats now, and let you know what they are so we can figure out why the unite function is not working for me. I'll leave the comment I had previously, before I realized what was wrong, under the dotted line below, in case you still care to read the output of sapply:
Not sure why the 'class' column changes from "file" to "gzfile" for the same file, /data/methylseq_bam.sort.bai/processbismarkaln/methylDB/MRS03.txt.bgz, between 1 & 2.
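As a small aside that might save others the same detour, it can be worth asserting that every entry of file.list exists on the machine the job actually runs on before calling methRead; a generic sketch with placeholder paths and file names:

dbdir <- "/data/methylseq_bam.sort.bai/processbismarkaln/methylDB"            # placeholder
file.list <- as.list(file.path(dbdir, c("MRS03.txt.bgz", "MRS04.txt.bgz")))   # placeholder names

# Fail early and loudly if any path is wrong on this machine.
missing <- !file.exists(unlist(file.list))
if (any(missing)) {
  stop("Missing input files: ", paste(unlist(file.list)[missing], collapse = ", "))
}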
Thanks for providing this update!
@alexg9010
Hi @KristenKrolick,