Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exportBins to seg/vcf when one or zero CNV are called #102

Open
helaersr opened this issue Jan 27, 2022 · 9 comments · May be fixed by #106
Open

exportBins to seg/vcf when one or zero CNV are called #102

helaersr opened this issue Jan 27, 2022 · 9 comments · May be fixed by #106

Comments

@helaersr
Copy link

Hello,

There is a problem when outputing seg and vcf file and the total CNV called is zero or one.

There is an example of seg file with one CNV:

rhelaers@ddus:~/qdnaseq$ cat CLP-1271-3.seg
x
/home/users/rhelaers/qdnaseq/CLP-1271-3.seg
21
14000001
25000000
21
0.17

And the vcf:

rhelaers@ddus:~/qdnaseq$ cat CLP-1271-3.vcf
##fileformat=VCFv4.2
##source=QDNAseq-1.30.0
##REF=<ID=DIP,Description="CNV call">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##FILTER=<ID=LOWQ,Description="Filtered due to call in low quality region">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of variant: DEL,DUP,INS">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Length of variant">
##INFO=<ID=BINS,Number=1,Type=Integer,Description="Number of bins in call">
##INFO=<ID=SCORE,Number=1,Type=Integer,Description="Score of calling algorithm">
##INFO=<ID=LOG2CNT,Number=1,Type=Float,Description="Log 2 count">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
x
21
14000001
.
<DIP>
<DUP>
1000
PASS
SVTYPE=DUP;END=25000000;SVLEN=11000000;BINS=21;SCORE=1;LOG2CNT=0.17
GT
0/1

The code I used to get that (I use a unique BAM file of shallow WGS, and the GRCh38 dataset generated by asntech):

library(QDNAseq.hg38)
bins <- getBinAnnotations(binSize=as.numeric(binsize), genome="hg38")
readCounts              <- binReadCounts(bins, bamfiles=bam, chunkSize = 10000000)
readCountsFiltered      <- applyFilters(readCounts, residual=TRUE, blacklist=TRUE)
readCountsFiltered      <- estimateCorrection(readCountsFiltered)
copyNumbers             <- correctBins(readCountsFiltered)
copyNumbersNormalized   <- normalizeBins(copyNumbers)
copyNumbersSmooth       <- smoothOutlierBins(copyNumbersNormalized)
copyNumbersSegmented    <- segmentBins(copyNumbersSmooth, transformFun="sqrt")
copyNumbersSegmented    <- normalizeSegmentedBins(copyNumbersSegmented)
copyNumbersCalled       <- callBins(copyNumbersSegmented)
dir <- paste0(outputdir,"/QDNAseq/")
path <- paste0(dir,sample)
unlink(dir)
dir.create(dir)
setwd(dir)
exportBins(copyNumbersCalled, file=paste0(path,".seg"), format="seg")
exportBins(copyNumbersCalled, file=paste0(path,".vcf"), format="vcf")

My sessionInfo

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

Random number generation:
 RNG:     Mersenne-Twister
 Normal:  Inversion
 Sample:  Rounding

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biobase_2.54.0      BiocGenerics_0.40.0 QDNAseq.hg38_1.0.0
[4] QDNAseq_1.30.0

loaded via a namespace (and not attached):
 [1] parallelly_1.30.0      DNAcopy_1.68.0         XVector_0.34.0
 [4] GenomicRanges_1.46.1   zlibbioc_1.40.0        IRanges_2.28.0
 [7] BiocParallel_1.28.3    impute_1.68.0          GenomeInfoDb_1.30.0
[10] globals_0.14.0         tools_4.1.0            CGHcall_2.56.0
[13] parallel_4.1.0         R.oo_1.24.0            marray_1.72.0
[16] matrixStats_0.61.0     digest_0.6.29          CGHbase_1.54.0
[19] crayon_1.4.2           GenomeInfoDbData_1.2.7 BiocManager_1.30.16
[22] codetools_0.2-18       R.utils_2.11.0         S4Vectors_0.32.3
[25] bitops_1.0-7           RCurl_1.98-1.5         limma_3.50.0
[28] future.apply_1.8.1     compiler_4.1.0         R.methodsS3_1.8.1
[31] Rsamtools_2.10.0       Biostrings_2.62.0      stats4_4.1.0
[34] future_1.23.0          listenv_0.8.0

And I think I found the problem in the code :-)
In exportBins.R, here in exportSEG <- function(obj, fnames=NULL) {...}

        out <- cbind(fnames[i], chr, pos, end, bins, segVal)
        colnames(out) <- c("SAMPLE_NAME", "CHROMOSOME", "START", "STOP", "DATAPOINTS", "LOG2_RATIO_MEAN")

        ## Drop copy-neutral segments
        out <- out[dsel[posI, 4] != 0, ]
        
        write.table(out, file = fnames[i], quote=FALSE, sep="\t", append=FALSE, col.names=TRUE, row.names=FALSE)

The problem is that out <- out[dsel[posI, 4] != 0, ] generates a vector instead of a dataframe if the result has a size of 1 or 0. Then write.table outputs the vector as lines and not as columns.

If I make some check on the size of out, then something like out <- data.frame(matrix(out[dsel[posI, 4] != 0, ], ncol = 6, nrow = 1)), it works.

@HenrikBengtsson
Copy link
Collaborator

Hi. Thanks for reporting. In order to troubleshoot and fix this, could you please do:

saveRDS(copyNumbersCalled, file = "issue_102.rds.zip")

and attach that file in a comment here? We use a fake *.zip extension, because GitHub doesn't support uploading of files with an *.rds extension.

@helaersr
Copy link
Author

Here it is !

issue_102.rds.zip

@HenrikBengtsson
Copy link
Collaborator

Thanks, I can now get the same:

library(QDNAseq)
copyNumbersCalled <- readRDS("issue_102.rds")
file <- exportBins(copyNumbersCalled, file="issue_102.seg", format="seg")
cat(readLines(file), sep = "\n")
x
issue_102.seg
21
14000001
25000000
21
0.17

So far, so good. This means anyone can work on this (I cannot promise anything anytime soon myself.)

Now, I'm very rust on these formats and others might be too. Can you clarify what the problem is with the SEG output and what you'd expect it to be instead? That would help too. Thanks.

@ilarischeinin
Copy link
Member

I’m on phone only at the moment so can’t test it, but I’d imagine this line might just need a “, drop=FALSE”.

out <- out[dsel[posI, 4] != 0, ]

@helaersr
Copy link
Author

helaersr commented Jan 27, 2022

It should be

SAMPLE_NAME CHROMOSOME START STOP DATAPOINTS LOG2_RATIO_MEAN
CLP-1271-3 21 14000001 25000000 21 0.17

The header is missing and the data is in rows while it should be in columns.
Same for VCF, should be

##fileformat=VCFv4.2
##source=QDNAseq-1.30.0
##REF=<ID=DIP,Description="CNV call">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##FILTER=<ID=LOWQ,Description="Filtered due to call in low quality region">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of variant: DEL,DUP,INS">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Length of variant">
##INFO=<ID=BINS,Number=1,Type=Integer,Description="Number of bins in call">
##INFO=<ID=SCORE,Number=1,Type=Integer,Description="Score of calling algorithm">
##INFO=<ID=LOG2CNT,Number=1,Type=Float,Description="Log 2 count">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CLP-1271-3 21
21 14000001 . <DIP> <DUP> 1000 PASS SVTYPE=DUP;END=25000000;SVLEN=11000000;BINS=21;SCORE=1;LOG2CNT=0.17 GT 0/1

Actually the biggest problem is with the VCF, because it's used downstream with other tools in NGS pipelines

@RodrigoGM
Copy link

Hi @HenrikBengtsson ,

I came across this issue a few years ago, #58. It was closed but never resolved. Not sure what happened with the pull request.

The issue in exportSEG comes from line 325, with

QDNAseq/R/exportBins.R

Lines 323 to 325 in b77cbd7

for (i in 1:ncol(calls)) {
d <- cbind(fd[,1:3], calls[,i], segments[,i])
sel <- !is.na(d[,4])

I suppose it's the same in exportVCF as the code is strikingly similar:

QDNAseq/R/exportBins.R

Lines 233 to 236 in b77cbd7

for (i in 1:ncol(calls)) {
d <- cbind(fd[,1:3], calls[,i], segments[,i])
sel <- !is.na(d[,4])

A solution I had for exportSeg was to have a conditional to include or exclude zeros, for example:
https://github.com/RodrigoGM/QDNAseq/blob/530ee6ff713add0ac4df2363ddbab3ef7937a781/R/exportBins.R#L243

exportSEG <- function(obj, fnames=NULL, includeZero=FALSE, ...) {
...
    for (i in 1:ncol(calls)) {	
	d <- cbind(fd[,1:3],calls[,i], segments[,i])

	if(!includeZero) {
		sel <- d[,4] != 0 & !is.na(d[,4])
	} else {
		sel <- !is.na(d[,4])
	}

instead of just :

exportSEG <- function(obj, fnames=NULL) {

QDNAseq/R/exportBins.R

Lines 323 to 325 in b77cbd7

for (i in 1:ncol(calls)) {
d <- cbind(fd[,1:3], calls[,i], segments[,i])
sel <- !is.na(d[,4])

I have a version with the issue fixed that you can use, but it's an old QDNAseq version from 2019, that' you're welcome to test: https://github.com/RodrigoGM/QDNAseq/.

Hope this helps,

@HenrikBengtsson
Copy link
Collaborator

@ilarischeinin, could you look into fixing this?

ilarischeinin added a commit that referenced this issue Feb 25, 2022
@ilarischeinin ilarischeinin linked a pull request Feb 25, 2022 that will close this issue
@vlshesketh
Copy link

Hi, I have also seen this behaviour, so it's great to see a fix has been issued - will this be released soon?

@HenrikBengtsson
Copy link
Collaborator

We need a unit test for this one, in order to deploy it. Can someone please contribute a test that (currently) fails?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants