exportBins to seg/vcf when one or zero CNV are called #102

helaersr · 2022-01-27T14:07:51Z

Hello,

There is a problem when outputing seg and vcf file and the total CNV called is zero or one.

There is an example of seg file with one CNV:

rhelaers@ddus:~/qdnaseq$ cat CLP-1271-3.seg
x
/home/users/rhelaers/qdnaseq/CLP-1271-3.seg
21
14000001
25000000
21
0.17

And the vcf:

rhelaers@ddus:~/qdnaseq$ cat CLP-1271-3.vcf
##fileformat=VCFv4.2
##source=QDNAseq-1.30.0
##REF=<ID=DIP,Description="CNV call">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##FILTER=<ID=LOWQ,Description="Filtered due to call in low quality region">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of variant: DEL,DUP,INS">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Length of variant">
##INFO=<ID=BINS,Number=1,Type=Integer,Description="Number of bins in call">
##INFO=<ID=SCORE,Number=1,Type=Integer,Description="Score of calling algorithm">
##INFO=<ID=LOG2CNT,Number=1,Type=Float,Description="Log 2 count">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
x
21
14000001
.
<DIP>
<DUP>
1000
PASS
SVTYPE=DUP;END=25000000;SVLEN=11000000;BINS=21;SCORE=1;LOG2CNT=0.17
GT
0/1

The code I used to get that (I use a unique BAM file of shallow WGS, and the GRCh38 dataset generated by asntech):

library(QDNAseq.hg38)
bins <- getBinAnnotations(binSize=as.numeric(binsize), genome="hg38")
readCounts              <- binReadCounts(bins, bamfiles=bam, chunkSize = 10000000)
readCountsFiltered      <- applyFilters(readCounts, residual=TRUE, blacklist=TRUE)
readCountsFiltered      <- estimateCorrection(readCountsFiltered)
copyNumbers             <- correctBins(readCountsFiltered)
copyNumbersNormalized   <- normalizeBins(copyNumbers)
copyNumbersSmooth       <- smoothOutlierBins(copyNumbersNormalized)
copyNumbersSegmented    <- segmentBins(copyNumbersSmooth, transformFun="sqrt")
copyNumbersSegmented    <- normalizeSegmentedBins(copyNumbersSegmented)
copyNumbersCalled       <- callBins(copyNumbersSegmented)
dir <- paste0(outputdir,"/QDNAseq/")
path <- paste0(dir,sample)
unlink(dir)
dir.create(dir)
setwd(dir)
exportBins(copyNumbersCalled, file=paste0(path,".seg"), format="seg")
exportBins(copyNumbersCalled, file=paste0(path,".vcf"), format="vcf")

My sessionInfo

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

Random number generation:
 RNG:     Mersenne-Twister
 Normal:  Inversion
 Sample:  Rounding

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Biobase_2.54.0      BiocGenerics_0.40.0 QDNAseq.hg38_1.0.0
[4] QDNAseq_1.30.0

loaded via a namespace (and not attached):
 [1] parallelly_1.30.0      DNAcopy_1.68.0         XVector_0.34.0
 [4] GenomicRanges_1.46.1   zlibbioc_1.40.0        IRanges_2.28.0
 [7] BiocParallel_1.28.3    impute_1.68.0          GenomeInfoDb_1.30.0
[10] globals_0.14.0         tools_4.1.0            CGHcall_2.56.0
[13] parallel_4.1.0         R.oo_1.24.0            marray_1.72.0
[16] matrixStats_0.61.0     digest_0.6.29          CGHbase_1.54.0
[19] crayon_1.4.2           GenomeInfoDbData_1.2.7 BiocManager_1.30.16
[22] codetools_0.2-18       R.utils_2.11.0         S4Vectors_0.32.3
[25] bitops_1.0-7           RCurl_1.98-1.5         limma_3.50.0
[28] future.apply_1.8.1     compiler_4.1.0         R.methodsS3_1.8.1
[31] Rsamtools_2.10.0       Biostrings_2.62.0      stats4_4.1.0
[34] future_1.23.0          listenv_0.8.0

And I think I found the problem in the code :-)
In exportBins.R, here in exportSEG <- function(obj, fnames=NULL) {...}

        out <- cbind(fnames[i], chr, pos, end, bins, segVal)
        colnames(out) <- c("SAMPLE_NAME", "CHROMOSOME", "START", "STOP", "DATAPOINTS", "LOG2_RATIO_MEAN")

        ## Drop copy-neutral segments
        out <- out[dsel[posI, 4] != 0, ]
        
        write.table(out, file = fnames[i], quote=FALSE, sep="\t", append=FALSE, col.names=TRUE, row.names=FALSE)

The problem is that out <- out[dsel[posI, 4] != 0, ] generates a vector instead of a dataframe if the result has a size of 1 or 0. Then write.table outputs the vector as lines and not as columns.

If I make some check on the size of out, then something like out <- data.frame(matrix(out[dsel[posI, 4] != 0, ], ncol = 6, nrow = 1)), it works.

The text was updated successfully, but these errors were encountered:

HenrikBengtsson · 2022-01-27T18:14:55Z

Hi. Thanks for reporting. In order to troubleshoot and fix this, could you please do:

saveRDS(copyNumbersCalled, file = "issue_102.rds.zip")

and attach that file in a comment here? We use a fake *.zip extension, because GitHub doesn't support uploading of files with an *.rds extension.

helaersr · 2022-01-27T19:02:49Z

Here it is !

issue_102.rds.zip

HenrikBengtsson · 2022-01-27T19:15:15Z

Thanks, I can now get the same:

library(QDNAseq)
copyNumbersCalled <- readRDS("issue_102.rds")
file <- exportBins(copyNumbersCalled, file="issue_102.seg", format="seg")
cat(readLines(file), sep = "\n")
x
issue_102.seg
21
14000001
25000000
21
0.17

So far, so good. This means anyone can work on this (I cannot promise anything anytime soon myself.)

Now, I'm very rust on these formats and others might be too. Can you clarify what the problem is with the SEG output and what you'd expect it to be instead? That would help too. Thanks.

ilarischeinin · 2022-01-27T19:32:35Z

I’m on phone only at the moment so can’t test it, but I’d imagine this line might just need a “, drop=FALSE”.

QDNAseq/R/exportBins.R

Line 355 in b77cbd7

out <- out[dsel[posI, 4] != 0, ]

helaersr · 2022-01-27T19:33:17Z

It should be

SAMPLE_NAME CHROMOSOME START STOP DATAPOINTS LOG2_RATIO_MEAN
CLP-1271-3 21 14000001 25000000 21 0.17

The header is missing and the data is in rows while it should be in columns.
Same for VCF, should be

##fileformat=VCFv4.2
##source=QDNAseq-1.30.0
##REF=<ID=DIP,Description="CNV call">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##FILTER=<ID=LOWQ,Description="Filtered due to call in low quality region">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of variant: DEL,DUP,INS">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Length of variant">
##INFO=<ID=BINS,Number=1,Type=Integer,Description="Number of bins in call">
##INFO=<ID=SCORE,Number=1,Type=Integer,Description="Score of calling algorithm">
##INFO=<ID=LOG2CNT,Number=1,Type=Float,Description="Log 2 count">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CLP-1271-3 21
21 14000001 . <DIP> <DUP> 1000 PASS SVTYPE=DUP;END=25000000;SVLEN=11000000;BINS=21;SCORE=1;LOG2CNT=0.17 GT 0/1

Actually the biggest problem is with the VCF, because it's used downstream with other tools in NGS pipelines

RodrigoGM · 2022-02-08T19:29:21Z

Hi @HenrikBengtsson ,

I came across this issue a few years ago, #58. It was closed but never resolved. Not sure what happened with the pull request.

The issue in exportSEG comes from line 325, with

QDNAseq/R/exportBins.R

Lines 323 to 325 in b77cbd7

    
           for (i in 1:ncol(calls)) {         
        
               d <- cbind(fd[,1:3], calls[,i], segments[,i]) 
        
               sel <- !is.na(d[,4])

I suppose it's the same in exportVCF as the code is strikingly similar:

QDNAseq/R/exportBins.R

Lines 233 to 236 in b77cbd7

    
           for (i in 1:ncol(calls)) {         
        
               d <- cbind(fd[,1:3], calls[,i], segments[,i]) 
        
               sel <- !is.na(d[,4])

A solution I had for exportSeg was to have a conditional to include or exclude zeros, for example:
https://github.com/RodrigoGM/QDNAseq/blob/530ee6ff713add0ac4df2363ddbab3ef7937a781/R/exportBins.R#L243

exportSEG <- function(obj, fnames=NULL, includeZero=FALSE, ...) {
...
    for (i in 1:ncol(calls)) {	
	d <- cbind(fd[,1:3],calls[,i], segments[,i])

	if(!includeZero) {
		sel <- d[,4] != 0 & !is.na(d[,4])
	} else {
		sel <- !is.na(d[,4])
	}

instead of just :

QDNAseq/R/exportBins.R

Line 300 in b77cbd7

exportSEG <- function(obj, fnames=NULL) {

QDNAseq/R/exportBins.R

Lines 323 to 325 in b77cbd7

    
           for (i in 1:ncol(calls)) {         
        
               d <- cbind(fd[,1:3], calls[,i], segments[,i]) 
        
               sel <- !is.na(d[,4])

I have a version with the issue fixed that you can use, but it's an old QDNAseq version from 2019, that' you're welcome to test: https://github.com/RodrigoGM/QDNAseq/.

Hope this helps,

HenrikBengtsson · 2022-02-24T01:34:22Z

@ilarischeinin, could you look into fixing this?

Hopefully fixes #102 (untested)

vlshesketh · 2023-02-03T09:59:54Z

Hi, I have also seen this behaviour, so it's great to see a fix has been issued - will this be released soon?

HenrikBengtsson · 2024-01-08T21:31:45Z

We need a unit test for this one, in order to deploy it. Can someone please contribute a test that (currently) fails?

HenrikBengtsson added the bug label Feb 24, 2022

ilarischeinin added a commit that referenced this issue Feb 25, 2022

Prevent data.frames from turning into vectors

811c264

Hopefully fixes #102 (untested)

ilarischeinin linked a pull request Feb 25, 2022 that will close this issue

Prevent data.frames from turning into vectors #106

Open

HenrikBengtsson added the help wanted label Jan 8, 2024

HenrikBengtsson added this to the Next release milestone Jul 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

exportBins to seg/vcf when one or zero CNV are called #102

exportBins to seg/vcf when one or zero CNV are called #102

helaersr commented Jan 27, 2022

HenrikBengtsson commented Jan 27, 2022

helaersr commented Jan 27, 2022

HenrikBengtsson commented Jan 27, 2022

ilarischeinin commented Jan 27, 2022

helaersr commented Jan 27, 2022 •

edited

Loading

RodrigoGM commented Feb 8, 2022

HenrikBengtsson commented Feb 24, 2022

vlshesketh commented Feb 3, 2023

HenrikBengtsson commented Jan 8, 2024

exportBins to seg/vcf when one or zero CNV are called #102

exportBins to seg/vcf when one or zero CNV are called #102

Comments

helaersr commented Jan 27, 2022

HenrikBengtsson commented Jan 27, 2022

helaersr commented Jan 27, 2022

HenrikBengtsson commented Jan 27, 2022

ilarischeinin commented Jan 27, 2022

helaersr commented Jan 27, 2022 • edited Loading

RodrigoGM commented Feb 8, 2022

HenrikBengtsson commented Feb 24, 2022

vlshesketh commented Feb 3, 2023

HenrikBengtsson commented Jan 8, 2024

helaersr commented Jan 27, 2022 •

edited

Loading