cnvkit export vcf does not work when nbins is used instead of probes in cns header #585

Closed
eriktoo opened this issue Mar 23, 2021 · 4 comments · Fixed by #586

eriktoo commented Mar 23, 2021

cnvkit version: 0.9.8

In my pipeline, after generating .cnr files (diploid X), they are processed by genemetrics without segments, then by segmetrics (with --gender female to avoid shifting X log2 values) to get statistics, and finally by call. Exporting the final call .cns to BED format works fine, but when exporting to VCF the files contain only the header and no CNVs. When running a pipeline that does normal segmentation instead of genemetrics, export to VCF works. The only difference in the header is that the genemetrics case has a column 'n_bins' and the regular segmentation has a column 'probes'.

The responsible line in export.py is 266, which does not check for the presence of n_bins or probes and assumes the latter:

out_dframe = segments.data.reindex(columns=["chromosome", "end", "log2", "probes"])
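To illustrate what that column selection does when `probes` is absent, here is a minimal pandas sketch. The table contents and gene names are made up for illustration, and this is not CNVkit code; it only shows that `reindex` does not raise on a missing column label but creates the column filled with NaN, which may be related to the empty output described above.

```python
import pandas as pd

# Hypothetical stand-in for a genemetrics-derived .cns table: it has
# "n_bins" where a regular segmentation table would have "probes".
segments = pd.DataFrame({
    "chromosome": ["chr1", "chr1"],
    "start": [1000, 5000],
    "end": [4000, 9000],
    "gene": ["GENE_A", "GENE_B"],
    "log2": [-0.8, 0.6],
    "n_bins": [12, 7],
})

# The same column selection as in the line quoted above.
out_dframe = segments.reindex(columns=["chromosome", "end", "log2", "probes"])

# reindex does not fail on the missing label; it adds the column as all-NaN.
probes_all_missing = bool(out_dframe["probes"].isna().all())
print(out_dframe)
print("probes column entirely NaN:", probes_all_missing)
```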

A quick fix is to run the files with n_bins through sed to convert to probes:

sed 's/n_bins/probes/' "$file"
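Note that the sed command writes to standard output and applies the substitution to the first match on every line, not only the header. A rough Python equivalent that touches only the header row could look like the sketch below. The function name `rename_header_column` and the sample table are hypothetical, and the sketch assumes a tab-separated file with Unix line endings.

```python
def rename_header_column(table_text, old="n_bins", new="probes"):
    """Rename one column in the header row of tab-separated text.

    Only exact matches of a header field are renamed; data rows are
    passed through unchanged.
    """
    header, newline, body = table_text.partition("\n")
    fields = [new if name == old else name for name in header.split("\t")]
    return "\t".join(fields) + newline + body


# Made-up two-line table in the shape of a genemetrics-derived .cns file.
example = (
    "chromosome\tstart\tend\tgene\tlog2\tn_bins\n"
    "chr1\t1000\t4000\tGENE_A\t-0.8\t12\n"
)
fixed = rename_header_column(example)
print(fixed, end="")
```

To apply this to a file, read it with `pathlib.Path(...).read_text()`, pass the text through the function, and write the result to a new path so the original stays untouched.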

Collaborator

tskir commented Mar 24, 2021

We might need to wait for @etal's opinion here, but I don't think the export command was intended to support this case. One of the fields in the output VCF is:

##INFO=<ID=PROBES,Number=1,Type=Integer,Description="Number of probes in CNV">

Since the CNS from genemetrics does not have the probes column, and because the number of probes is not generally the same as the number of bins (as each bin can contain more than one probe), there just isn't enough information to generate the output VCF.

tskir commented Mar 24, 2021

On the other hand, what we could do is just retain the probes column in the genemetrics output. I don't see why not, since this can be a useful metric.

tskir commented Mar 24, 2021

No, wait, I'm wrong. CNN/CNR already contain bins, not the original probes. So the probes in regular CNS and n_bins in genemetrics CNS are exactly the same thing, just differently named. I suppose this can be fixed to make things consistent & allow calling from the genemetrics CNS as well.
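Since the two columns carry the same count, a consumer could also treat `n_bins` as an alias for `probes` before selecting columns. The following is only a sketch of that idea with a hypothetical helper `with_probes_column` and made-up data; it is not the change made in the referenced commit, which, per its message, renames the column in the genemetrics output itself.

```python
import pandas as pd


def with_probes_column(table):
    """Return a copy of the table in which "n_bins" is exposed as "probes".

    Tables that already have "probes" are copied unchanged.
    """
    if "probes" not in table.columns and "n_bins" in table.columns:
        return table.rename(columns={"n_bins": "probes"})
    return table.copy()


# Made-up table shaped like a genemetrics-derived .cns file.
genemetrics_like = pd.DataFrame({
    "chromosome": ["chr1", "chr1"],
    "start": [1000, 5000],
    "end": [4000, 9000],
    "log2": [-0.8, 0.6],
    "n_bins": [12, 7],
})

normalized = with_probes_column(genemetrics_like)
selected = normalized.reindex(columns=["chromosome", "end", "log2", "probes"])
print(selected)
```

The helper returns a new DataFrame rather than renaming in place, so the input table keeps its original header.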

tskir added a commit to tskir/cnvkit that referenced this issue Mar 24, 2021
The column `n_bins` in the genemetrics output means exactly the same
thing as `probes` in regular CNS files. For compatibility, it is better
if they are named the same. This would also allow feeding genemetrics
CNS files to the call and export commands (see etal#585).
Owner

etal commented Mar 25, 2021

It did not occur to me to try running the genemetrics output table through export vcf, and I'm impressed you got this far.

I agree with @tskir's solution to just change the header of the genemetrics output table to use probes instead of n_bins. (I wrote the genemetrics code long after the segmentation code, and I like the semantics of "n_bins" better than "probes", but it would be too disruptive to change the .cnr and .cns headers now.)

etal closed this as completed in #586 Mar 25, 2021
etal pushed a commit that referenced this issue Mar 25, 2021
The column `n_bins` in the genemetrics output means exactly the same
thing as `probes` in regular CNS files. For compatibility, it is better
if they are named the same. This would also allow feeding genemetrics
CNS files to the call and export commands (see #585).