unable to 'scatter' plot region of chromosome name with punctuation marks #602

amora197 · 2021-04-26T12:57:23Z

Hello,

I am trying to plot a region of a plant chromosome but to no avail. I first plotted successfully the whole chromosome NC_039902.1 of my sample 1-C7 using the following command:

cnvkit.py scatter 1-C7.cnr \
-s 1-C7.cns \
-c NC_039902.1 \
-o 1-C7-NC_039902.1.png

When I want to zoom into a specific region of the chromosome, I get the following error:

ValueError: Invalid range spec: NC_039902.1:10000000-15000000 (should be like: chr1:2333000-2444000)

The command I am using for plotting the region wanted is the following:

cnvkit.py scatter 1-C7.cnr \
-s 1-C7.cns \
-c NC_039902.1:10000000-15000000 \
-o 1-C7-NC_039902.1-zoomed.png

It seems that when I provide a region in -c NC_039902.1:10000000-15000000, the CNVkit raises an error due to the period or point . in the chromosome name NC_039902.1. The error is not raised when I delete the chromosome name suffix .1, but obviously the output figure is empty since the specific chromosome name NC_039902 does not exist in my bed/fasta/bam files.

I am using the original NCBI chromosome names and wish to keep them unchanged. Is there a way around this issue for the cnvkit.py scatter command to tolerate punctuation marks in the chromosome names? Some of the chromosome names I am using are: NC_039898.1 NC_039899.1 NC_039900.1 NC_039901.1 NC_039902.1 etc. And not: chr1 chr2 chr3 etc.

Thanks ahead of time!

-Anibal

The text was updated successfully, but these errors were encountered:

tskir · 2021-04-26T19:05:23Z

Well... let's just say this wasn't a use case that was ever really considered :D

Thank you for reporting this, I've submitted #603 which should fix this problem, hopefully even without breaking anything.

amora197 · 2021-04-27T10:47:39Z

Hi Kirill Tsukanov,

Thank you for the fast response and quick fix!

I went ahead and did the quick change in the skgenome/rangelabel.py for handling NCBI-style chromosome naming as seen in your pull-request. I then re-ran the commands for plotting the whole chromosome and a region of the chromosome. This time the command for zooming into a region ran through and gave me an output, but although the command line can now handle NCBI-style of naming, the output plot is still off. I also noticed that the zoomed-in-region command can output a pdf but not a png. Here I have attached the commands and output pdf's.

No zoom of chromosome NC_039902.1:

Command:

cnvkit.py scatter 1-C7.cnr \
-s 1-C7.cns \
-c NC_039902.1 \
-o 1-C7-NC_039902.1.no_zoom.pdf

Output:
catuai-vs-1-C7-NC_039902.1.no_zoom.pdf

Zoom into region 10000000-15000000 of chromosome NC_039902.1:

Command:

cnvkit.py scatter 1-C7.cnr \
-s 1-C7.cns \
-c NC_039902.1:10000000-15000000 \
-o 1-C7-NC_039902.1.zoom.pdf

Output:
catuai-vs-1-C7-NC_039902.1.zoom.pdf

Is there an edit that can correctly output the zoomed-in region? We are very close to getting the right output!

Thanks ahead of time,
-Anibal

tskir · 2021-04-27T19:00:17Z

@amora197 Hmm, that's interesting, thanks for letting me know. I didn't have any plant test data at hand, so I just ran my tests using human data with one of the chromosomes renamed to NC_* notation. But apparently there's more to it.

Could you share your .cnr and .cns files please so that I can reproduce this? If it can't be shared publicly please send directly to ktsukanov@ebi.ac.uk

tskir · 2021-04-29T07:18:27Z

Note to self: files received

tskir · 2021-05-06T05:37:00Z

PR is merged, but reopening for additional remaining investigations

tskir · 2021-05-09T18:09:49Z

@amora197 Thank you for waiting; I was now able to take a look into this.

It loooks like the gene column in your CNR file, instead of a gene name, contains an extended set of what looks like GFF3 key-value annotations:

chromosome   start  end   gene                                                                                                                          depth  log2       weight
NC_008535.1  1607   1642  ID=exon-CoarCt002-1;Parent=rna-CoarCt002;Dbxref=GeneID:4421840;gbkey=tRNA;gene=trnK-UUU;locus_tag=CoarCt002;product=tRNA-Lys  7      -0.458055  0.518118

This makes cnvkit.py segment interpret each line as a separate gene with a really long name. Assuming that the gene name in the annotation above is trnK-UUU, the CNR file should instead look like this:

chromosome   start  end   gene      depth  log2       weight
NC_008535.1  1607   1642  trnK-UUU  7      -0.458055  0.518118

In turn, the gene names in your CNR file look like this almost certainly because something went wrong during the target construction (cnvkit.py target step). Possibility 1 is that the original BED file you used had gene names like this; possibility 2 is that the gene names ingested from the --annotate flag were in this format and went on to supersede the ones in the BED file.

So, in light of this, could you please elaborate how your target files were constructed? What were the commands/flags used and how does the original (unprocessed) target BED look like?

amora197 · 2021-05-27T12:41:54Z

@tskir Thank you for your reply and for the wait. We took into account your suggestions and tried some troubleshooting.

We have indeed used a GFF3 file to build our bed file (general case for us). The GFF3 file we used was downloaded using the following NCBI websites:

We had previously refrained from editing the gene column. We want to keep as much information as possible for downstream analysis and, more importantly, no individual tag in the column is unique. After your suggestions, we decided to reduce to the ID= field in the gene column since it was at least present in each row. The resulting plot was the following:

We realized that there was non-uniqueness in the ID= field under the gene column, which we figured might cause the glitch. We hence created a unique entry gene name for each row by simply assigning a number. The resulting plot was the following:

We realized that the gene name vertical buildup might be due to overlapping regions, so we merged overlaps with bedtools merge on our bed file. Running bedtools merge discards the gene column, so we once again gave each row a unique gene numerical ID. The resulting plot was the following:

In summary, we were able to zoom into a region of interest after preparing our bed file like so:

sort bed file
merge overlapping regions of bed file
assign unique ID under gene column per row entry

More general, though: Can you please detail how the ideal bed file for running CNVkit needs to look like? More specifically, could we please get advised on the following:

Do gene name column entries need to be unique? What is the max allowed length of the string?
Are overlapping regions allowed? (Imagine overlapping genes, on opposite strands or splice variants)
Is there a minimum region size? and on a similar note: Is CNVkit splitting large regions into smaller windows?

Finally we have a feature request:
Our organisms are triploid. Do you see a scope for an optional ploidy parameter in CNVkit, so that we detect 2 vs 3 (diploid vs triploid) rather than your default 1 vs 2 (diploid vs haploid), a respective command-line flag would be nice.

Thank you again for your assistance. We look forward to your reply.

etal · 2021-05-28T05:10:00Z

Thanks for the details, @amora197 .

CNVkit can read GFF directly; you can give a GFF3/GTF/GFF2 file as input in most places where BED works. You can use the bundled script skg_convert.py to just convert tabular files from one supported format to another:

https://cnvkit.readthedocs.io/en/stable/scripts.html
All input formats: https://github.com/etal/cnvkit/blob/master/skgenome/tabio/__init__.py#L113-L134
GFF implementation: https://github.com/etal/cnvkit/blob/master/skgenome/tabio/gff.py

The GFF reader checks for these tags in order: Name, gene_id, gene_name, gene. It will take the value of the first match and use it as the "gene" for the rest of CNVkit's purposes.

Gene names do not need to be unique. Apparently the gene name is allowed to be pretty long, as you've seen; CNVkit does not enforce any limit, but the plots start to look weird when the strings are very long.

Overlapping regions are allowed. There is no minimum region size; 0 or smaller will tend to get dropped automatically within the CNVkit pipeline.

When you give a target BED as input to the batch or target command, long regions will automatically be split into smaller windows. It's possible to skip this with different command line options. Details here:
https://cnvkit.readthedocs.io/en/stable/pipeline.html#target

CNVkit will work for non-diploid genomes already. The call command has an option --ploidy which defaults to 2, but you can use 3 or another integer here instead.

tskir added bug plots labels Apr 26, 2021

tskir mentioned this issue Apr 26, 2021

Update the region regex to allow NCBI-style chromosome names #603

Merged

tskir added the skgenome label Apr 26, 2021

etal closed this as completed in #603 May 5, 2021

tskir reopened this May 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unable to 'scatter' plot region of chromosome name with punctuation marks #602

unable to 'scatter' plot region of chromosome name with punctuation marks #602

amora197 commented Apr 26, 2021

tskir commented Apr 26, 2021

amora197 commented Apr 27, 2021

tskir commented Apr 27, 2021 •

edited

Loading

tskir commented Apr 29, 2021

tskir commented May 6, 2021

tskir commented May 9, 2021

amora197 commented May 27, 2021 •

edited

Loading

etal commented May 28, 2021

unable to 'scatter' plot region of chromosome name with punctuation marks #602

unable to 'scatter' plot region of chromosome name with punctuation marks #602

Comments

amora197 commented Apr 26, 2021

tskir commented Apr 26, 2021

amora197 commented Apr 27, 2021

No zoom of chromosome NC_039902.1:

Zoom into region 10000000-15000000 of chromosome NC_039902.1:

tskir commented Apr 27, 2021 • edited Loading

tskir commented Apr 29, 2021

tskir commented May 6, 2021

tskir commented May 9, 2021

amora197 commented May 27, 2021 • edited Loading

etal commented May 28, 2021

tskir commented Apr 27, 2021 •

edited

Loading

amora197 commented May 27, 2021 •

edited

Loading