genemetrics completing without error part way through chromosome one #576
Comments
I can confirm this behavior. When I filter my .cnr files to remove chr1, genemetrics outputs only chr2. I have tested with -t down to 0. |
I have tracked this issue as far as reports.py so far. In 'group_by_genes': 'not rows' in
evaluates True for all genes after the first Antitarget entry in the cnarr. If I print the rows which do exist, each Antitarget seems to be associated with the row of the gene immediately following it in the cnarr. To illustrate I added the following print statements to the beginning of the for loop in 'group_by_genes()':
Below is a selection of the output. The first two genes in the output are handled correctly, but Antitarget contains the row for the first gene in chr2. The genes in chr2 have null rows, and the next Antitarget has the row for the first gene in chr3. This pattern continues.
|
I believe I have located the problem in cnary.py. Calling the CopyNumArray.by_gene() method of an instance of CopyNumArray yields:
When genemetrics runs without segments, the CopyNumArray that calls by_gene() contains all bins for all chromosomes, and by_gene() iterates over it chromosome by chromosome. When genemetrics runs with segments, by_gene() is called separately on CopyNumArrays that each represent a single segment. Each of these arrays therefore contains bins from only one chromosome, so by_gene() runs for only one iteration each time.

In both cases, each bin in the CopyNumArray is assigned a name in the form of an integer from 0 to n-1, where n is the total number of bins across all chromosomes. In a CopyNumArray for a segment, the bins might have names ranging over [a:b] such that 0 <= a < b <= n-1. All bins are thus numbered absolutely, with reference to the total number of bins, even when a subset of bins is extracted from the total set.

This matters when by_gene() iterates over each chromosome/segment: a subarray is created containing only the bins for that chromosome/segment. by_gene() then extracts from this subarray the name of each gene and a list containing the names of the bins associated with that gene. It uses these to yield only the bins associated with that gene.

The problem is that CopyNumArrays are accessed like typical arrays, with indices from [0:n-1]. When the whole CopyNumArray is considered, the bin names coincide with their indices, but when a subarray is extracted its indices start at 0 and the gene bin names are out of sync (except in the case of the first chromosome or segment). Because the bin names keep increasing while each subarray starts at 0, the code ends up trying to access subarrays with indices that exceed their length. The result is that the objects returned by by_gene() evaluate to None.

What is required is to synchronize each gene's bin names with the indices of the chromosomal subarrays. I have done this by adding two lines to the by_gene() method (lines 4 and 7 below, indicated by comments).
The idea is simply to get the name of the first bin in each chromosomal CopyNumArray, then shift each gene index array to the left by subtracting that value.

    ignore += params.ANTITARGET_ALIASES
    start_idx = end_idx = None
    for _chrom, subgary in self.by_chromosome():
        subgary_start_idx = subgary.data.iloc[0].name  # get the name of the first bin in this chromosome subarray
        prev_idx = 0
        for gene, gene_idx in subgary._get_gene_map().items():
            gene_idx = [idx - subgary_start_idx for idx in gene_idx]  # shift gene_idx left by subgary_start_idx

I've tested this with and without segments on some samples from my dataset. They were sequenced with a small (45 gene) amplicon-based targeted sequencing panel, and using -t 0 -m 0 all genes are output. I cannot speak to datasets with antitargets, as mine does not have these. Hopefully the developers can incorporate a fix in the next release. |
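For anyone trying to picture the mismatch described above, here is a minimal pure-Python sketch. It is a toy model, not CNVkit code: the bins, gene names, and helper functions (`by_chromosome`, `gene_map`, `genes_naive`, `genes_shifted`) are all made up for illustration, and the `i < len(sub)` check stands in for an out-of-range lookup silently coming back empty.

```python
# Toy model: every bin keeps a global label, i.e. its position in the full array.
bins = [
    {"label": 0, "chrom": "chr1", "gene": "GENE_A"},
    {"label": 1, "chrom": "chr1", "gene": "GENE_A"},
    {"label": 2, "chrom": "chr1", "gene": "GENE_B"},
    {"label": 3, "chrom": "chr2", "gene": "GENE_C"},
    {"label": 4, "chrom": "chr2", "gene": "GENE_C"},
    {"label": 5, "chrom": "chr2", "gene": "GENE_D"},
]


def by_chromosome(all_bins):
    """Group bins by chromosome; each bin keeps its original global label."""
    groups = {}
    for b in all_bins:
        groups.setdefault(b["chrom"], []).append(b)
    return groups


def gene_map(sub_bins):
    """Map each gene to the global labels of its bins."""
    mapping = {}
    for b in sub_bins:
        mapping.setdefault(b["gene"], []).append(b["label"])
    return mapping


def genes_naive(all_bins):
    """Use global labels directly as positions in the subarray (the failure mode)."""
    found = {}
    for sub in by_chromosome(all_bins).values():
        for gene, labels in gene_map(sub).items():
            # Labels past the end of the subarray simply match nothing.
            found[gene] = [sub[i]["label"] for i in labels if i < len(sub)]
    return found


def genes_shifted(all_bins):
    """Subtract the subarray's first label so labels line up with positions."""
    found = {}
    for sub in by_chromosome(all_bins).values():
        first = sub[0]["label"]
        for gene, labels in gene_map(sub).items():
            found[gene] = [sub[i - first]["label"] for i in labels]
    return found


naive = genes_naive(bins)
shifted = genes_shifted(bins)
print(naive)
print(shifted)
```

In this toy model the chr1 genes come out the same either way, while the chr2 genes are empty in the naive version and complete once the labels are shifted, which matches the pattern of output stopping after the first chromosome.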
Jumping in to confirm this bug in version 0.9.9.dev0. I was wondering why I only saw events on chromosome 1 when running |
Glad it helped! One thing I am noticing today is that I am not getting any statistics in my output when calling genemetrics with --ci, --sem, --stdev, etc. Would you mind testing that on your end? |
Funny you should ask! Now that I can see calls on all chromosomes, I have a ton of calls and wondering where the stats are to help find outliers. I think something's not quite right with that component either somehow. The other thing that seems odd but may be right is that there is an option for filtering by number of probes ( By the way, I forgot to mention that I've been running this on WES data. So it seems that your patch may work on both targeted and WES data so far. |
From the doc for genemetrics I think bins and probes are the same thing:
I know that in my panel if I run with default settings I only get 43 of 45 genes because 2 genes only have 2 bins each. I'm in the same boat - looking for metrics to refine the set of calls. |
Ahhh! There's a critical detail in that snippet from the docs that I didn't catch before. If you input a segment file, then the option works on the Still can't get the extra stats in the output, though. edit: Had a look through the code, and it looks like those stats have not been implemented yet. I see a TODO item to add those in the main calling script, and then no place for those args to be input in the |
Yikes - I found the same thing but didn't notice the TODO! I just opened an issue but I'll close it. I might take a look at adapting the logic from segmetrics, but that could be a rabbit hole. |
@drmrgd I have what seems to be a viable workaround for the absence of stats in genemetrics. I am running genemetrics output into segmetrics, and from there into call. This enables running call with --ci or --sem filters, as well as the possibility of including tumor purity and BAF info if applicable, as it is in my case.

A word of caution on using both genemetrics and call, though: both attempt to determine gender if it is not explicitly given to those commands, which can result in transformations of the data. In my case I have a diploid X reference. If a sample is detected as male, its X chromosome log2 ratios are increased by 1. The problem is that when genemetrics does this for males, call sees the transformed value and detects those samples as female. Consequently the wrong copy number is reported for X genes (e.g. 2 copies). I counter this by running genemetrics with '--gender f' to avoid X shifting at that stage, but let call detect gender as usual. I've tested it on a small subset of my sample set and the genders are correctly handled. |
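The double-detection problem described above can be shown with a small arithmetic sketch. This is a toy model only, not CNVkit's sex-inference logic: the functions `looks_male` and `shift_if_male` and the -0.5 threshold are assumptions made up for illustration. The only things taken from the comment are the +1 shift for samples detected as male and the diploid X reference.

```python
import math


def looks_male(x_log2, threshold=-0.5):
    """Toy sex guess: a clearly negative chrX log2 suggests a single X copy.

    The threshold is an arbitrary illustration value, not a CNVkit setting.
    """
    return x_log2 < threshold


def shift_if_male(x_log2, is_male):
    """Raise the chrX log2 ratio by 1 when the sample is treated as male."""
    return x_log2 + 1.0 if is_male else x_log2


# One X copy measured against a diploid X reference: log2(1/2) = -1.
raw_x = math.log2(1 / 2)

# Both stages auto-detect: stage 1 shifts the value, stage 2 sees the shifted value.
auto_stage1 = shift_if_male(raw_x, looks_male(raw_x))
auto_stage2_male = looks_male(auto_stage1)

# Workaround: force "female" at stage 1 so no shift happens there.
forced_stage1 = shift_if_male(raw_x, is_male=False)
forced_stage2_male = looks_male(forced_stage1)

print(auto_stage1, auto_stage2_male)
print(forced_stage1, forced_stage2_male)
```

With both stages auto-detecting, the second stage no longer sees a male-looking X, which is the misclassification described. Forcing the first stage to skip the shift leaves the raw value intact for the second stage to interpret.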
@eriktoo That looks like a nice workaround. Thank you! I started trying to code up my own QC metrics and stats to take the place of those missing, but this would be a much better solution. I'll give this a shot when I can break free. Really appreciate all of the help! |
I am using a conda installed cnvkit 0.9.8 environment (and this has also been tested on a conda installed cnvkit 0.9.7 environment, with the same result).
cnvkit.py batch runs successfully, producing a genome-wide .cnr and .cns per sample.
cnvkit.py genemetrics, run with e.g. ..
cnvkit.py genemetrics $cnr -s $cns -t 0.1 -m 3 --output $output_fn
.. also claims to run successfully, printing the output:
However the 108 lines in $output_fn only cover up to chr1 16592382.
E.g. tail $output_fn
The same thing occurs for all samples (up to varying chr1 positions).
Can you shed any light on the problem here? The results are the same regardless of the number of cores provided to the task.
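A quick way to confirm the symptom is to compare which chromosomes appear in the .cnr input against those in the genemetrics output. The sketch below assumes both files are tab-separated with a `chromosome` column; the inline table contents and the helper `chromosomes_in` are made up for illustration, and in practice you would read the real files instead.

```python
import csv
import io


def chromosomes_in(table_text):
    """Return the chromosomes named in a tab-separated table, in first-seen order."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    seen = []
    for row in reader:
        if row["chromosome"] not in seen:
            seen.append(row["chromosome"])
    return seen


# Made-up stand-ins for a .cnr file and a genemetrics output table.
cnr_text = (
    "chromosome\tstart\tend\tgene\tlog2\n"
    "chr1\t100\t200\tGENE_A\t0.10\n"
    "chr1\t300\t400\tGENE_B\t-0.80\n"
    "chr2\t100\t200\tGENE_C\t0.90\n"
)
genemetrics_text = (
    "gene\tchromosome\tstart\tend\tlog2\n"
    "GENE_B\tchr1\t300\t400\t-0.80\n"
)

in_cnr = chromosomes_in(cnr_text)
in_output = chromosomes_in(genemetrics_text)
missing = [c for c in in_cnr if c not in in_output]
print("chromosomes missing from genemetrics output:", missing)
```

A missing chromosome is only a hint, since a chromosome can legitimately be absent when nothing on it passes the -t and -m thresholds. Rerunning with -t 0 helps separate the two cases.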