Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

genemetrics completing without error part way through chromosome one #576

Closed
hrkemp opened this issue Mar 2, 2021 · 13 comments · Fixed by #580
Closed

genemetrics completing without error part way through chromosome one #576

hrkemp opened this issue Mar 2, 2021 · 13 comments · Fixed by #580

Comments

@hrkemp
Copy link

hrkemp commented Mar 2, 2021

I am using a conda installed cnvkit 0.9.8 environment (and this has also been tested on a conda installed cnvkit 0.9.7 environment, with the same result).

cnvkit.pl batch runs successfully, producing a genome-wide .cnr and .cns per sample.

cnvkit.pl genemetrics ran with e.g. ..

cnvkit.py genemetrics $cnr -s $cns -t 0.1 -m 3 --output $output_fn

.. also claims to run successfully, printing the output:

Treating sample [ $ID ] as female
Found 108 gene-level gains and losses
Wrote [ $output_fn ]

However the 108 lines in $output_fn only cover up to chr1 16592382.

E.g. tail $output_fn

SRM	chr1	15944260	15947322	-0.254386	349.365	5.9755	-0.250492	-0.259758	7	1660.24	2111
EXOSC10	chr1	15948134	16029878	-0.254386	365.13	20.4798	-0.250492	-0.259758	27	1660.24	2111
MTOR	chr1	16029943	16217204	-0.254386	391.329	55.5135	-0.250492	-0.259758	73	1660.24	2111
UBIAD1	chr1	16229477	16250183	-0.254386	339.729	5.65058	-0.250492	-0.259758	7	1660.24	2111
DISP3	chr1	16250969	16404047	-0.254386	313.915	23.4543	-0.250492	-0.259758	29	1660.24	2111
FBXO2	chr1	16408865	16410975	-0.254386	268.668	2.77106	-0.250492	-0.259758	5	1660.24	2111
FBXO44	chr1	16411475	16428079	-0.254386	418.085	4.39503	-0.250492	-0.259758	6	1660.24	2111
FBXO6	chr1	16431081	16450042	-0.254386	565.949	5.6832	-0.250492	-0.259758	7	1660.24	2111
MAD2L2	chr1	16450047	16535372	-0.254386	255.099	6.02192	-0.250492	-0.259758	8	1660.24	2111
DRAXIN	chr1	16536708	16592382	-0.254386	506.403	7.49921	-0.250492	-0.259758	12	1660.24	2111

The same thing occurs for all samples (up to varying chr1 positions).

Can you shed any light on the problem here? The results are the same regardless of the number of cores provided to the task.

@eriktoo
Copy link

eriktoo commented Mar 11, 2021

I can confirm this behavior. When I filter my .cnr files to remove chr1 genemetrics outputs only chr2. Have tested with -t down to 0.

@eriktoo
Copy link

eriktoo commented Mar 12, 2021

I have tracked this issue as far as reports.py so far.

In 'group_by_genes': 'not rows' in

if not rows or gene in ignore: continue

evaluates True for all genes after the first Antitarget entry in the cnarr.

If I print the rows which do exist, each Antitarget seems to be associated with the row of the gene immediately following it in the cnarr.

To illustrate I added the following print statements to the beginning of the for loop in 'group_by_genes()':

for gene, rows in cnarr.by_gene(): print(gene, rows, "Gene in ignore: " + str(gene in ignore), "Not rows: " + (str(not rows))) if rows: print(rows[0])

Below is a selection of the output. The first two genes in the ouput are handled correctly, but Antitarget contains the row for the first gene in chr2. The genes in chr2 have null rows, and the next Antitarget has the row for the first gene in chr3. This pattern continues.

MDM4 <cnvlib.cnary.CopyNumArray object at 0x7fd16fc96b80> Gene in ignore: False Not rows: False
chromosome chr1
start 204494631
end 204494781
gene MDM4
log2 -1.21488
depth 714.92
weight 0.944219
baf NaN
Name: 114, dtype: object
Antitarget <cnvlib.cnary.CopyNumArray object at 0x7fd16fc9c3a0> Gene in ignore: True Not rows: False
chromosome chr2
start 25457135
end 25457260
gene DNMT3A
log2 -0.084103
depth 1292.74
weight 0.932481
baf NaN
Name: 126, dtype: object
DNMT3A <cnvlib.cnary.CopyNumArray object at 0x7fd16fc9c4c0> Gene in ignore: False Not rows: True
STAT1 <cnvlib.cnary.CopyNumArray object at 0x7fd16fc9c3a0> Gene in ignore: False Not rows: True
CTLA4 <cnvlib.cnary.CopyNumArray object at 0x7fd16fc9c4c0> Gene in ignore: False Not rows: True
ACADL <cnvlib.cnary.CopyNumArray object at 0x7fd16fc9c3a0> Gene in ignore: False Not rows: True
GIGYF2 <cnvlib.cnary.CopyNumArray object at 0x7fd16fc9c4c0> Gene in ignore: False Not rows: True
Antitarget <cnvlib.cnary.CopyNumArray object at 0x7fd16fc87280> Gene in ignore: True Not rows: False
chromosome chr3
start 47058484
end 47058776
gene SETD2
log2 -0.130924
depth 2624.51
weight 0.958737
baf NaN
Name: 207, dtype: object
SETD2 <cnvlib.cnary.CopyNumArray object at 0x7fd16fc0a070> Gene in ignore: False Not rows: True
PIK3CA <cnvlib.cnary.CopyNumArray object at 0x7fd16fc87280> Gene in ignore: False Not rows: True
TP63 <cnvlib.cnary.CopyNumArray object at 0x7fd16fc0a070> Gene in ignore: False Not rows: True
Antitarget <cnvlib.cnary.CopyNumArray object at 0x7fd16fc0a760> Gene in ignore: True Not rows: False
chromosome chr4
start 55124909
end 55125032
gene PDGFRA
log2 -1.25353
depth 1220.48
weight 0.960569
baf NaN
Name: 280, dtype: object
PDGFRA <cnvlib.cnary.CopyNumArray object at 0x7fd16fc9c4c0> Gene in ignore: False Not rows: True
KDR <cnvlib.cnary.CopyNumArray object at 0x7fd16fc0a760> Gene in ignore: False Not rows: True
SMAD1 <cnvlib.cnary.CopyNumArray object at 0x7fd16fc9c4c0> Gene in ignore: False Not rows: True
Antitarget <cnvlib.cnary.CopyNumArray object at 0x7fd16fb72e20> Gene in ignore: True Not rows: False
chromosome chr5
start 36953781
end 36953895
gene NIPBL
log2 -1.08932
depth 951.912
weight 0.950597
baf NaN

@eriktoo
Copy link

eriktoo commented Mar 15, 2021

I believe I have located the problem in cnary.py.

Calling the CopyNumArray.by_gene() method of an instance of CopyNumArray yields:

Pairs of: (gene name, CNA of rows with same name)

When genemetrics runs without segments the CopyNumArray which calls by_gene() contains all bins for all chromosomes and by_genes() iterates over this by chromosome. When genemetrics runs with segments by_genes() is called individually by CopyNumArrays representing only that segment. This also means that each of these arrays contains only bins from one chromosome and thus by_genes() only runs for one iteration each time.

In both cases each bins in the CopyNumArray is assigned a name in the form of an integer from 0 to n-1 where n is the total number of bins across all chromosomes. In the case of a CopyNumArray for a segment the bins might have a names ranging from [a:b] such that 0 <= a < b <= n-1.

All bins are thus numbered absolutely with reference to the total number of bins, even if a subset of bins are extracted from the total set. This occurs when by_gene() iterate over each chromosome/segment: a sub-array is created containing only the bins for that chromosome/segment. by_gene() then extracts from this subarray the name of each gene and a list containing the names of the bins associated with that gene. It uses these to yield only the bins associated with that gene.

The problem that arises is that CopyNumArrays are accessed like typical arrays, with indices from [0:n-1]. When the whole CopyNumArray is considered the bin names coincide with their indices, but when a subarray is extracted its indices start at 0 and the gene bin names are out of sync (except in the case of the first chromosome or segment). Because the bin names are continually increasing but each subarray starts at 0 this results in trying to access subarrays with indices that exceed their length. The result is that the objects returned by by_gene() evaluate to None.

What is required is to synchronize each gene's bin names with the indices of the chromosomal subarrays. I have done this by adding two lines to the by_gene() method (lines 4 and 7 below indicated by comments). The idea simply to get the name of the first bin in each chromosomal CopyNumArray, then shift each gene index array to the left by subtracting that value.

        ignore += params.ANTITARGET_ALIASES
        start_idx = end_idx = None
        for _chrom, subgary in self.by_chromosome():
            subgary_start_idx = subgary.data.iloc[0].name # get the name of the first bin in this chromosome subarray
            prev_idx = 0
            for gene, gene_idx in subgary._get_gene_map().items():
                gene_idx = [idx - subgary_start_idx for idx in gene_idx] # shift gene_idx left by subgary_start_idx

I've tested this with and without segments on some samples from my dataset. They were sequenced with a small (45 gene) amplicon based targeted sequencing panel, and using -t 0 -m 0 all genes are output. I cannot speak to datasets with antitargets, as mine does not have these.

Hopefully the developers can incorporate a fix in the next release.

@drmrgd
Copy link

drmrgd commented Mar 16, 2021

Jumping in to confirm this bug in version 0.9.9.dev0. I was wondering why I only saw events on chromosome 1 when running genemetrics. I applied the suggested fix and am now seeing calls on all chromosomes. Thanks for posting this solution!

@eriktoo
Copy link

eriktoo commented Mar 16, 2021

Jumping in to confirm this bug in version 0.9.9.dev0. I was wondering why I only saw events on chromosome 1 when running genemetrics. I applied the suggested fix and am now seeing calls on all chromosomes. Thanks for posting this solution!

Glad it helped!

One thing I am noticing today is that I am not getting any statistics in my output when calling genemetrics with --ci, --sem, --stdev, etc. Would you mind testing that on your end?

@drmrgd
Copy link

drmrgd commented Mar 16, 2021

Funny you should ask! Now that I can see calls on all chromosomes, I have a ton of calls and wondering where the stats are to help find outliers. I think something's not quite right with that component either somehow.

The other thing that seems odd but may be right is that there is an option for filtering by number of probes (-m), though that doesn't seem to be in the output. There is a number of bins column, though, which is not affected by the --min-probes option. That may be expected, though?

By the way, I forgot to mention that I've been running this on WES data. So it seems that your patch may work on both targeted and WES data so far.

@eriktoo
Copy link

eriktoo commented Mar 16, 2021

From the doc for genemetrics I think bins and probes are the same thing:

Some likely false positives can be eliminated by dropping CNVs that cover a small number of bins, at the risk of missing some true positives. With -m 3, the default, genes where only 1 or 2 bins show copy number change will not be reported. This applies to the segment’s bin count (seg_probes) if segments are provided with -s, otherwise it’s the gene’s bin count (nbins).

I know that in my panel if I run with default settings I only get 43 of 45 genes because 2 genes only have 2 bins each.

I'm in the same boat - looking for metrics to refine the set of calls.

@drmrgd
Copy link

drmrgd commented Mar 16, 2021

Ahhh! There's a critical detail in that snippet from the docs that I didn't catch before. If you input a segment file, then the option works on the seg_probes column, else it works on n_bins. I've always been running with the segmentation file, which is probably why I'm still seeing n_bins sizes less than that threshold being reported. Thanks for pointing that out!

Still can't get the extra stats in the output, though.

edit: Had a look through the code, and it looks like those stats have not been implemented yet. I see a TODO item to add those in the main calling script, and then no place for those args to be input in the do_genemetrics method of reports.py where the data are generated. So, I think that may be why they don't show up in the output.

@eriktoo
Copy link

eriktoo commented Mar 16, 2021

Yikes - I found the same thing but didn't notice the TODO! I just opened an issue but I'll close it. I might take a look at adapting the logic from segmetrics, but that could be a rabbit hole.

@eriktoo
Copy link

eriktoo commented Mar 18, 2021

@drmrgd I have what seems to be a viable workaround for the absence of stats in genemetrics. I am running genemetrics output into segmetrics, and from there into call. This enables running call with --ci or --sem filters, as well as the possibility of including tumor purity and BAF info if applicable, as it is in my case.

A word of caution on using both genemetrics and call, though, as both attempt to determine gender if not explicitly given to those commands which can result in transformations of the data. In my case I have a diploidX reference. If a sample is detected as male its X chromosome log2 ratios are increased by 1. The problem is that when genemetrics does this for males, call sees the transformed value and detects those samples as female. Consequently the wrong copy number is reported for X genes (e.g. 2 copies).

I counter this by running genemetrics with '--gender f' to avoid X shifting at that stage, but let call detect gender per usual. I've tested it on a small subset of my sample set and the genders are correctly handled.

@drmrgd
Copy link

drmrgd commented Mar 19, 2021

@eriktoo That looks like a nice workaround. Thank you! I started trying code out my own QC metrics and stats to take the place of those missing, but this would be a much better solution. I'll give this a shot when I can break free. Really appreciate all of the help!

@tskir
Copy link
Collaborator

tskir commented Mar 19, 2021

@eriktoo @drmrgd Just a heads up that this issue will be closed once the fix in #580 is merged; discussing the implementation of genemetrics stats output is better scoped for #577.

By the way, thank you both for reporting and commenting!

@eriktoo
Copy link

eriktoo commented Mar 19, 2021

@tskir Thanks for the heads up. I have added the workaround as a comment to #577 and referenced it at the top of the issue.

Glad to help! Thanks to you and the team for devoting time and energy to this project!

@etal etal closed this as completed in #580 Apr 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants