Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to run with --cfrac 0.8 #3

Open
sicotte opened this issue Mar 31, 2014 · 11 comments
Open

How to run with --cfrac 0.8 #3

sicotte opened this issue Mar 31, 2014 · 11 comments

Comments

@sicotte
Copy link

sicotte commented Mar 31, 2014

When I specify the input purity, it forces the "cell" option.
The error message says

You need more than 8 data samples. The number of data points must satisfy the relation N > 2xK**2 where K is the number of clusters. The smallest value for K is 2.
at /data5/sicotte/tools/WaveCNV-caller/bin/cnv_caller.pl line 5204.

In my version, line 5204 is
my ($ids, $centers) = $kmeans->kmeans();

I do have multiple samples.. so how do I specify the input files to get this to work.

bin/cnv_caller.pl $seg $vcf --sid s_tumor.100 --bam_list $tbam --rid s_normal.100 --rbam_list $nbam --lfrag 200 --merge --smooth --tmp $thisdir/tmp.100 --cfrac 0.8

@carsonhh
Copy link
Contributor

Is this by any chance exome data? The error is basically caused by there
not being enough variants to make needed global calculations work.

--Carson

From: sicotte notifications@github.com
Reply-To: WaveCNV/WaveCNV-caller
<reply+i-30547990-b7107839b5ff9e4e96ef2e7d590e5e3f07d88552-4742143@reply.git
hub.com>
Date: Monday, March 31, 2014 at 2:47 PM
To: WaveCNV/WaveCNV-caller WaveCNV-caller@noreply.github.com
Subject: [WaveCNV-caller] How to run with --cfrac 0.8 (#3)

When I specify the input purity, it forces the "cell" option.
The error message says

You need more than 8 data samples. The number of data points must satisfy
the relation N > 2xK**2 where K is the number of clusters. The smallest
value for K is 2.
at /data5/sicotte/tools/WaveCNV-caller/bin/cnv_caller.pl line 5204.

In my version, line 5204 is
my ($ids, $centers) = $kmeans->kmeans();

I do have multiple samples.. so how do I specify the input files to get this
to work.

bin/cnv_caller.pl $seg $vcf --sid s_tumor.100 --bam_list $tbam --rid
s_normal.100 --rbam_list $nbam --lfrag 200 --merge --smooth --tmp
$thisdir/tmp.100 --cfrac 0.8


Reply to this email directly or view it on GitHub
#3 .

@sicotte
Copy link
Author

sicotte commented Mar 31, 2014

It's 30X whole genome called using GATK. There are ~600-1.8M variants/sample.
I commented out the temp file.. and it only puts 5 segments in the file.
I'm trying to find where in
estimate_copy_fraction and discovery_segments functions
the segments are filtered to see if there is a cutoff I can change.

@carsonhh
Copy link
Contributor

You may have issues with your sample IDs. They need to be identical with VCF
and BAM files you are using. A mismatch could cause MAF values to all be
empty for example.

Also whole genome would normally have 2-3M variants per sample. Do you
really have samples with less than 600,000?

--Carson

From: sicotte notifications@github.com
Reply-To: WaveCNV/WaveCNV-caller
<reply+i-30547990-b7107839b5ff9e4e96ef2e7d590e5e3f07d88552-4742143@reply.git
hub.com>
Date: Monday, March 31, 2014 at 3:13 PM
To: WaveCNV/WaveCNV-caller WaveCNV-caller@noreply.github.com
Cc: Carson Holt carsonhh@gmail.com
Subject: Re: [WaveCNV-caller] How to run with --cfrac 0.8 (#3)

It's 30X whole genome called using GATK. There are ~600-1.8M
variants/sample.
I commented out the temp file.. and it only puts 5 segments in the file.
I'm trying to find where in estimate_copy_fraction, the segments are
filtered to see if there is a cutoff I can change.


Reply to this email directly or view it on GitHub
#3 (comment) .

@sicotte
Copy link
Author

sicotte commented Mar 31, 2014

Those 5 segments have higher than normal coverage.
I checked that the bam files and vcf sampleids were the same.. and that the chromosome numbering of bam,vcf, and segments were all in "chr" format. The bam file was from bwa.

Oh, it requires 2MB segments. Got 275 and 461 in tumor and germline.
The variants are only called if they are in both the germline and tumor..
Yeah, I was surprised to only have ~600K variants so I double-checked the log files.

I created a new variable to define the minimum number of variants per segment.. (to 200 instead of the 1000 that was hardcoded in discovery_segments) .. and that worked.

@carsonhh
Copy link
Contributor

I think your variant file might have been filtered to have somatic variants
only which is not the right thing to do for CNV analysis. While it makes
sense for SNV prioritization, having filters such as somatic filtering and
dbSNP filtering remove useful information that is important for finding
patterns in MAF, LOH, etc.

Thanks,
Carson

From: sicotte notifications@github.com
Reply-To: WaveCNV/WaveCNV-caller
<reply+i-30547990-b7107839b5ff9e4e96ef2e7d590e5e3f07d88552-4742143@reply.git
hub.com>
Date: Monday, March 31, 2014 at 3:31 PM
To: WaveCNV/WaveCNV-caller WaveCNV-caller@noreply.github.com
Cc: Carson Holt carsonhh@gmail.com
Subject: Re: [WaveCNV-caller] How to run with --cfrac 0.8 (#3)

Those 5 segments have higher than normal coverage.
I checked that the bam files and vcf sampleids were the same.. and that the
chromosome numbering of bam,vcf, and segments were all in "chr" format. The
bam file was from bwa.

Oh, it requires 2MB segments. Got 275 and 461 in tumor and germline.
The variants are only called if they are in both the germline and tumor..
Yeah, I was surprised too.. so I double-checked the log files.

I created a new variable to define the minimum number of variants per
segment.. (to 200 instead of the 1000 that was hardcoded in
discovery_segments) .. and that worked.


Reply to this email directly or view it on GitHub
#3 (comment) .

@sicotte
Copy link
Author

sicotte commented Apr 1, 2014

You were right. I’ll need to recall the VCF, there are no SNV which have the same genotype in tumor and germline.

Nevertheless, there are still a couple of things that could be made to make the script more robust.

I got the two errors.

Use of uninitialized value $cmp in numeric eq (==) at /data5/sicotte/tools/WaveCNV-caller/bin/cnv_caller.pl line 2241.
Use of uninitialized value $cmp in numeric eq (==) at /data5/sicotte/tools/WaveCNV-caller/bin/cnv_caller.pl line 2247.

Sort subroutine didn't return a numeric value at /data5/sicotte/tools/WaveCNV-caller/bin/cnv_caller.pl line 4866.

From: carsonhh [mailto:notifications@github.com]
Sent: Monday, March 31, 2014 4:53 PM
To: WaveCNV/WaveCNV-caller
Cc: Sicotte, Hugues, Ph.D.
Subject: Re: [WaveCNV-caller] How to run with --cfrac 0.8 (#3)

I think your variant file might have been filtered to have somatic variants
only which is not the right thing to do for CNV analysis. While it makes
sense for SNV prioritization, having filters such as somatic filtering and
dbSNP filtering remove useful information that is important for finding
patterns in MAF, LOH, etc.

Thanks,
Carson

From: sicotte <notifications@github.commailto:notifications@github.com>
Reply-To: WaveCNV/WaveCNV-caller
<reply+i-30547990-b7107839b5ff9e4e96ef2e7d590e5e3f07d88552-4742143@reply.git
hub.com>
Date: Monday, March 31, 2014 at 3:31 PM
To: WaveCNV/WaveCNV-caller <WaveCNV-caller@noreply.github.commailto:WaveCNV-caller@noreply.github.com>
Cc: Carson Holt <carsonhh@gmail.commailto:carsonhh@gmail.com>
Subject: Re: [WaveCNV-caller] How to run with --cfrac 0.8 (#3)

Those 5 segments have higher than normal coverage.
I checked that the bam files and vcf sampleids were the same.. and that the
chromosome numbering of bam,vcf, and segments were all in "chr" format. The
bam file was from bwa.

Oh, it requires 2MB segments. Got 275 and 461 in tumor and germline.
The variants are only called if they are in both the germline and tumor..
Yeah, I was surprised too.. so I double-checked the log files.

I created a new variable to define the minimum number of variants per
segment.. (to 200 instead of the 1000 that was hardcoded in
discovery_segments) .. and that worked.


Reply to this email directly or view it on GitHub
#3 (comment) .


Reply to this email directly or view it on GitHubhttps://github.com//issues/3#issuecomment-39146914.

@carsonhh
Copy link
Contributor

carsonhh commented Apr 1, 2014

It's one of those things where I really just want to have more informative error messages as opposed to building in mechanisms that allow the code to continue beyond those failures.

@sicotte
Copy link
Author

sicotte commented Apr 6, 2014

After recalling variants with GATK (and aside from the --fasta bug I reported), I was able to run WaveCNV.pl ..

@carsonhh
Copy link
Contributor

carsonhh commented Apr 7, 2014

Good to know. I'll look at those other issues.

@sicotte
Copy link
Author

sicotte commented Apr 7, 2014

3 feedbacks:
1.
In the “cell” mode, it roughtly gets the (known) purity in 19 out of 25 cases, the other times it’s way off. The linear model fit to maf doesn’t account for subclones which can really mess up the purity estimate even after you take CNV into accounts (Subclone-specific mutations drag the purity estimate down). We’re working on an update to purBayes to better compute purity taking subclones into account.

I notice the code always output an estimated purity as in “cell” mode.
Does this affect the results? How is that estimate used?

  1. Do you have some documentation on the meaning of the output fields.

I had one sample that kept not finishing without any error message.. just the
Log file would get truncated: … e.g.
[Mon Apr 7 09:59:42 2014] Processing: 8%

When I bumped the RAM from 15G to 20G then 31, it went further
If I didn’t crank the RAM, it went no further.
However, I am not convinced it’s a memory issue since the
maxvmem usage in the run that worked said it only used up to 1.15G
Maybe it’s an issue with the memory mapping library (MMapArray) WaveCNV needs (I had to fix a bug (which I reported) in the current version in CPAN before it would compile) when compiling one Perl module that was needed.

From: carsonhh [mailto:notifications@github.com]
Sent: Monday, April 07, 2014 9:13 AM
To: WaveCNV/WaveCNV-caller
Cc: Sicotte, Hugues, Ph.D.
Subject: Re: [WaveCNV-caller] How to run with --cfrac 0.8 (#3)

Good to know. I'll look at those other issues.


Reply to this email directly or view it on GitHubhttps://github.com//issues/3#issuecomment-39734550.

@carsonhh
Copy link
Contributor

carsonhh commented Apr 7, 2014

I've found that most of the time the cell mode value is different than the
lab estimated purity, the lab estimate is wrong. At least this was true in
pancreatic cancer where KRAS is used to estimate purity. Unfortunately all
the lab tests assume diploid KRAS, and often it was triploid, tetraploid, or
more (there is a selective advantage for KRAS duplication). We fortunately
had cell lines in addition to primary tumor in those cases, to validate
which calls were more correct. Also the number I give is more correctly
called copy fraction than purity or cellularity (it is the ratio of reads
representing 1 copy in the tumor compared to 1 copy in the normal
contamination - if a tumor is triploid etc. it will not be proportional to
cellularity as that is a cell-to-cell ratio not a copy-to-copy ratio). I'm
not saying that the value you get with the -cell flag is always right, but
lab estimates that use specific genes or markers always way under estimate
contamination when there is a ploidy issue with a marker. In general sub
clones usually aren't different enough or abundant enough to throw off copy
fraction estimates. We've actually been able to use regions that vary from
expected MAF given estimated copy fraction to identify different sub clones
that dominate in xenograft passages (thus allowing us to validate the % of
each sub clone in the primary). In our samples, it was primarily 1 copy LOH
in one subclone mixed with 2 copy LOH in the other that cause the biggest
variance. These types of regions were also the most frequent type of sub
clonal CNV-LOH difference we saw (but still sub clones overwhelmingly
matched in more regions than they differed on 20 samples).

The copy fraction value affects the expected MAF expect at different CN
states.

No documentation is available other than the WaveCNV publication and
supplementary material. I plan on going back to update the code sometime in
the next couple of months. There are a number of outstanding issues
(basically I was trying to get the paper out before moving to a new
institution, so it's not as robust, feature rich, or as well documented as
I'd like it to be).

If one of your samples has a lot of segments (100,000+), it can fail because
I implement some things using berkley DB to make the script restartable, and
it fails frequently when the number of entries in the DB is too high. That
might be the issue you are experiencing.

--Carson

From: sicotte notifications@github.com
Reply-To: WaveCNV/WaveCNV-caller
<reply+i-30547990-b7107839b5ff9e4e96ef2e7d590e5e3f07d88552-4742143@reply.git
hub.com>
Date: Monday, April 7, 2014 at 9:40 AM
To: WaveCNV/WaveCNV-caller WaveCNV-caller@noreply.github.com
Cc: Carson Holt carsonhh@gmail.com
Subject: Re: [WaveCNV-caller] How to run with --cfrac 0.8 (#3)

3 feedbacks:
1.
In the “cell” mode, it roughtly gets the (known) purity in 19 out of 25
cases, the other times it’s way off. The linear model fit to maf doesn’t
account for subclones which can really mess up the purity estimate even
after you take CNV into accounts (Subclone-specific mutations drag the
purity estimate down). We’re working on an update to purBayes to better
compute purity taking subclones into account.

I notice the code always output an estimated purity as in “cell” mode.
Does this affect the results? How is that estimate used?

  1. Do you have some documentation on the meaning of the output fields.
  2. I had one sample that kept not finishing without any error message.. just
    the
    Log file would get truncated: … e.g.
    [Mon Apr 7 09:59:42 2014] Processing: 8%

When I bumped the RAM from 15G to 20G then 31, it went further
If I didn’t crank the RAM, it went no further.
However, I am not convinced it’s a memory issue since the
maxvmem usage in the run that worked said it only used up to 1.15G
Maybe it’s an issue with the memory mapping library (MMapArray) WaveCNV
needs (I had to fix a bug (which I reported) in the current version in CPAN
before it would compile) when compiling one Perl module that was needed.

From: carsonhh [mailto:notifications@github.com]
Sent: Monday, April 07, 2014 9:13 AM
To: WaveCNV/WaveCNV-caller
Cc: Sicotte, Hugues, Ph.D.
Subject: Re: [WaveCNV-caller] How to run with --cfrac 0.8 (#3)

Good to know. I'll look at those other issues.


Reply to this email directly or view it on
GitHub<#3 (comment)
550>.


Reply to this email directly or view it on GitHub
#3 (comment) .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants