Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LOH / state4,cn=2 is never called #92

Closed
Nicolai-vKuegelgen opened this issue Sep 26, 2022 · 12 comments
Closed

LOH / state4,cn=2 is never called #92

Nicolai-vKuegelgen opened this issue Sep 26, 2022 · 12 comments

Comments

@Nicolai-vKuegelgen
Copy link

I am currently trying to set up a (mostly) automated CNV calling pipeline including PennCNV. Installation (from source) worked without issue as does running the program itself.

However, when investigating the results for a large set of samples (>150), I never see even a single detected loss of heterozygostiy (LOH) state, which should have cn=2 (state4 in the hmm model afaik). I am working with data has been analysed before (manually with GenomeStudio), I mostly know what to expect (and gain/loss CNVs are matching these expectations).

This is the command I'm using:
PennCNV-1.0.5/detect_cnv.pl -test -hmm PennCNV-1.0.5/lib/hhall.hmm -pfb compiled-samples.pfb array-data-tsv/*.tsv -out array-results/full_analysis_run.txt -log array-results/full_analysis_run.log -confidence

I've created my pfb file by running compile_pfb.pl on all samples. I imagine that this could maybe be an issue, as the samples are all from celllines, several of which of which might be derived from another. So overall heterogeneity between samples is likely lower than usual for a dataset of comparable size.

@kaichop
Copy link
Contributor

kaichop commented Sep 26, 2022 via email

@Nicolai-vKuegelgen
Copy link
Author

Thanks for that help. Sadly it did not work yet.

I've changed this line in the hmm file:

B1_mean:
-3.527211 -0.664184 0.000000 100.000000 0.395621 0.678345 

To this:

B1_mean:
-3.527211 -0.664184 0.000000 0.000000 0.395621 0.678345

Is there anything else I need to do? So far I still don't get any LOH calls.

@kaichop
Copy link
Contributor

kaichop commented Oct 4, 2022

can you show the actual output (LOG and WARNING message for example) after running the command to help diagnose the issue.

@Nicolai-vKuegelgen
Copy link
Author

This is what the log file shows for a single sample (there is little difference between samples, apart from exact numbers, some samples have a GC-waviness warning).

The 'WARNING: Unrecognizable LRR or BAF values are treated as zero:' line is repeated quite often when I'm using unfiltered input files, but even if I pre-filter the input I still don't get LOH calls.

NOTICE: All program notification/warning messages that appear in STDERR will be also written to log file array-results/full_analysis_run_loh.log
NOTICE: Reading marker coordinates and population frequency of B allele (PFB) from test.pfb ... Done with 677378 records (2888 records in chr 0,MT,XY were discarded)
NOTICE: Reading LRR and BAF values for from array-data-tsv/206674130142/206674130142_R11C01.tsv ...WARNING: Unrecognizable LRR or BAF values are treated as zero: LRR=-nan BAF=-nan
WARNING: Unrecognizable LRR or BAF values are treated as zero: LRR=-nan BAF=-nan
[...]
WARNING: Unrecognizable LRR or BAF values are treated as zero: LRR=-nan BAF=-nan
 Done with 677378 records in 24 chromosomes (52681 records are discarded due to lack of PFB information for the markers)
NOTICE: Data from chromosome X,Y will not be used in analysis
NOTICE: Median-adjusting LRR values for all autosome markers from array-data-tsv/206674130142/206674130142_R11C01.tsv by -0.0015
NOTICE: Median-adjusting BAF values for all autosome markers from array-data-tsv/206674130142/206674130142_R11C01.tsv by -0.0311
NOTICE: quality summary for array-data-tsv/206674130142/206674130142_R11C01.tsv: LRR_mean=0.0121 LRR_median=0.0000 LRR_SD=0.1472 BAF_mean=0.4991 BAF_median=0.5000 BAF_SD=0.0393 BAF_DRIFT=0.000288 WF=-0.0183 GCWF=-0.0108

@kaichop
Copy link
Contributor

kaichop commented Oct 5, 2022

I do not see anything wrong in the LOG information, except the 52K records that are discarded. If you do not mind just emailing one example file (together with your PFB), I can do some diagnosis to see what is wrong. Also you may want to check whether your PFB file contains allele frequency spectrum from 0 to 1 (it is fine to have lower heterozygosity than typical sample collections), rather than being only 0 or 1.

@Nicolai-vKuegelgen
Copy link
Author

The discarded records probes that don't have usable information - I will likely filter these out in the future before running analysis. The PFB file also looks proper from what I can tell (pfb values are continous between 0 and 1)

If you could look into this issue, that would be very much appreciated, I've checked and there should be no issue with sending you one of the files. I'll assume I can send the files to your chop.edu email?

@Nicolai-vKuegelgen
Copy link
Author

Hey Kai,

I hope you got my email with the example files, if not please let me know so I can re-send it.

@kaichop
Copy link
Contributor

kaichop commented Oct 21, 2022 via email

@Nicolai-vKuegelgen
Copy link
Author

Since I already had an drive (like) link in my last email and the mail is censoered in comments here, I've now resend it to you gmail adress (from the github commit logs).
Thanks!

@kaichop
Copy link
Contributor

kaichop commented Oct 26, 2022 via email

@kaichop
Copy link
Contributor

kaichop commented Oct 27, 2022

I was able to generate LOH calls after editing 100 to 0 in the hhall.hmm file and after adding the "-loh" argument in command line. I forgot to mention the latter requirement.

./detect_cnv.pl -test -hmm lib/hhall.hmm -pfb /mounted/test.pfb /mounted/204362030005_R07C01.signal.txt -gcmodel test.gcmodel -loh

chr12:95690404-95944219 numsnp=113 length=253,816 state4,cn=2 /mounted/204362030005_R07C01.signal.txt startsnp=rs2431014 endsnp=GSA-rs7486703
chr13:56535612-58592616 numsnp=212 length=2,057,005 state4,cn=2 /mounted/204362030005_R07C01.signal.txt startsnp=GSA-rs189908706 endsnp=rs35573682
chr13:84478342-85583979 numsnp=176 length=1,105,638 state4,cn=2 /mounted/204362030005_R07C01.signal.txt startsnp=rs9546709 endsnp=GSA-rs76945249
chr22:41209635-42122422 numsnp=295 length=912,788 state4,cn=2 /mounted/204362030005_R07C01.signal.txt startsnp=GSA-rs139438 endsnp=rs764481

I noted that there are lots of negative PFB values, which cannot be right. It should be made to positive. I am not sure how it happened but perhaps there are negative BAF values in your input files that were used to generate the PFB?

Also this particular sample is very "wavy" which you can see from the figure below. You can consider using gcmodel to adjust for waviness.

204362030005_R07C01 signal txt chr3 110110185

@Nicolai-vKuegelgen
Copy link
Author

Thanks !

The missing -loh option must have been the reason why I didn't get any LOH calls.

I've also noticed that several of the samples I have are quity wavy, but I haven't gotten around to making the GCmodel files yet. The negative PFB values probably do come from negative BAF values (I didn't use genomestudio which apparently automatcially restricts BAF to [0,1] while the underlying algorythm doesn't).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants