-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Interpreting methylation haplotypes #5
Comments
Hi Chris,
I have not used IGV in bisulfite CG mode before, so I'm a bit confused. I
have used IGV in normal mode, but C (blue) and T (red) on the forward
strand informs the methylation for the CpG positions while G (orange) and A
(green) informs the methylation for the reverse strand. In the first IGV
example, all of the forward reads where correctly turned into haplotypes:
first haplotype is from the second read from the top, second haplotype is
from combining the PE reads which are third and fifth from the top. But
the last haplotypes were interpreted incorrectly. This is known problem
with the bam2cghap_v1.sh code which expects a certain SAM flag for reverse
strand reads to be considered as reverse strand. I suggest to just use
bam2cghap.sh which is the latest version that I know works for BisMark and
BisReadMapper bam files or similar.
In the second IGV example, bam2cghap_v1.sh was run in WGBS mode, when
CCT.bam is from an RRBS library. In WGBS mode bam2cghap_v1.sh will try to
remove PCR duplicates based only on the starting map position of the reads
therefore, most of those reads were considered duplicates. Also
bam2cghap_v1.sh might be having trouble with the SAM flags here too. It
looks like three random reads were chosen for each unique starting map
position and converted to haplotypes in the output. Let me try to run the
code and figure out what happened in the IGV plot for this region. Note
that in the latest version of bam2cghap.sh, we took away the remove PCR
duplicates since they are better removed with samtools rmdup program.
Best,
Dinh
…On Fri, Jan 19, 2018 at 1:23 PM, Chris Lo ***@***.***> wrote:
Hi Dinh,
I'm running your tool to extract methylation haplotypes from bam files,
and I'm currently comparing the reads displayed in IGV with the methylation
haplotypes generated from your tool. I'm seeing discrepancies, and I'm
wondering if you can help me interpret the results.
My bam was aligned by bsmap, and the following command was used to
generate the haplotype file:
sh scripts/bam2cghap_v1.sh WGBS ng.3805/MHBS.txt
/allcpg/hg19.fa.allcpgs.txt.gz test.bam test
I compared the output with IGV:
[image: example]
<https://user-images.githubusercontent.com/17771008/35169087-2ef1157e-fd29-11e7-9a12-a504b5194dd9.png>
The screenshot is from IGV's bisulfite CG mode, in which everything in
blue are bases that are not methylated at CG motifs, referring to T's in
your haplotype block output. If a base is methylated, it would be colored
red, referring to C's in your haplotype block output. I have added the
genomic coordinates right below the reads in IGV, and below the screenshot
is the output from bam2cghap_v1.sh
Something isn't matching up: we don't see any reads with C's in IGV, and
the T's doesn't account for all the reads in the region in IGV. Not sure if
you have use IGV as a gold standard to compare your methylation haplotypes.
Could the aligner (bsmap) be contributing to this?
A second example, from your data:
./scripts/make-mappable-bins.sh BAMfiles/CCT.bam 10
sh scripts/bam2cghap_v1.sh WGBS CCT.RD10_80up.genomecov.bed
allcpg/hg19.fa.allcpgs.txt.gz BAMfiles/CCT.bam CCT
[image: screen shot 2018-01-19 at 3 34 00 pm]
<https://user-images.githubusercontent.com/17771008/35170577-bd9ac338-fd2e-11e7-9242-d9adddc03b06.png>
And here's the haplotype file in the region:
chr11:377279-377364 CCCCCCCCCCCCCC 1 377282,377288,377304,377312,
377314,377316,377321,377324,377336,377340,377344,377346,377348,377356
chr11:377279-377364 CCCCCCCCCCTCCC 1 377282,377288,377304,377312,
377314,377316,377321,377324,377336,377340,377344,377346,377348,377356
chr11:377279-377364 CCCCCCCCTTCCCC 1 377282,377288,377304,377312,
377314,377316,377321,377324,377336,377340,377344,377346,377348,377356
Again, not all the reads in IGV account for the haplotype file, and I have
checked that none of the reads here are PCR duplicates.
Would be interested to hear what you think about this.
Thanks,
Chris
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#5>, or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAgCIICsgmpmJCN9vqHSSd5_gXOnSSQ4ks5tMQftgaJpZM4RlEaV>
.
|
Hi Dinh, Thanks for your detailed reply regarding this. I first reran the CCT sample with the following commands:
and the looking at Here's one example that I have questions about: Reading from the top, the first read has the same name as the third read, and the second read has the same read name as the fourth read. Here is the haplotype information:
The second haplotype is accounted by reads 2 and 4 (reading from the top), but I'm not sure how the first haplotype is accounted for. There is a lot of mismatches in the last two reads, and I'm wondering if the tool is using those mismatches as methylation information for the haplotypes. I see patterns of mismatch in which C->T happens on reads in which G->A is allowed by bisulfite conversion, and vise versa. For example, the third read is Read 2, Positive Strand so it should have G->A mismatches, but I see C->T mismatches instead. Therefore, I'm wondering if your rules of bisulfite mismatch is different than what I (and IGV) expect to see in pair end bislfuite treated data, because upon running my own data on the commands you suggested above and comparing to IGV the haplotype information still doesn't match. Regarding what you said about the coloring in IGV:
I don't think that's quite right. The allowable mismatches from bisulfite-treated pair-end sequencing should be: positive strand, read 1: C->T Because reads from bams are always stored in the positive orientation, the allowable mismatches for negative strands need to be complemented (which is what we see in IGV): negative strand, read 1: G->A Best, |
Hi Chris,
I downloaded the CCT.bam file and visualized it in IGV normal mode and what
I got matches the haplotypes.
[image: Inline image 1]
These BAM files were generated such that forward orientation means that the
read was from the Crick strand and reverse orientation means that the read
was from the Watson strand. Therefore, both read 1 and read 2 in PE reads
would be oriented the same way since they both should have mapped to the
Crick strand. Furthermore, all forward oriented reads (from Crick) will
have C->T and all reverse oriented reads (from Watson) would be G->A
regardless of whether they are read 1 or read 2.
I believe both BisReadMapper and Bismark will output BAM files in this way
because we want to preserve the original strand information for each read.
~ Dinh
…On Fri, Jan 26, 2018 at 12:14 PM, Chris Lo ***@***.***> wrote:
Hi Dinh,
Thanks for your detailed reply regarding this. I first reran the CCT
sample with the following commands:
sh ./scripts/bam2cghap_orig.sh allcpg/hg19.fa.allcpgs.txt.gz
BAMfiles/CCT.bam CCT
touch CCT_hapinfo_list
echo CCT.cgPE.hapinfo.txt > CCT_hapinfo_list
sh ./scripts/cghap2matrix.sh CCT_hapinfo_list MHL CCT_MHB ng.3805/MHBS.txt
and the looking at CCT.cgPE.hapinfo.txt.merged.hapinfo.txt is easier
because I can look at a haplotype region at a time.
Here's one example that I have questions about:
[image: screen shot 2018-01-26 at 2 34 46 pm]
<https://user-images.githubusercontent.com/17771008/35457210-1f741600-02a6-11e8-87fe-061ecb807358.png>
Reading from the top, the first read has the same name as the third read,
and the second read has the same read name as the fourth read. Here is the
haplotype information:
chr6:33048642:33048705 TTT 1 33048653:33048678:33048687
chr6:33048642:33048705 CCC 1 33048653:33048678:33048687
The second haplotype is accounted by reads 2 and 4 (reading from the top),
but I'm not sure how the first haplotype is accounted for. There is a lot
of mismatches in the last two reads, and I'm wondering if the tool is using
those mismatches as methylation information for the haplotypes.
I see patterns of mismatch in which C->T happens on reads in which G->A is
allowed by bisulfite conversion, and vise versa. For example, the third
read is Read 2, Positive Strand so it should have G->A mismatches, but I
see C->T mismatches instead.
Therefore, I'm wondering if your rules of bisulfite mismatch is different
than what I (and IGV) expect to see in pair end bislfuite treated data,
because upon running my own data on the commands you suggested above and
comparing to IGV the haplotype information still doesn't match. Regarding
what you said about the coloring in IGV:
I have used IGV in normal mode, but C (blue) and T (red) on the forward
strand informs the methylation for the CpG positions while G (orange) and A
(green) informs the methylation for the reverse strand.
I don't think that's quite right. The allowable mismatches from
bisulfite-treated pair-end sequencing should be:
positive strand, read 1: C->T
positive strand, read 2: G->A
negative strand, read 1: C->T
negative strand, read 2: G->A
Because reads from bams are always stored in the positive orientation, the
allowable mismatches for negative strands need to be complemented (which is
what we see in IGV):
negative strand, read 1: G->A
negative strand, read 2: C->T
Best,
Chris
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#5 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAgCIM51wpbAI80fQtXAmQDSN13pPboKks5tOjIvgaJpZM4RlEaV>
.
|
Hi Dinh, Is your data generated from a directional bisulfite sequencing protocol, or a non-directional bisulfite sequencing protocol? This might explain the differences we are viewing the data. Best, |
Hi Chris,
The RRBS libraries are non-directional.
Thanks,
Dinh
…On Tue, Jan 30, 2018 at 10:51 AM, Chris Lo ***@***.***> wrote:
Hi Dinh,
Is your data generated from a directional bisulfite sequencing protocol,
or a non-directional bisulfite sequencing protocol? This might explain the
differences we are viewing the data.
Best,
Chris
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#5 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAgCIMn-WMmtHHXSeokofzfnkG0QLE4Lks5tP2TDgaJpZM4RlEaV>
.
|
Hi Dinh,
I'm running your tool to extract methylation haplotypes from bam files, and I'm currently comparing the reads displayed in IGV with the methylation haplotypes generated from your tool. I'm seeing discrepancies, and I'm wondering if you can help me interpret the results.
My bam was aligned by bsmap, and the following command was used to generate the haplotype file:
sh scripts/bam2cghap_v1.sh WGBS ng.3805/MHBS.txt /allcpg/hg19.fa.allcpgs.txt.gz test.bam test
I compared the output with IGV:
The screenshot is from IGV's bisulfite CG mode, in which everything in blue are bases that are not methylated at CG motifs, referring to T's in your haplotype block output. If a base is methylated, it would be colored red, referring to C's in your haplotype block output. I have added the genomic coordinates right below the reads in IGV, and below the screenshot is the output from
bam2cghap_v1.sh
Something isn't matching up: we don't see any reads with C's in IGV, and the T's doesn't account for all the reads in the region in IGV. Not sure if you have use IGV as a gold standard to compare your methylation haplotypes. Could the aligner (bsmap) be contributing to this?
A second example, from your data:
./scripts/make-mappable-bins.sh BAMfiles/CCT.bam 10
sh scripts/bam2cghap_v1.sh WGBS CCT.RD10_80up.genomecov.bed allcpg/hg19.fa.allcpgs.txt.gz BAMfiles/CCT.bam CCT
And here's the haplotype file in the region:
Again, not all the reads in IGV account for the haplotype file, and I have checked that none of the reads here are PCR duplicates.
Would be interested to hear what you think about this.
Thanks,
Chris
The text was updated successfully, but these errors were encountered: