Calling with multiple bams with well-defined read groups #359
When you try running freebayes you observe that there is one sample output
per read group? I believe this would be a serious bug so please let me
know. If not, then the bug is in the documentation and we need to resolve
that so that it correctly explains what it is doing.
…On Tue, Jan 31, 2017, 10:51 AM simonohanlon101 ***@***.***> wrote:
Hi,
I previously followed a GATK pipeline, which I have abandoned in favour of
freebayes. However, one thing that did make sense to me in GATK, was the
notion that the read group identifier should be the same for all reads run
on the same flowcell and the same lane, i.e. we can expect these reads to
have the same error distribution caused by sequencer cycle. The SM tag
can then be used to differentiate reads from different samples that have
been multiplexed into the same lane.
In freebayes, it seems that the read group ID tag uniquely identifies
reads from a particular sample by also (for instance) including the sample
name in the ID tag. So in my case I have a BAM file with reads from the
same sample, which was run across multiple lanes, and actually multiple
flowcells, so I have read groups for each flowcell.lane combination:
@RG ID:AC6K7AANXX.1 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ
@RG ID:AC6K7AANXX.2 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ
@RG ID:AC6K7AANXX.3 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ
@RG ID:AC8GC2ANXX.8 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ
You'll notice the ID tag is <FLOWCELL>.<LANE>. I have another sample
which was run on the same flowcells and same lanes, and its read groups are
set up like so:
@RG ID:AC6K7AANXX.1 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.2 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.3 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC8GC2ANXX.8 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
Currently (as stated in the docs) freebayes appears to require each of
these read group IDs to be unique across samples. Is there some way for
freebayes to be updated to also use the SM tag to create unique read groups
if duplicate IDs are found? In this way the read groups would remain
compatible with the GATK model of specifying read groups.
Many thanks for such an excellent piece of software!
Cheers,
Simon
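The collision described in this post can be checked before running a caller. The following is a minimal Python sketch, assuming the header text of each BAM has already been dumped to a string (for instance with `samtools view -H`). The function names `parse_read_groups` and `find_shared_ids` are hypothetical helpers for illustration and are not part of freebayes.

```python
from collections import defaultdict


def parse_read_groups(header_text):
    """Return {ID: SM} for every @RG line in a SAM header string."""
    groups = {}
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Fields are TAG:VALUE pairs. The SAM spec separates them with tabs;
        # split() also accepts the spaces used in the listing above, but it
        # would break values that contain spaces (e.g. a DS description).
        fields = dict(f.split(":", 1) for f in line.split()[1:] if ":" in f)
        groups[fields["ID"]] = fields.get("SM", "")
    return groups


def find_shared_ids(headers):
    """headers maps file name -> header text.

    Returns {ID: sorted sample names} for every read group ID that is
    attached to more than one sample across the given files.
    """
    samples_by_id = defaultdict(set)
    for text in headers.values():
        for rg_id, sample in parse_read_groups(text).items():
            samples_by_id[rg_id].add(sample)
    return {rg: sorted(s) for rg, s in samples_by_id.items() if len(s) > 1}


headers = {
    "02.OZ.bam": "@RG ID:AC6K7AANXX.1 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ",
    "03.OZ.bam": "@RG ID:AC6K7AANXX.1 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ",
}
print(find_shared_ids(headers))
```

With the two headers above, the ID `AC6K7AANXX.1` is reported as shared by `02.OZ` and `03.OZ`, which is the situation the post describes.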
So the rule you'd suggest would be to identify alignments by sample, read
group, and file, merging cases where sample and read group are the same?
Is there any way that this might cause problems for existing users?
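One way to read the rule being asked about is sketched below in Python. This is only an interpretation of the question, not a description of how freebayes resolves samples. The function `group_read_groups` and its input format (one `(file_name, rg_id, sample)` tuple per @RG header line) are assumptions made for illustration.

```python
from collections import defaultdict


def group_read_groups(records):
    """records: iterable of (file_name, rg_id, sample), one per @RG line.

    Returns (lookup, merged):
      lookup maps (file_name, rg_id) -> sample, so a read is resolved using
        the file it came from as well as its RG tag;
      merged maps (sample, rg_id) -> sorted file names, so a read group of
        one sample that is split across files is counted once.
    """
    lookup = {}
    merged = defaultdict(set)
    for file_name, rg_id, sample in records:
        lookup[(file_name, rg_id)] = sample
        merged[(sample, rg_id)].add(file_name)
    return lookup, {key: sorted(files) for key, files in merged.items()}


records = [
    ("02.OZ.bam", "AC6K7AANXX.1", "02.OZ"),
    ("03.OZ.bam", "AC6K7AANXX.1", "03.OZ"),
    ("02.OZ.part2.bam", "AC6K7AANXX.1", "02.OZ"),  # hypothetical split file
]
lookup, merged = group_read_groups(records)
```

Here the same ID resolves to `02.OZ` or `03.OZ` depending on the file, while the two files that carry sample `02.OZ` under that ID collapse into a single `(sample, read group)` entry.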
…On Tue, Jan 31, 2017, 2:36 PM simonohanlon101 ***@***.***> wrote:
Hi Erik,
No, I observe just one sample output in total. What freebayes *does* do
is ignore any remaining BAMs. If I specify the BAMs on the command line,
freebayes does this silently (ie. it will output a VCF file for a single
sample - the first BAM specified on the command line):
freebayes -f my_ref.fa 02.OZ.bam 03.OZ.bam
I will get a nice VCF for 02.OZ, but 03.OZ is ignored and does not appear
in the output. If I instead specify both the files and the sample names
like so and capture the debug output (note in the output below I ran this
example on a small region for debug purposes):
freebayes -f my_ref.fa -L my.bams.txt -s my.samples.txt -d 2> fb.debug
I see this:
loading fasta reference my_ref.fa
Opening 2 BAM fomat alignment input files
done
Number of ref seqs: 69
will process reference sequence Supercontig_1.1:1..5000
Number of target regions: 1
ERROR(freebayes): sample 03.OZ listed in sample file
/home/sohanlon/fb.samples is not listed in the header of BAM file(s)
However the sample 03.OZ really does exist in the header of the 03.OZ.bam
BAM file:
$samtools view -H 03.OZ.bam | grep ***@***.***
@RG ID:AC6K7AANXX.1 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.2 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.3 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC8GC2ANXX.8 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
So I assume that freebayes does not like that the ID tags of the @RG
fields are the same between the different samples (because they were run on
the same flowcells and same lanes). It is the SM tag that differentiates
them (you can see how my actual @RG lines are set up in my OP above).
As a workaround I am going to reheader my BAM files so the @RG ID for
03.OZ will look like this:
@RG ID:AC6K7AANXX.1.03.OZ SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.2.03.OZ SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.3.03.OZ SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC8GC2ANXX.8.03.OZ SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
But this will have the annoying side-effect that any tool in the GATK
pipeline would not be able to include covariates at the level of flowcell
lane in things like base quality score recalibration for instance. I hope
my explanation makes sense?
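The renaming in this workaround has to touch both the @RG header lines and the `RG:Z:` tag on every read, since the reads refer to their read group by ID. The following is a minimal Python sketch operating on tab-separated SAM text (such as the output of `samtools view -h`). The function name `make_ids_unique` is hypothetical, and the sketch assumes every @RG line carries both ID and SM.

```python
def make_ids_unique(sam_lines):
    """Append the SM value to each @RG ID and update reads' RG:Z tags to match.

    sam_lines: iterable of tab-separated SAM lines (header and alignments).
    Returns a list of rewritten lines without trailing newlines.
    """
    renamed = {}  # old read group ID -> new read group ID
    out = []
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@RG":
            tags = dict(f.split(":", 1) for f in fields[1:])
            new_id = f"{tags['ID']}.{tags['SM']}"
            renamed[tags["ID"]] = new_id
            fields = ["@RG"] + [
                f"ID:{new_id}" if f.startswith("ID:") else f for f in fields[1:]
            ]
        elif not fields[0].startswith("@"):
            # The first 11 columns of an alignment are mandatory fields;
            # optional tags such as RG:Z: can only appear after them.
            fields = fields[:11] + [
                f"RG:Z:{renamed.get(f[5:], f[5:])}" if f.startswith("RG:Z:") else f
                for f in fields[11:]
            ]
        out.append("\t".join(fields))
    return out


example = [
    "@RG\tID:AC6K7AANXX.1\tSM:03.OZ\tPL:ILLUMINA\tLB:ngs07_03.OZ",
    "read1\t0\tSupercontig_1.1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:AC6K7AANXX.1",
]
rewritten = make_ids_unique(example)
```

For real data, a dedicated tool is the safer route; this sketch only illustrates which parts of the file the renaming has to change.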
I seem to have some issues related to this problem.
My command line and bam_files.list are:
Is there any update on this thread?
Hi, it seems I am having the same issue. To be honest, it is very hard to force users to align their BAM files using unique read group IDs, especially when gathering BAM files from different groups. In my opinion, it would be better to distinguish samples by platform unit (PU).
First off, thanks for working on freebayes, it seems great so far! I'm just going to throw my hat into the ring and say that our group was experiencing the same issue: we had prepped files for GATK, and when running freebayes we encountered a 'duplicate ID' issue. I think it would be valuable to at least have an optional flag that allows the user to select which piece of read group information to use as a unique identifier. According to the GATK specification, Platform Unit (PU) seems to be the most specific field, but Sample ID also makes sense to me. If it's helpful, I've been looking for an open source project to get involved in, and would be happy to poke around the code and try to implement this if you'd like!
Each sample (SM) can contain a number of read groups (RG). This has been a
typical usage of these tags for a long time. The alignments aren't tagged
with their sample, but with the read group. This is unfortunately how
things were done, because many tools were using read groupings for
recalibration or in their variant calling model, so sample was too coarse a
division.
Does your data setup require you to have multiple samples in the same read
group? Did you intend to do this? And if so, what is the reason for your
change in the interpretation of @SM and @RG?
Note that the GATK documentation isn't the standard for the BAM format. You
should refer to the specs here:
https://github.com/samtools/hts-specs
If you are having this problem, use a read group setter in Picard, or
bamaddrg, which the error message suggests you use. I'm not inclined to
change this behavior unless there has been a change to the standard.
…On Thu, May 9, 2019 at 7:31 PM dminkley ***@***.***> wrote:
I'm just going to throw my hat into the ring and say that our group was
experiencing the same issue - we had prepped files for GATK and when
running freebayes we encountered a 'duplicate ID' issue. I think it would
be valuable to at least have an optional flag that allows the user to select
what piece of readgroup information to use as a unique identifier.
According to the GATK specification, Platform Unit (PU) seems to be the
most specific specification, but Sample ID also makes sense to me.
Thanks for the quick response! We didn't intend to change the definition of any of the tags; we were just going off of what the GATK documentation says (which I now appreciate is NOT the standard). In that scheme, ID is <machine>.<lane>; SM corresponds to the actual sample individual/source; LB corresponds to the actual sample prep; and PU corresponds to <machine>.<lane>.<barcode>.
I take your point that @SM doesn't make much sense. My understanding was that GATK would use this as a group identifier for recalibration purposes in the absence of something more specific (e.g. PU). We do indeed have some cases where we have multiple read groups within a sample, which in our case would be differentiated based on both the ID (the read groups came from different lanes) and PU.
Our issue comes from GATK defining ID as <machine id>.<lane>. This implies that different samples could have the same RG ID (in our case in different BAM files corresponding to different samples) if they were multiplexed in such a way that different samples were processed in the same lane. In this case, being able to use something like PU would allow different read groups to be uniquely differentiated.
I know my discussion of GATK's practice isn't especially relevant to freebayes (and is probably frustrating), and the approach you've taken seems to make more sense TBH. We're already in the process of renaming our IDs to work with freebayes, so it's not a huge issue by any means. My intention was just to add my support to the list of other users who have encountered the same issue/circumstances.
Similar issue here. I totally agree that freebayes is implemented correctly, and that these are data issues, as the BAM files do not satisfy the uniqueness requirement in the SAM spec. However, in the real world, people have generated data with these violations, and it is very difficult to fix read group IDs in BAMs. I am wondering if it is possible to implement a feature that allows users to specify what makes a read group unique, which could maximize the compatibility of freebayes with such imperfect input data. Just my 2 cents.
This issue is marked stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed for lack of activity. Feel free to re-open if someone feels like working on it.