Calling with multiple bams with well-defined read groups #359

simonohanlon101 · 2017-01-31T10:51:51Z

Hi,
I previously followed a GATK pipeline, which I have abandoned in favour of freebayes. However, one thing that did make sense to me in GATK, was the notion that the read group identifier should be the same for all reads run on the same flowcell and the same lane, i.e. we can expect these reads to have the same error distribution caused by sequencer cycle. The SM tag can then be used to differentiate reads from different samples that have been multiplexed into the same lane.

In freebayes, it seems that the read group ID tag uniquely identifies reads from a particular sample by also (for instance) including the sample name in the ID tag. So in my case I have a BAM file with reads from the same sample, which was run across multiple lanes, and actually multiple flowcells, so I have read groups for each flowcell.lane combination:

@RG ID:AC6K7AANXX.1 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ
@RG ID:AC6K7AANXX.2 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ
@RG ID:AC6K7AANXX.3 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ
@RG ID:AC8GC2ANXX.8 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ

You'll notice the ID tag is <FLOWCELL>.<LANE>. I have another sample which was run on the same flowcell and same lanes, and it's read groups are set up like so:

@RG ID:AC6K7AANXX.1 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.2 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.3 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC8GC2ANXX.8 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ

Currently (as stated in the docs) freebayes appears to require each of these read groups to be unique. Is there some way for freebayes to be updated to also use the SM tag to create unique read groups if duplicates are found. In this way the read groups will maintain compatibility with the GATK model of specifying read groups?

Many thanks for such an excellent piece of software!

Cheers,

Simon

The text was updated successfully, but these errors were encountered:

ekg · 2017-01-31T13:44:55Z

When you try running freebayes you observe that there is one sample output per read group? I believe this would be a serious bug so please let me know. If not, then the bug is in the documentation and we need to resolve that so that it correctly explains what it is doing.

…

On Tue, Jan 31, 2017, 10:51 AM simonohanlon101 ***@***.***> wrote: Hi, I previously followed a GATK pipeline, which I have abandoned in favour of freebayes. However, one thing that did make sense to me in GATK, was the notion that the read group identifier should be the same for all reads run on the same flowcell and the same lane, i.e. we can expect these reads to have the same error distribution caused by sequencer cycle. The SM tag can then be used to differentiate reads from different samples that have been multiplexed into the same lane. In freebayes, it seems that the read group ID tag uniquely identifies reads from a particular sample by also (for instance) including the sample name in the ID tag. So in my case I have a BAM file with reads from the same sample, which was run across multiple lanes, and actually multiple flowcells, so I have read groups for each flowcell.lane combination: @rg ID:AC6K7AANXX.1 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ @rg ID:AC6K7AANXX.2 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ @rg ID:AC6K7AANXX.3 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ @rg ID:AC8GC2ANXX.8 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ You'll notice the ID tag is <FLOWCELL>.<LANE>. I have another sample which was run on the same flowcell and same lanes, and it's read groups are set up like so: @rg ID:AC6K7AANXX.1 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ @rg ID:AC6K7AANXX.2 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ @rg ID:AC6K7AANXX.3 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ @rg ID:AC8GC2ANXX.8 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ Currently (as stated in the docs) freebayes appears to require each of these read groups to be unique. Is there some way for freebayes to be updated to also use the SM tag to create unique read groups if duplicates are found. In this way the read groups will maintain compatibility with the GATK model of specifying read groups? Many thanks for such an excellent piece of software! Cheers, Simon — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#359>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAI4EaUZ2f1OcoHXWyNTbnVQO3aSRCL7ks5rXxJHgaJpZM4LyesU> .

simonohanlon101 · 2017-01-31T14:36:40Z

Hi Erik,
No, I observe just one sample output in total. What freebayes does do is ignore any remaining BAMs. If I specify the BAMs on the command line, freebayes does this silently (ie. it will output a VCF file for a single sample - the first BAM specified on the command line):

freebayes -f my_ref.fa 02.OZ.bam 03.OZ.bam

I will get a nice VCF for 02.OZ, but 03.OZ is ignored and does not appear in the output. If I instead specify both the files and the sample names like so and capture the debug output (note in the output below I ran this example on a small region for debug purposes):

freebayes -f my_ref.fa -L my.bams.txt -s my.samples.txt -d 2> fb.debug

I see this:

loading fasta reference my_ref.fa
Opening 2 BAM fomat alignment input files
done
Number of ref seqs: 69
will process reference sequence Supercontig_1.1:1..5000
Number of target regions: 1
ERROR(freebayes): sample 03.OZ listed in sample file /home/sohanlon/fb.samples is not listed in the header of BAM file(s)

However the sample 03.OZ really does exist in the header of the 03.OZ.bam BAM file:

$samtools view -H 03.OZ.bam | grep ^@RG
@RG ID:AC6K7AANXX.1 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.2 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.3 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC8GC2ANXX.8 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ

So I assume that freebayes does not like that the ID tag of the @RG fields are the same between the different samples (because they were run on the same flowcells and same lanes). It is the SM tag that differentiates them (you can see how my actual @RG lines are set up in my OP above).

As a workaround I am going to reheader my bam files so the @RG:Id for 03.OZ will look like this:

@RG ID:AC6K7AANXX.1.03.OZ SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.2.03.OZ SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.3.03.OZ SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC8GC2ANXX.8.03.OZ SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ

But this will have the annoying side-effect that any tool in the GATK pipeline would not be able to include covariates at the level of flowcell lane in things like base quality score recalibration for instance. I hope my explanation makes sense?

ekg · 2017-01-31T14:56:45Z

So the rule you'd suggest would be to identify alignments by sample, read group, and file, merging cases where sample and read group are the same? Is there any way that this might cause problems for existing users?

…

On Tue, Jan 31, 2017, 2:36 PM simonohanlon101 ***@***.***> wrote: Hi Erik, No, I observe just one sample output in total. What freebayes *does* do is ignore any remaining BAMs. If I specify the BAMs on the command line, freebayes does this silently (ie. it will output a VCF file for a single sample - the first BAM specified on the command line): freebayes -f my_ref.fa 02.OZ.bam 03.OZ.bam I will get a nice VCF for 02.OZ, but 03.OZ is ignored and does not appear in the output. If I instead specify both the files and the sample names like so and capture the debug output (note in the output below I ran this example on a small region for debug purposes): freebayes -f my_ref.fa -L my.bams.txt -s my.samples.txt -d 2> fb.debug I see this: loading fasta reference my_ref.fa Opening 2 BAM fomat alignment input files done Number of ref seqs: 69 will process reference sequence Supercontig_1.1:1..5000 Number of target regions: 1 ERROR(freebayes): sample 03.OZ listed in sample file /home/sohanlon/fb.samples is not listed in the header of BAM file(s) However the sample 03.OZ really does exist in the header of the 03.OZ.bam BAM file: $samtools view -H 03.OZ.bam | grep ***@***.*** @rg ID:AC6K7AANXX.1 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ @rg ID:AC6K7AANXX.2 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ @rg ID:AC6K7AANXX.3 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ @rg ID:AC8GC2ANXX.8 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ So I assume that freebayes does not like that the ID tag of the @rg fields are the same between the different samples (because they were run on the same flowcells and same lanes). It is the SM tag that differentiates them (you can see how my actual @rg lines are set up in my OP above). As a workaround I am going to reheader my bam files so the @rg:Id for 03.OZ will look like this: @rg ID:AC6K7AANXX.1.03.OZ SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ @rg ID:AC6K7AANXX.2.03.OZ SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ @rg ID:AC6K7AANXX.3.03.OZ SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ @rg ID:AC8GC2ANXX.8.03.OZ SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ But this will have the annoying side-effect that any tool in the GATK pipeline would not be able to include covariates at the level of flowcell lane in things like base quality score recalibration for instance. I hope my explanation makes sense? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#359 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAI4Ed0Hkzg7H3svZOQtXUoESIFv9XrZks5rX0b5gaJpZM4LyesU> .

yifangt · 2018-05-23T21:32:50Z

I seem to have some issues related to this problem.

$ ERROR(freebayes): multiple samples (SM) map to the same read group (RG)
 samples SF_Trt02 and SF_Trt03 map to SF
As freebayes operates on a virtually merged stream of its input files,
it will not be possible to determine what sample an alignment belongs to at runtime.

To resolve the issue, ensure that RG ids are unique to one sample
across all the input files to freebayes.

See bamaddrg (https://github.com/ekg/bamaddrg) for a method which can
add RG tags to alignments.

My command line and bam_files.list are:

freebayes -L bam_files.list -f genome.fasta -r chr6D_part2:0-23083594 -v BB_vs_SF_freebayes_chr6D_2.vcf

results/SNP/BB_Ctr01_rg.bam
results/SNP/BB_Ctr02_rg.bam
results/SNP/BB_Ctr03_rg.bam
results/SNP/BB_Trt01_rg.bam
results/SNP/BB_Trt02_rg.bam
results/SNP/BB_Trt03_rg.bam
results/SNP/SF_Ctr01_rg.bam
results/SNP/SF_Ctr02_rg.bam
results/SNP/SF_Ctr03_rg.bam
results/SNP/SF_Trt01_rg.bam
results/SNP/SF_Trt02_rg.bam
results/SNP/SF_Trt03_rg.bam

Is there any update on this thread?

Shenglai · 2019-01-28T17:17:52Z

I seem to have some issues related to this problem.

$ ERROR(freebayes): multiple samples (SM) map to the same read group (RG)
 samples SF_Trt02 and SF_Trt03 map to SF
As freebayes operates on a virtually merged stream of its input files,
it will not be possible to determine what sample an alignment belongs to at runtime.

To resolve the issue, ensure that RG ids are unique to one sample
across all the input files to freebayes.

See bamaddrg (https://github.com/ekg/bamaddrg) for a method which can
add RG tags to alignments.

My command line and bam_files.list are:

freebayes -L bam_files.list -f genome.fasta -r chr6D_part2:0-23083594 -v BB_vs_SF_freebayes_chr6D_2.vcf

results/SNP/BB_Ctr01_rg.bam
results/SNP/BB_Ctr02_rg.bam
results/SNP/BB_Ctr03_rg.bam
results/SNP/BB_Trt01_rg.bam
results/SNP/BB_Trt02_rg.bam
results/SNP/BB_Trt03_rg.bam
results/SNP/SF_Ctr01_rg.bam
results/SNP/SF_Ctr02_rg.bam
results/SNP/SF_Ctr03_rg.bam
results/SNP/SF_Trt01_rg.bam
results/SNP/SF_Trt02_rg.bam
results/SNP/SF_Trt03_rg.bam

Is there any update on this thread?

Hi, it seems I am having the same issue. To be honest, it is very hard to force users to align the bam files using unique read group ids, especially when gathering bam files from different groups. It would be better to determine samples on platform unit (PU) in my opinion.

dminkley · 2019-05-09T17:31:50Z

First off, thanks for working on freebayes, it seems great so far!

I'm just going to throw my hat into the ring and say that our group was experiencing the same issue - we had prepped files for GATK and when running freebayes we encountered a 'duplicate ID' issue. I think it would be valuable to at least have an optional flag that allows the user to select what piece of readgroup information to use as a unique identifier. According to the GATK specification, Platform Unit (PU) seems to be the most specific specification, but Sample ID also makes sense to me.

If it's helpful, I've been looking for a open source project to get involved in, and would be happy to poke around the code and try to implement this if you'd like!

ekg · 2019-05-09T18:10:58Z

Each sample (SM) can contain a number of read groups (RG). This has been a typical usage of these tags for a long time. The alignments aren't tagged with their sample, but with the read group. This is unfortunately how things were done, because many tools were using read groupings for recalibration or in their variant calling model, so sample was too coarse a division. Does your data setup require you to have multiple samples in the same read group? Did you intend to do this? And if so, what is the reason for your change in the interpretation of @sm and @rg? Note that the GATK documentation isn't the standard for the BAM format. You should refer to the specs here: https://github.com/samtools/hts-specs If you are having this problems, use a read group setter in Picard or bamaddrg which the error message suggests you use. I'm not inclined to change this behavior unless there has been a change to the standard.

…

On Thu, May 9, 2019 at 7:31 PM dminkley ***@***.***> wrote: I'm just going to throw my hat into the ring and say that our group was experiencing the same issue - we had prepped files for GATK and when running freebayes we encountered a 'duplicate ID' issue. I think it would be valuable to at least have a optional flag that allows the user to select what piece of readgroup information to use as a unique identifier. According to the GATK specification, Platform Unit (PU) seems to be the most specific specification, but Sample ID also makes sense to me. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#359 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AABDQENHXFU5TMY54U7XKV3PURNYNANCNFSM4C6J5MKA> .

dminkley · 2019-05-09T18:41:13Z

Thanks for the quick response!

We didn't intend to change the definition of any of the tags - we were just going off of what the GATK documentation says (which I now appreciate is NOT the standard). In that case ID is <machine>.<lane>; SM corresponds to the actual sample individual/source; LB corresponds to the actual sample prep; and PU corresponds to <machine>.<lane>.<barcode>

I take your point that @sm doesn't make much sense - my understanding was that GATK would use this as a group identifier for recalibration purposes in the absence of something more specific (eg PU). We do indeed have some cases where we have multiple read groups within a sample, which in our case would be differentiated based on both the ID (the read groups came from different lanes) and PU.

Our issue comes from GATK defining ID as <machine id>.<lane>. This implies that different samples could have the same RG ID (in our case in different bam files corresponding to different samples) if they were multiplexed in such a way that different samples were processed in the same lane. In this case being able to use something like PU would allow different read entities to be uniquely differentiated.

I know my discussion of GATK's practice isn't especially relevant to freebayes (and is probably frustrating), and the approach you've taken seems to make more sense TBH. We're already in the process of re-naming our IDs to work with freebayes so it's not a huge issue by any means. My intention was to just add my support to the list of other users who have encountered the same issue/circumstances.

ZhenyuZ · 2019-05-10T17:13:28Z

Similar issue here.

I totally agree that freebayes is implemented correctly, and these are data issues as the BAM file do not satisfy the uniqueness requirement in SAM Spec. However, in the real world, people generated data with these violations, and it is very difficult to fix read group ID in BAMs.

I am wondering if it is possible to implement a feature that can allow users to specify uniqueness of read group, that could maximize the compatibility of freebayes with those imperfect data input. just my 2 cents

github-actions · 2020-12-11T18:14:10Z

This issue is marked stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions · 2020-12-18T01:59:02Z

This issue was closed for lack of activity. Feel free to re-open if someone feels like working on it.

Shenglai mentioned this issue Feb 5, 2019

[Freebayes]: duplicated readgroup ids are not allowed for freebayes uc-cdis/genomel_pipelines#42

Closed

github-actions bot added the Stale label Dec 11, 2020

github-actions bot closed this as completed Dec 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Calling with multiple bams with well-defined read groups #359

Calling with multiple bams with well-defined read groups #359

simonohanlon101 commented Jan 31, 2017

ekg commented Jan 31, 2017 via email

simonohanlon101 commented Jan 31, 2017

ekg commented Jan 31, 2017 via email

yifangt commented May 23, 2018 •

edited

Shenglai commented Jan 28, 2019

dminkley commented May 9, 2019 •

edited

ekg commented May 9, 2019 via email

dminkley commented May 9, 2019 •

edited

ZhenyuZ commented May 10, 2019 •

edited

github-actions bot commented Dec 11, 2020

github-actions bot commented Dec 18, 2020

Calling with multiple bams with well-defined read groups #359

Calling with multiple bams with well-defined read groups #359

Comments

simonohanlon101 commented Jan 31, 2017

ekg commented Jan 31, 2017 via email

simonohanlon101 commented Jan 31, 2017

ekg commented Jan 31, 2017 via email

yifangt commented May 23, 2018 • edited

Shenglai commented Jan 28, 2019

dminkley commented May 9, 2019 • edited

ekg commented May 9, 2019 via email

dminkley commented May 9, 2019 • edited

ZhenyuZ commented May 10, 2019 • edited

github-actions bot commented Dec 11, 2020

github-actions bot commented Dec 18, 2020

yifangt commented May 23, 2018 •

edited

dminkley commented May 9, 2019 •

edited

dminkley commented May 9, 2019 •

edited

ZhenyuZ commented May 10, 2019 •

edited