Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calling with multiple bams with well-defined read groups #359

Closed
simonohanlon101 opened this issue Jan 31, 2017 · 11 comments
Closed

Calling with multiple bams with well-defined read groups #359

simonohanlon101 opened this issue Jan 31, 2017 · 11 comments
Labels

Comments

@simonohanlon101
Copy link

Hi,
I previously followed a GATK pipeline, which I have abandoned in favour of freebayes. However, one thing that did make sense to me in GATK, was the notion that the read group identifier should be the same for all reads run on the same flowcell and the same lane, i.e. we can expect these reads to have the same error distribution caused by sequencer cycle. The SM tag can then be used to differentiate reads from different samples that have been multiplexed into the same lane.

In freebayes, it seems that the read group ID tag uniquely identifies reads from a particular sample by also (for instance) including the sample name in the ID tag. So in my case I have a BAM file with reads from the same sample, which was run across multiple lanes, and actually multiple flowcells, so I have read groups for each flowcell.lane combination:

@RG ID:AC6K7AANXX.1 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ
@RG ID:AC6K7AANXX.2 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ
@RG ID:AC6K7AANXX.3 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ
@RG ID:AC8GC2ANXX.8 SM:02.OZ PL:ILLUMINA LB:ngs07_02.OZ

You'll notice the ID tag is <FLOWCELL>.<LANE>. I have another sample which was run on the same flowcell and same lanes, and it's read groups are set up like so:

@RG ID:AC6K7AANXX.1 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.2 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.3 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC8GC2ANXX.8 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ

Currently (as stated in the docs) freebayes appears to require each of these read groups to be unique. Is there some way for freebayes to be updated to also use the SM tag to create unique read groups if duplicates are found. In this way the read groups will maintain compatibility with the GATK model of specifying read groups?

Many thanks for such an excellent piece of software!

Cheers,

Simon

@ekg
Copy link
Collaborator

ekg commented Jan 31, 2017 via email

@simonohanlon101
Copy link
Author

Hi Erik,
No, I observe just one sample output in total. What freebayes does do is ignore any remaining BAMs. If I specify the BAMs on the command line, freebayes does this silently (ie. it will output a VCF file for a single sample - the first BAM specified on the command line):

freebayes -f my_ref.fa 02.OZ.bam 03.OZ.bam

I will get a nice VCF for 02.OZ, but 03.OZ is ignored and does not appear in the output. If I instead specify both the files and the sample names like so and capture the debug output (note in the output below I ran this example on a small region for debug purposes):

freebayes -f my_ref.fa -L my.bams.txt -s my.samples.txt -d 2> fb.debug

I see this:

loading fasta reference my_ref.fa
Opening 2 BAM fomat alignment input files
done
Number of ref seqs: 69
will process reference sequence Supercontig_1.1:1..5000
Number of target regions: 1
ERROR(freebayes): sample 03.OZ listed in sample file /home/sohanlon/fb.samples is not listed in the header of BAM file(s)

However the sample 03.OZ really does exist in the header of the 03.OZ.bam BAM file:

$samtools view -H 03.OZ.bam | grep ^@RG
@RG ID:AC6K7AANXX.1 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.2 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.3 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC8GC2ANXX.8 SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ

So I assume that freebayes does not like that the ID tag of the @RG fields are the same between the different samples (because they were run on the same flowcells and same lanes). It is the SM tag that differentiates them (you can see how my actual @RG lines are set up in my OP above).

As a workaround I am going to reheader my bam files so the @RG:Id for 03.OZ will look like this:

@RG ID:AC6K7AANXX.1.03.OZ SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.2.03.OZ SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC6K7AANXX.3.03.OZ SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ
@RG ID:AC8GC2ANXX.8.03.OZ SM:03.OZ PL:ILLUMINA LB:ngs07_03.OZ

But this will have the annoying side-effect that any tool in the GATK pipeline would not be able to include covariates at the level of flowcell lane in things like base quality score recalibration for instance. I hope my explanation makes sense?

@ekg
Copy link
Collaborator

ekg commented Jan 31, 2017 via email

@yifangt
Copy link

yifangt commented May 23, 2018

I seem to have some issues related to this problem.

$ ERROR(freebayes): multiple samples (SM) map to the same read group (RG)
 samples SF_Trt02 and SF_Trt03 map to SF
As freebayes operates on a virtually merged stream of its input files,
it will not be possible to determine what sample an alignment belongs to at runtime.

To resolve the issue, ensure that RG ids are unique to one sample
across all the input files to freebayes.

See bamaddrg (https://github.com/ekg/bamaddrg) for a method which can
add RG tags to alignments.

My command line and bam_files.list are:

freebayes -L bam_files.list -f genome.fasta -r chr6D_part2:0-23083594 -v BB_vs_SF_freebayes_chr6D_2.vcf

results/SNP/BB_Ctr01_rg.bam
results/SNP/BB_Ctr02_rg.bam
results/SNP/BB_Ctr03_rg.bam
results/SNP/BB_Trt01_rg.bam
results/SNP/BB_Trt02_rg.bam
results/SNP/BB_Trt03_rg.bam
results/SNP/SF_Ctr01_rg.bam
results/SNP/SF_Ctr02_rg.bam
results/SNP/SF_Ctr03_rg.bam
results/SNP/SF_Trt01_rg.bam
results/SNP/SF_Trt02_rg.bam
results/SNP/SF_Trt03_rg.bam

Is there any update on this thread?

@Shenglai
Copy link

I seem to have some issues related to this problem.

$ ERROR(freebayes): multiple samples (SM) map to the same read group (RG)
 samples SF_Trt02 and SF_Trt03 map to SF
As freebayes operates on a virtually merged stream of its input files,
it will not be possible to determine what sample an alignment belongs to at runtime.

To resolve the issue, ensure that RG ids are unique to one sample
across all the input files to freebayes.

See bamaddrg (https://github.com/ekg/bamaddrg) for a method which can
add RG tags to alignments.

My command line and bam_files.list are:

freebayes -L bam_files.list -f genome.fasta -r chr6D_part2:0-23083594 -v BB_vs_SF_freebayes_chr6D_2.vcf

results/SNP/BB_Ctr01_rg.bam
results/SNP/BB_Ctr02_rg.bam
results/SNP/BB_Ctr03_rg.bam
results/SNP/BB_Trt01_rg.bam
results/SNP/BB_Trt02_rg.bam
results/SNP/BB_Trt03_rg.bam
results/SNP/SF_Ctr01_rg.bam
results/SNP/SF_Ctr02_rg.bam
results/SNP/SF_Ctr03_rg.bam
results/SNP/SF_Trt01_rg.bam
results/SNP/SF_Trt02_rg.bam
results/SNP/SF_Trt03_rg.bam

Is there any update on this thread?

Hi, it seems I am having the same issue. To be honest, it is very hard to force users to align the bam files using unique read group ids, especially when gathering bam files from different groups. It would be better to determine samples on platform unit (PU) in my opinion.

@dminkley
Copy link

dminkley commented May 9, 2019

First off, thanks for working on freebayes, it seems great so far!

I'm just going to throw my hat into the ring and say that our group was experiencing the same issue - we had prepped files for GATK and when running freebayes we encountered a 'duplicate ID' issue. I think it would be valuable to at least have an optional flag that allows the user to select what piece of readgroup information to use as a unique identifier. According to the GATK specification, Platform Unit (PU) seems to be the most specific specification, but Sample ID also makes sense to me.

If it's helpful, I've been looking for a open source project to get involved in, and would be happy to poke around the code and try to implement this if you'd like!

@ekg
Copy link
Collaborator

ekg commented May 9, 2019 via email

@dminkley
Copy link

dminkley commented May 9, 2019

Thanks for the quick response!

We didn't intend to change the definition of any of the tags - we were just going off of what the GATK documentation says (which I now appreciate is NOT the standard). In that case ID is <machine>.<lane>; SM corresponds to the actual sample individual/source; LB corresponds to the actual sample prep; and PU corresponds to <machine>.<lane>.<barcode>

I take your point that @sm doesn't make much sense - my understanding was that GATK would use this as a group identifier for recalibration purposes in the absence of something more specific (eg PU). We do indeed have some cases where we have multiple read groups within a sample, which in our case would be differentiated based on both the ID (the read groups came from different lanes) and PU.

Our issue comes from GATK defining ID as <machine id>.<lane>. This implies that different samples could have the same RG ID (in our case in different bam files corresponding to different samples) if they were multiplexed in such a way that different samples were processed in the same lane. In this case being able to use something like PU would allow different read entities to be uniquely differentiated.

I know my discussion of GATK's practice isn't especially relevant to freebayes (and is probably frustrating), and the approach you've taken seems to make more sense TBH. We're already in the process of re-naming our IDs to work with freebayes so it's not a huge issue by any means. My intention was to just add my support to the list of other users who have encountered the same issue/circumstances.

@ZhenyuZ
Copy link

ZhenyuZ commented May 10, 2019

Similar issue here.

I totally agree that freebayes is implemented correctly, and these are data issues as the BAM file do not satisfy the uniqueness requirement in SAM Spec. However, in the real world, people generated data with these violations, and it is very difficult to fix read group ID in BAMs.

I am wondering if it is possible to implement a feature that can allow users to specify uniqueness of read group, that could maximize the compatibility of freebayes with those imperfect data input. just my 2 cents

@github-actions
Copy link

This issue is marked stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days

@github-actions github-actions bot added the Stale label Dec 11, 2020
@github-actions
Copy link

This issue was closed for lack of activity. Feel free to re-open if someone feels like working on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants