Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experimental tool that aligns sequencing reads to set of haplotypes / templates #8305

Merged

Conversation

ilyasoifer
Copy link
Collaborator

This tool is useful for alignment of the reads to a large set of templates that are very close in sequence. In this case the alignment is close to the problem that the HaplotypeCaller is solving - of calculating the likelihood of the read compared to a set of close haplotypes.
The tool outputs read x haplotype matrix for a set of reads and haplotypes using either FlowPairHMM or FlowBasedAlignmentEngine

* FlowPairHMM and FlowBasedAlignment (FBA), but can be easily extended.
* At present, there are two output formats that can be specified using parameter --output-format: extended and concise.
* The extended format contains a readxhaplotype matrix that shows alignment score of each read versus each haplotype.
* Condensed format will following columns (not nessarily in this order) for each processed read:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Condensed format will following columns (not nessarily in this order) for each processed read:
* Condensed format contains the following columns (not nessarily in this order) for each processed read:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the columns are in any order? It looks like they're in a fixed order from the writer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

*
* <h3>Usage examples</h3>
* <pre>
* java -Xms3000m -jar ~{gitc_path}/GATK_ultima.jar FlowPairHMMAlignReadsToHaplotypes \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* java -Xms3000m -jar ~{gitc_path}/GATK_ultima.jar FlowPairHMMAlignReadsToHaplotypes \
* gatk FlowPairHMMAlignReadsToHaplotypes \

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

* <pre>
* java -Xms3000m -jar ~{gitc_path}/GATK_ultima.jar FlowPairHMMAlignReadsToHaplotypes \
* -H ~{haplotype_list} -O ~{base_file_name}.matches.tsv \
* -I ~{input_bam} "--flow-use-t0-tag -E FBA \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* -I ~{input_bam} "--flow-use-t0-tag -E FBA \
* -I ~{input_bam} --flow-use-t0-tag -E FBA \

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

* -H ~{haplotype_list} -O ~{base_file_name}.matches.tsv \
* -I ~{input_bam} "--flow-use-t0-tag -E FBA \
* --flow-fill-empty-bins-value 0.00001 --flow-probability-threshold 0.00001 \
* --flow-likelihood-optimized-comp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* --flow-likelihood-optimized-comp"
* --flow-likelihood-optimized-comp

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

private static final Logger logger = LogManager.getLogger(FlowPairHMMAlignReadsToHaplotypes.class);

@Argument(fullName = "haplotypes", shortName = "H", doc="Fasta file with haplotypes")
public GATKPath haplotypesFa;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can all be public static final

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

java does not like that - it says that the variable may not have been initialized

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

java does not like that - it says that the variable may not have been initialized

@Argument(fullName = "output-format", doc="concise or expanded output format: " +
"expanded - output full read x haplotype, concise - output for each read best haplotype and " +
"score differences from the next best and the reference haplotype", optional=true)
public String outputFormat = "expanded";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be an enum. Or since there's only two options you could make it a boolean with "expanded" as the default (eg. Boolean conciseOutputFormat = false)

Same comment below for the aligner argument.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, fixed!

public FlowBasedAlignmentArgumentCollection fbargs = new FlowBasedAlignmentArgumentCollection();



Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@ArgumentCollection



LikelihoodEngineArgumentCollection likelihoodArgs = new LikelihoodEngineArgumentCollection();
SAMFileHeader sequenceHeader;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can make these private static final

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

//in general reads are already trimmed to the haplotype starts and ends so diff_left <= 0 and diff_right <= 0
fbr.applyBaseClipping(Math.max(0, diffLeft), Math.max(diffRight, 0), false);
}
// final int haplotypeStart = processedHaplotypes.get(0).getStart();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be uncommented?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually is not necessary in the context of the GATK

@ilyasoifer
Copy link
Collaborator Author

@meganshand - all fixed, PTAL
Thanks!

@meganshand meganshand merged commit 0ff9b91 into broadinstitute:master May 25, 2023
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants