Concordance measures for regions #821

CBeelen · 2022-03-11T22:11:11Z

While analysing the differences between the previous release and the newest release, we found a couple of samples where bad alignments led to missing parts at the start or end of a region. On further investigation, the alignment to the region looked poor in general, so we want to evaluate how well a sampled region matches both its coordinate reference and its best BLAST match. To do this, we will create a new output that reports the concordance between the sampled region and each of these two references.

To do:

For the coordinate reference concordance, step through the entries in nuc.csv and count the positions where the MAX consensus matches the coordinate reference. Ignore all indels.
For the seed concordance, try using the alignment of the consensus to the coordinate regions to find the query positions for each region, and use these to step through the consensus-seed alignment.
For the seed concordance, align each contig (or the consensus, for remapped) to the best BLAST match. Then align the BLAST match to the coordinate reference. Use these two mappings to identify the chunk of the BLAST match and the chunk of the contig that correspond to the region coordinates. Iterate through all the alignment matches within the region, ignoring indels, and count how many nucleotides within the matches are concordant.
If the above method works well, the region coordinates within the individual seeds can be pre-computed instead of being calculated from scratch each time.
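
The counting step in the list above could be sketched roughly as follows. This is only an illustration, not the code that was merged: the function name `count_concordance`, the gapped-string representation of the alignment, and the 0-based half-open region coordinates are all assumptions made for this example.

```python
def count_concordance(aligned_query, aligned_ref, region_start, region_end, gap="-"):
    """Count concordant bases inside a region of the reference.

    Hypothetical helper: both arguments are gapped strings of equal length
    from a pairwise alignment. region_start and region_end are 0-based,
    half-open positions in the UNGAPPED reference.
    Returns (matches, compared), where compared excludes all indel columns.
    """
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")
    ref_pos = 0   # position in the ungapped reference
    matches = 0
    compared = 0
    for q, r in zip(aligned_query, aligned_ref):
        if r == gap:
            # Insertion relative to the reference: ignore the column.
            continue
        in_region = region_start <= ref_pos < region_end
        ref_pos += 1
        if q == gap or not in_region:
            # Deletion relative to the reference, or outside the region.
            continue
        compared += 1
        if q == r:
            matches += 1
    return matches, compared


matches, compared = count_concordance("ACCTTAC-T", "ACGT-ACGT", 0, 8)
concordance = matches / compared if compared else 0.0
```

Returning both counts, rather than a single ratio, leaves it to the caller to decide how to report regions where few or no positions could be compared.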

CBeelen · 2022-03-30T00:05:31Z

Instead of aligning the best BLAST match to the coordinate reference, I first tried using the information we already have about the region alignments. From the coordinate region alignments, we can find the query start and end positions that correspond to each region, then use those to step through the query-seed alignment and count the concordance. However, this runs into trouble if part of the region did not align to the query. I'll therefore go back to our original idea: align the seed to the coordinate reference region and use that alignment to find the seed start and end positions for each region. This can be done either in nucleotide space or in amino acid space. Amino acid space may give a better alignment, but we have to be careful about the skipped position and align all three possible reading frames of the seed.
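
As a rough illustration of finding the seed start and end positions for a region in nucleotide space, here is a minimal sketch. It assumes the seed-to-reference alignment is available as two gapped strings; the name `map_region_to_seed` and the 0-based half-open coordinates are hypothetical and not taken from the actual implementation.

```python
def map_region_to_seed(aligned_ref, aligned_seed, region_start, region_end, gap="-"):
    """Find the seed interval that aligns to a reference region.

    Hypothetical helper: region_start and region_end are 0-based, half-open
    positions in the UNGAPPED reference. Returns (seed_start, seed_end) as
    0-based, half-open positions in the ungapped seed, or None if no seed
    base aligned to the region.
    """
    if len(aligned_ref) != len(aligned_seed):
        raise ValueError("aligned sequences must have the same length")
    ref_pos = 0
    seed_pos = 0
    seed_start = None
    seed_end = None
    for r, s in zip(aligned_ref, aligned_seed):
        ref_in_region = r != gap and region_start <= ref_pos < region_end
        if ref_in_region and s != gap:
            if seed_start is None:
                seed_start = seed_pos   # first seed base inside the region
            seed_end = seed_pos + 1     # one past the last seed base seen
        if r != gap:
            ref_pos += 1
        if s != gap:
            seed_pos += 1
    if seed_start is None:
        return None
    return seed_start, seed_end


coords = map_region_to_seed("--ACGTAC", "TTACG-AC", 1, 5)
```

A `None` result marks the case where the region did not align to the seed at all, so the caller can skip the region or report it separately instead of stepping through an empty interval.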

CBeelen mentioned this issue Mar 30, 2022

Score alignments? #826

Open

3 tasks

CBeelen mentioned this issue Apr 8, 2022

Concordance #830

Merged

CBeelen closed this as completed Apr 19, 2022
