Concordance measures for regions #821

Closed
4 tasks done
CBeelen opened this issue Mar 11, 2022 · 1 comment

Comments

@CBeelen
Contributor

CBeelen commented Mar 11, 2022

When analysing the differences between the previous release and the newest release, we found a couple of samples where bad alignments led to missing parts at the start or end of the region. Upon further investigation, we noticed that the alignment to the region looked poor in general, so we want to evaluate how well a sampled region matches both its coordinate reference and its best blast match. To do this, we will create a new output that reports two concordance values for each sampled region: one against the coordinate reference and one against the best blast match.

To do:

  • For the coordinate reference concordance, step through the entries for nuc.csv and count the matches of the MAX consensus to the coordinate reference. Ignore all indels.
  • For the seed concordance, try using the alignment of the consensus to the coordinate regions to figure out query positions for each region, and use these to step through the consensus-seed alignment.
  • For the seed concordance, align each contig / consensus (for remapped) to the best blast match. Then, align the blast match to the coordinate reference. Use these mappings to identify the chunk of the blast match and the contig that correspond to the region coordinates, respectively. Iterate through all the alignment matches within the region, ignoring indels, and count how many nucleotides within the matches are concordant.
  • If the above method works well, the region coordinates within the individual seeds can be pre-computed instead of calculating them from scratch each time.
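
The counting step shared by these tasks can be sketched as follows. This is only a minimal illustration, not the project's implementation: it assumes the alignment is available as two equal-length gapped strings with `-` marking a gap, and the function name `region_concordance` and its parameters are hypothetical.

```python
def region_concordance(aligned_query, aligned_ref, region_start, region_end):
    """Count concordant bases inside a reference region, ignoring indels.

    Assumption: both inputs are gapped strings of equal length, '-' is a gap.
    region_start and region_end are 0-based, end-exclusive reference positions.
    Returns (matches, compared).
    """
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    ref_pos = 0   # position in the ungapped reference
    matches = 0
    compared = 0
    for query_base, ref_base in zip(aligned_query, aligned_ref):
        if ref_base == '-':
            # Insertion relative to the reference: skip, no reference position.
            continue
        if query_base != '-' and region_start <= ref_pos < region_end:
            # Aligned column inside the region (deletions are skipped).
            compared += 1
            if query_base == ref_base:
                matches += 1
        ref_pos += 1
    return matches, compared


matches, compared = region_concordance("ACG-TTAGG", "ACGAT-AGC", 2, 8)
print(f"{matches} of {compared} aligned positions are concordant")
```

Returning the number of compared positions alongside the match count makes it possible to tell a region with low concordance apart from a region that barely aligned at all.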
@CBeelen
Contributor Author

CBeelen commented Mar 30, 2022

Instead of aligning the best blast match to the coordinate reference, I first tried using the information we already have about the region alignments. From the coordinate region alignments, we can find the query start and end positions that correspond to each region, then use those to step through the query-seed alignment and count the concordance. However, this runs into trouble if part of the region did not align to the query. So I'll now go back to our original idea: align the seed to the coordinate reference region and use that alignment to find the seed start and end positions for each region. This can be done in either nucleotide space or amino acid space. Amino acid space may give a better alignment, but we have to be careful about the skipped position and align all three possible reading frames of the seed.
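
To illustrate the nucleotide-space version of this idea, here is a minimal sketch of reading seed start and end positions off a seed-to-reference alignment. It assumes the alignment is given as two equal-length gapped strings with `-` as the gap character; the function name `seed_span_for_region` is hypothetical and does not come from the project code.

```python
def seed_span_for_region(aligned_seed, aligned_ref, region_start, region_end):
    """Find the seed positions aligned to a reference region.

    Assumption: both inputs are gapped strings of equal length, '-' is a gap.
    region_start and region_end are 0-based, end-exclusive reference positions.
    Returns (seed_start, seed_end), 0-based and end-exclusive, or None if no
    seed base aligns to the region.
    """
    seed_pos = 0   # position in the ungapped seed
    ref_pos = 0    # position in the ungapped reference
    seed_start = None
    seed_end = None
    for seed_base, ref_base in zip(aligned_seed, aligned_ref):
        in_region = region_start <= ref_pos < region_end
        if seed_base != '-' and ref_base != '-' and in_region:
            if seed_start is None:
                seed_start = seed_pos
            seed_end = seed_pos + 1
        if seed_base != '-':
            seed_pos += 1
        if ref_base != '-':
            ref_pos += 1
    if seed_start is None:
        return None
    return seed_start, seed_end


print(seed_span_for_region("GGACGT-AC", "--ACGTTAC", 1, 4))
print(seed_span_for_region("GGACGT-AC", "--ACGTTAC", 4, 5))
```

The second call returns `None` because the only reference position in that region falls in a gap of the seed, which is the kind of unaligned stretch that caused trouble in the first approach.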

@CBeelen CBeelen mentioned this issue Mar 30, 2022
3 tasks
@CBeelen CBeelen mentioned this issue Apr 8, 2022
@CBeelen CBeelen closed this as completed Apr 19, 2022